lifeiteng / vall-e

PyTorch implementation of VALL-E (Zero-Shot Text-To-Speech). Reproduced demo: https://lifeiteng.github.io/valle/index.html

Home Page: https://lifeiteng.github.io/valle/index.html

License: Apache License 2.0

Languages: Shell 4.14% Python 95.61% Dockerfile 0.25%
Topics: in-context-learning large-language-models text-to-speech tts chatgpt vall-e valle

vall-e's Introduction

Language : 🇺🇸 | 🇨🇳

An unofficial PyTorch implementation of VALL-E (Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers).

We can train the VALL-E model on one GPU.

[model figure]

Demo

Buy Me A Coffee

Broader impacts

Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.

To avoid abuse, well-trained models and services will not be provided.

Install Deps

To get up and running quickly, just follow the steps below:

# PyTorch
pip install torch==1.13.1 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install torchmetrics==0.11.1
# fbank
pip install librosa==0.8.1

# phonemizer pypinyin
apt-get install espeak-ng
## OSX: brew install espeak
pip install phonemizer==3.2.1 pypinyin==0.48.0

# lhotse update to newest version
# https://github.com/lhotse-speech/lhotse/pull/956
# https://github.com/lhotse-speech/lhotse/pull/960
pip uninstall lhotse
pip uninstall lhotse
pip install git+https://github.com/lhotse-speech/lhotse

# k2
# find the right version in https://huggingface.co/csukuangfj/k2
pip install https://huggingface.co/csukuangfj/k2/resolve/main/cuda/k2-1.23.4.dev20230224+cuda11.6.torch1.13.1-cp310-cp310-linux_x86_64.whl

# icefall
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r requirements.txt
export PYTHONPATH=`pwd`/../icefall:$PYTHONPATH
echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.zshrc
echo "export PYTHONPATH=`pwd`/../icefall:\$PYTHONPATH" >> ~/.bashrc
cd -
source ~/.zshrc

# valle
git clone https://github.com/lifeiteng/valle.git
cd valle
pip install -e .
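After installation, an optional sanity check like the sketch below (not part of the repository) can confirm that the main dependencies import and that CUDA is visible before you start training:

# Optional post-install sanity check (illustrative only, not part of the repo):
# confirm the core dependencies import and that a GPU is visible.
import torch, torchaudio, torchmetrics, phonemizer, lhotse, k2
import icefall  # needs PYTHONPATH to point at the icefall checkout

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchaudio", torchaudio.__version__)
print("lhotse", lhotse.__version__)
print("k2 and icefall imported OK")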

Training & Inference

  • Prefix Modes 0, 1, 2, and 4 for the NAR Decoder

    Paper Chapter 5.1 "The average length of the waveform in LibriLight is 60 seconds. During training, we randomly crop the waveform to a random length between 10 seconds and 20 seconds. For the NAR acoustic prompt tokens, we select a random segment waveform of 3 seconds from the same utterance."
    • 0: no acoustic prompt tokens
    • 1: a random prefix of the current batched utterances (recommended; see the sketch after this list)
    • 2: a random segment of the current batched utterances
    • 4: same as the paper (since the long waveforms are randomly cropped into multiple utterances, "the same utterance" here means a preceding or following utterance from the same long waveform)
      # If train NAR Decoders with prefix_mode 4
      python3 bin/trainer.py --prefix_mode 4 --dataset libritts --input-strategy PromptedPrecomputedFeatures ...
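For illustration, here is a minimal sketch of what prefix mode 1 amounts to. The names (split_random_prefix, codes) are hypothetical and not the repository's API: a random prefix of the current utterance's codec codes is used as the NAR acoustic prompt, and the remainder is the prediction target.

# Hypothetical sketch of prefix mode 1 (illustrative names, not the repo's API):
# take a random prefix of the utterance's EnCodec codes as the NAR acoustic
# prompt and treat the rest as the target.
import torch

def split_random_prefix(codes: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # codes: (T, num_quantizers) EnCodec tokens of one utterance
    T = codes.shape[0]
    prefix_len = int(torch.randint(1, max(T // 2, 2), (1,)))
    return codes[:prefix_len], codes[prefix_len:]

prompt, target = split_random_prefix(torch.randint(0, 1024, (300, 8)))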
      

LibriTTS demo, trained on one GPU with 24 GB of memory

cd examples/libritts

# step1 prepare dataset
bash prepare.sh --stage -1 --stop-stage 3

# step2 train the model on one GPU with 24GB memory
exp_dir=exp/valle

## Train AR model
python3 bin/trainer.py --max-duration 80 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 1 \
      --num-buckets 6 --dtype "bfloat16" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 20 --start-epoch 1 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir}

## Train NAR model
cp ${exp_dir}/best-valid-loss.pt ${exp_dir}/epoch-2.pt  # --start-epoch 3=2+1
python3 bin/trainer.py --max-duration 40 --filter-min-duration 0.5 --filter-max-duration 14 --train-stage 2 \
      --num-buckets 6 --dtype "float32" --save-every-n 10000 --valid-interval 20000 \
      --model-name valle --share-embedding true --norm-first true --add-prenet false \
      --decoder-dim 1024 --nhead 16 --num-decoder-layers 12 --prefix-mode 1 \
      --base-lr 0.05 --warmup-steps 200 --average-period 0 \
      --num-epochs 40 --start-epoch 3 --start-batch 0 --accumulate-grad-steps 4 \
      --exp-dir ${exp_dir}

# step3 inference
python3 bin/infer.py --output-dir infer/demos \
    --checkpoint=${exp_dir}/best-valid-loss.pt \
    --text-prompts "KNOT one point one five miles per hour." \
    --audio-prompts ./prompts/8463_294825_000043_000000.wav \
    --text "To get up and running quickly just follow the steps below." \

# Demo Inference
https://github.com/lifeiteng/lifeiteng.github.com/blob/main/valle/run.sh#L68
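To synthesize several sentences in a row, a small wrapper around bin/infer.py can help. This is only a sketch that reuses the flags shown above; the checkpoint path, prompts, and output directories are illustrative.

# Sketch: batch several inference calls (flags as in the command above; paths are illustrative).
import subprocess

sentences = [
    "To get up and running quickly just follow the steps below.",
    "The quick brown fox jumps over the lazy dog.",
]
for i, text in enumerate(sentences):
    subprocess.run(
        [
            "python3", "bin/infer.py",
            "--output-dir", f"infer/demos_{i}",
            "--checkpoint", "exp/valle/best-valid-loss.pt",
            "--text-prompts", "KNOT one point one five miles per hour.",
            "--audio-prompts", "./prompts/8463_294825_000043_000000.wav",
            "--text", text,
        ],
        check=True,  # stop on the first failing run
    )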

[training screenshot]

Troubleshooting

  • SummaryWriter segmentation fault (core dumped)
    file=`python  -c 'import site; print(f"{site.getsitepackages()[0]}/tensorboard/summary/writer/event_file_writer.py")'`
    sed -i 's/import tf/import tensorflow_stub as tf/g' $file
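    If sed is unavailable, the same substitution can be applied from Python. This is a sketch that mirrors the sed command above and patches the file in place:

    # Mirrors the sed one-liner above: make tensorboard's event_file_writer.py
    # import the tensorflow stub instead of the real tensorflow package.
    import site
    from pathlib import Path

    path = Path(site.getsitepackages()[0]) / "tensorboard/summary/writer/event_file_writer.py"
    src = path.read_text()
    if "tensorflow_stub" not in src:  # avoid patching the file twice
        path.write_text(src.replace("import tf", "import tensorflow_stub as tf"))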
    

Training on a custom dataset?

  • prepare the dataset as lhotse manifests (see the sketch below this list)
  • python3 bin/tokenizer.py ...
  • python3 bin/trainer.py ...
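A minimal sketch of the first step, assuming a folder of wav files with matching .txt transcripts. The paths, IDs, and output file names are illustrative, not the repository's conventions:

# Sketch: build lhotse recording/supervision manifests for a custom wav+txt dataset.
from pathlib import Path
from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

recordings, supervisions = [], []
for wav in sorted(Path("my_dataset").glob("*.wav")):
    rec = Recording.from_file(wav)                       # reads duration and sampling rate
    text = wav.with_suffix(".txt").read_text().strip()   # transcript stored next to the audio
    recordings.append(rec)
    supervisions.append(
        SupervisionSegment(
            id=rec.id, recording_id=rec.id, start=0.0,
            duration=rec.duration, text=text, language="English",
        )
    )

Path("data/manifests").mkdir(parents=True, exist_ok=True)
RecordingSet.from_recordings(recordings).to_file("data/manifests/my_recordings.jsonl.gz")
SupervisionSet.from_segments(supervisions).to_file("data/manifests/my_supervisions.jsonl.gz")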

Contributing

  • Parallelize bin/tokenizer.py across multiple GPUs
  • Buy Me A Coffee

Citing

To cite this repository:

@misc{valle,
  author={Feiteng Li},
  title={VALL-E: A neural codec language model},
  year={2023},
  url={http://github.com/lifeiteng/vall-e}
}
@article{VALL-E,
  title     = {Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers},
  author    = {Chengyi Wang and Sanyuan Chen and Yu Wu and
               Ziqiang Zhang and Long Zhou and Shujie Liu and
               Zhuo Chen and Yanqing Liu and Huaming Wang and
               Jinyu Li and Lei He and Sheng Zhao and Furu Wei},
  year      = {2023},
  eprint    = {2301.02111},
  archivePrefix = {arXiv},
  volume    = {abs/2301.02111},
  url       = {http://arxiv.org/abs/2301.02111},
}

Star History

Star History Chart

vall-e's People

Contributors

chenht2021, chenjiasheng, guoaoo, harryhe11, ifsheldon, jiazj-jiazj, junzhan2000, lifeiteng, runtimeracer, zhaomingwork


vall-e's Issues

The relationship between AR model and the NAR model

Hi! Thank you for your code! But I am confused: in the paper, the input to the NAR model is the discrete tokens, while in this repo the input to the NAR model is the hidden states of the AR model. Besides, gradients will flow back from the NAR model to the AR model. Have I misunderstood something?

[Q]: How long does the "bash run.sh --stage -1 --stop-stage 3" take to finish?

It's been running for a good 4 hours on an RTX 4090 under WSL2.

Currently it's at:

Extracting and storing features:  77%|████████████████████████████████████████████████████████████████▎                   | 157105/205044 [1:11:35<18:37, 42.88it/s]

Infer: Segmentation fault

python3 bin/infer.py \
    --decoder-dim 128 --nhead 4 --num-decoder-layers 4 --model-name valle \
    --text-prompts "Go to her." \
    --audio-prompts ./prompts/61_70970_000007_000001.wav \
    --text "To get up and running quickly just follow the steps below." \
    --output-dir infer/demo_valle_epoch20 \
    --checkpoint exp/valle_nano_v2/epoch-20.pt
I got a segmentation fault.

Is there any way this error can be fixed?

ImportError: cannot import name 'environmentfilter' from 'jinja2' (/usr/local/lib/python3.9/dist-packages/jinja2/__init__.py)
I got this while trying to set it up.
(I did the build and install command and yet still got the same error)

Words mismatch when preparing the LJSpeech dataset

Hi, I got this warning when preparing the LJSpeech dataset:

2023-03-07 11:27:29,886 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
...
2023-03-07 11:27:29,964 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-03-07 11:27:29,964 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-03-07 11:27:29,965 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-03-07 11:27:29,966 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
100%|█████████████████████████████████████| 200/200 [00:00<00:00, 2466.56it/s]
2023-03-07 11:27:29,999 INFO [tokenizer.py:190] unique phonemes: {'t', 's', 'v', 'ʒ', 'l', 'ɐ', ':', 'o', '?', '"', 'ʊ', 'ʌ', 'ʔ', 'r', 'ɹ', ';', '̩', 'ɚ', '!', 'j', 'ɪ', 'ŋ', 'b', ',', 'f', 'z', 'k', 'ɾ', 'ə', 'ʃ', 'ᵻ', 'ː', 'θ', 'x', 'u', 'e', 'm', 'ð', 'a', 'ɑ', 'i', 'p', 'ɛ', '.', 'w', 'ɜ', 'ɡ', 'h', 'd', 'æ', '_', 'ɔ', 'n'}

How can I fix that? If I go ahead with training, will this cause problems later? And what is the minimum VRAM needed for training on LJSpeech?

Error during LJSpeech dataset preparation

I want to train with the LJSpeech dataset and I get an error like this during preparation:

bash run.sh --stage -1 --stop-stage 3     --audio_extractor "Fbank"     --audio_feats_dir data/fbank
2023-03-06 13:20:42 (run.sh:45:main) dl_dir: /media/7A76476176471D6F/VALL-E/valle/egs/ljspeech/download
2023-03-06 13:20:42 (run.sh:48:main) Stage 0: Download data
2023-03-06 13:20:42 (run.sh:61:main) Stage 1: Prepare LJSpeech manifest
2023-03-06 13:20:42 (run.sh:73:main) Stage 2: Split LJSpeech
2023-03-06 13:20:42 (run.sh:94:main) Stage 3: Fbank LJSpeech
2023-03-06 13:21:15,837 INFO [tokenizer.py:140] Processing partition: train CUDA: True
Extracting and storing features: 100%|██| 12500/12500 [10:26<00:00, 19.95it/s]
  0%|                                               | 0/12500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bin/tokenizer.py", line 201, in <module>
    main()
  File "bin/tokenizer.py", line 173, in main
    text = c.supervisions[0].custom["normalized_text"]
TypeError: 'NoneType' object is not subscriptable

How do I solve this error?

Very large CUDA memory consumption in rank:0

Hi, great work! Thanks for sharing!

I have managed to run your preprocessing code, and I have successfully run the training code on a single card.

But when I run training on multiple 3090 cards, I find that the CUDA memory consumption on rank 0 is very large when executing train_one_epoch, and after a few minutes it fails with OOM. I also found that even after training exits, there are still some processes running on rank 0.

Could you give me some advice on solving this problem? Thank you!
[screenshot]

Error: Prepare LibriTTS train/dev/test

Stage 3: Prepare LibriTTS train/dev/test
Traceback (most recent call last):
  File "/opt/conda/envs/tts/bin/lhotse", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/bin/modes/manipulation.py", line 251, in combine
    data_set = combine_manifests(*[load_manifest_lazy_or_eager(m) for m in manifests])
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/manipulation.py", line 30, in combine
    return reduce(add, manifests)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/lazy.py", line 123, in __add__
    return cls(LazyIteratorChain(self.data, other.data))
AttributeError: 'NoneType' object has no attribute 'data'

What do the results sound like?

Hi! I was curious whether anyone has results from this implementation; I currently can't install it or test it out.

Thanks so much!

Infer not working

If I run this:

    python3 bin/infer.py \
        --decoder-dim 128 --nhead 4 --num-decoder-layers 4 --model-name valle \
        --text-prompts "Go to her." \
        --audio-prompts ./prompts/61_70970_000007_000001.wav \
        --output-dir infer/demo_valle_epoch20 \
        --checkpoint exp/valle_nano_v2/epoch-20.pt

I get this error:

  File "/valle/valle/egs/libritts/bin/infer.py", line 170, in <module>
    main()
  File "/opt/miniconda3/envs/my_numba_env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/valle/valle/egs/libritts/bin/infer.py", line 119, in main
    text_prompts, text_prompts_lens = text_collater(
  File "/valle/valle/valle/data/collation.py", line 101, in __call__
    [[self.token2idx[token] for token in seq] for seq in seqs],
  File "/valle/valle/valle/data/collation.py", line 101, in <listcomp>
    [[self.token2idx[token] for token in seq] for seq in seqs],
  File "/valle/valle/valle/data/collation.py", line 101, in <listcomp>
    [[self.token2idx[token] for token in seq] for seq in seqs],
KeyError: 'ɡ'

Discrete code vs mel-spectrogram

I have a question about this topic.
In the paper, they use discrete codes instead of mel-spectrograms in the end.
What is the benefit of this choice?
What do you think about the difference between these two representations?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument weight in method wrapper___slow_conv2d_forward)

I get this error after trying:

python3 bin/infer.py \
    --decoder-dim 128 --nhead 4 --num-decoder-layers 4 --model-name valle \
    --text-prompts "Go to her." \
    --audio-prompts ./prompts/61_70970_000007_000001.wav \
    --text "To get up and running quickly just follow the steps below." \
    --output-dir infer/demo_valle_PostNorm_epoch10 \
    --checkpoint exp/valle_nano_v41_PostNorm/epoch-10.pt

I tried poking around infer.py and found:

device = torch.device("cpu")

if torch.cuda.is_available():
    device = torch.device("cuda", 0)

model = get_model(args)
if args.checkpoint.is_file():
    checkpoint = torch.load(args.checkpoint, map_location=device)
    missing_keys, unexpected_keys = model.load_state_dict(checkpoint["model"], strict=True)
    assert not missing_keys
    # from icefall.checkpoint import save_checkpoint
    # save_checkpoint(f"{args.checkpoint}", model=model)

model.to(device)
model.eval()

but I don't see anything wrong with this...

Got silent audio after inference

I followed the steps in the installation section of the README and trained a VALL-E model with this config:

python3 bin/trainer.py \
    --decoder-dim 128 --nhead 4 --num-decoder-layers 4 \
    --max-duration 40 --model-name valle \
    --exp-dir exp/valle_nano_full

Then I tried to synthesize audio with the inference code given, but got an audio file with nothing in it. The command line looks like this:
[screenshot]
What could I do to solve this problem? Thanks!

Time cost and device cost

I want to ask about the data, the time cost, and the device cost.
How many hours of LibriSpeech data did you use?
How long did training take, given that your model is about 100x smaller than the paper's config?
When you trained on LibriSpeech data, could you share the device cost (memory, batch size, and so on)?

lhotse issue

I get the error in the Traceback below when using the current version of lhotse (from pip install git+https://github.com/lhotse-speech/lhotse) while running bash run.sh --stage -1 --stop-stage 3

Scanning transcript files (progbar per speaker): 72it [00:01, 61.14it/s]
Scanning transcript files (progbar per speaker): 79it [00:01, 44.47it/s]
Preparing LibriTTS parts: 14% 1/7 [00:11<01:09, 11.67s/it]
Traceback (most recent call last):
  File "/usr/local/bin/lhotse", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/lhotse/bin/modes/recipes/libritts.py", line 49, in libritts
    prepare_libritts(
  File "/usr/local/lib/python3.9/dist-packages/lhotse/recipes/libritts.py", line 198, in prepare_libritts
    duration=recordings[rec_id].duration,
  File "/usr/local/lib/python3.9/dist-packages/lhotse/audio.py", line 1201, in __getitem__
    return self.recordings[recording_id_or_index]
KeyError: '8288_274162_000003_000002'

Prompt concatenation

I'm just wondering what you are doing during training and inference.

During training, you are using the same audio (including that audio's text) as both the text and the prompt, right?

I mean, let's assume you have the data below:

Audio A | speaker x | chunk1 | some text here 1 |
Audio A | speaker x | chunk2 | some text here 2 |
Audio A | speaker x | chunk3 | some text here 3|
Audio B | speaker x | chunk1 | some text here 4|
Audio B | speaker x | chunk2 | some text here 5|

During training, you are passing the following (for one example in the batch):

the input text

some text here 3 some text here 2 

the input speech

chunk3 + gap + chunk2

and you only mask the acoustic content of chunk2 to be predicted, while chunk3 is the acoustic input prompt and the concatenation of text 3 and text 2 is the text prompt.

Is that what you are doing, or am I wrong?

Infer strange result

I get a strange result:

0.webm

With this call:

python3 bin/infer.py \
    --decoder-dim 128 --nhead 4 --num-decoder-layers 4 --model-name valle \
    --text-prompts "Go to her." \
    --audio-prompts ./prompts/61_70970_000007_000001.wav \
    --text "To get up and running quickly just follow the steps below. Hello world." \
    --output-dir infer/demo_valle_epoch20 \
    --checkpoint exp/valle_nano_v2/epoch-20.pt

Inference

The results of inference are not the same across runs, even with the same config!

Unexpected interruption during model inference

Hi, when I run the inference code, I get an interruption here because this condition is true: samples[0, 0] == NUM_AUDIO_TOKENS, that is, the 1024+1-th token is predicted.
Why is ar_predict_layer set to predict 1024+1 values instead of 1024 values?
Also, my model was trained on only 20 hours of data, just for testing. Is it because my model has not learned well enough?
[screenshot]

[screenshot]
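For reference, the stopping rule described above corresponds to an extra end-of-sequence class in the AR prediction head: index NUM_AUDIO_TOKENS (1024) has no EnCodec codebook entry and only signals "stop decoding". A minimal illustrative sketch, not the repository's code:

# Illustrative only: why the AR head predicts NUM_AUDIO_TOKENS + 1 classes.
import torch

NUM_AUDIO_TOKENS = 1024  # EnCodec codebook size; index 1024 is the extra EOS class

# Dummy stand-ins for the AR decoder state and prediction head.
ar_predict_layer = torch.nn.Linear(512, NUM_AUDIO_TOKENS + 1)
decoder_state = torch.randn(1, 512)

generated = []
for _ in range(100):                               # cap the decoding length
    logits = ar_predict_layer(decoder_state)       # shape (1, NUM_AUDIO_TOKENS + 1)
    sample = logits.argmax(dim=-1)
    if sample.item() == NUM_AUDIO_TOKENS:          # EOS class sampled -> stop decoding
        break
    generated.append(sample.item())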

Prepare Dataset not working

Checked out the repo, followed the installation instructions.
Ran the command:
bash run.sh --stage -1 --stop-stage 3

Stage 3: Prepare LibriTTS train/dev/test
Traceback (most recent call last):
  File "/opt/conda/envs/tts/bin/lhotse", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/bin/modes/manipulation.py", line 251, in combine
    data_set = combine_manifests(*[load_manifest_lazy_or_eager(m) for m in manifests])
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/manipulation.py", line 30, in combine
    return reduce(add, manifests)
  File "/opt/conda/envs/tts/lib/python3.10/site-packages/lhotse/lazy.py", line 123, in __add__
    return cls(LazyIteratorChain(self.data, other.data))
AttributeError: 'NoneType' object has no attribute 'data'

tokenizer error

(prepare.sh:67:main) Stage 2: Tokenize LibriTTS
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
Traceback (most recent call last):
  File "/valle/valle/egs/libritts/bin/tokenizer.py", line 172, in <module>
    main()
  File "/valle/valle/egs/libritts/bin/tokenizer.py", line 108, in main
    assert len(manifests) == len(dataset_parts)
AssertionError

Error: Invalid value for '[MANIFESTS]...': File 'data/tokenized/libritts_cuts_train-clean-100.jsonl.gz' does not exist.

The cuts seem to be missing, but why?

Sanity check with VALL-F after 17 epochs

Training Status

The training is still not converging. I have added some metrics to monitor training in #31.

This is a freshly built code repository, so it may take a little while to get it working.

I will try to solve the problem of convergence on the weekend.

Problems in tokenizing LibriTTS

Thanks for your reproduction of the VALL-E paper! When I tried to prepare the LibriTTS data with prepare.sh, I encountered this problem:
[screenshot]
I'm only using 4 tar files (train-clean-100, train-clean-360, test-clean, and dev-clean) out of the 7 in LibriTTS. Could you give me some suggestions about what's going on? Thanks!

After 100 epochs of training, the model can synthesize natural speech on LibriTTS

I trained VALL-E on LibriTTS for about 100 epochs (it took almost 4 days on 8 A100 GPUs) and obtained plausible synthesized audio.

Here is a demo.
[1]
prompt : prompt_link
synthesized audio : synt_link

[2]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link

[3]
prompt : prompt_link
synthesized audio : synt_link

[4]
prompt : prompt_link
ground truth : gt_link
synthesized audio : synt_link

The model I trained has worse quality than the original VALL-E because of the amount of data. However, it has promising quality on clean audio.
I'm not sure whether I can share my pre-trained LibriTTS model; if I can, I would like to share it.

A solution to the bug of AR and NAR training separately on DDP

Modify this part of train.py
[screenshot]
to:

    if params.train_stage:
        if world_size > 1:
            stage_named_parameters = model.module.stage_named_parameters(params.train_stage)
            model_parameters = model.module.stage_parameters(params.train_stage)
        else:
            stage_named_parameters = model.stage_named_parameters(params.train_stage)
            model_parameters = model.stage_parameters(params.train_stage)
    else:
        model_parameters = model.parameters()


    if params.optimizer_name == "ScaledAdam":
        parameters_names = []
        if params.train_stage:  # != 0
            parameters_names.append(
                [
                    name_param_pair[0]
                    for name_param_pair in stage_named_parameters
                ]
            )

Training errors

I encounter this error while attempting to train, any ideas?
It looks like the torch.nn.Transformer forward function is returning a Tensor, yet the code is expecting a Tuple.

File "bin/trainer.py", line 1046, in
main()
File "bin/trainer.py", line 1039, in main
run(rank=0, world_size=1, args=args)
File "bin/trainer.py", line 902, in run
scan_pessimistic_batches_for_oom(
File "bin/trainer.py", line 1004, in scan_pessimistic_batches_for_oom
_, loss, _ = compute_loss(
File "bin/trainer.py", line 451, in compute_loss
predicts, loss, metrics = model(
File "/home/zhangjb2/ml/audio/virtual_valle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhangjb2/ml/audio/vall-e/valle/models/valle.py", line 575, in forward
xy_dec, _ = self.ar_decoder(
File "/home/zhangjb2/ml/audio/virtual_valle/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhangjb2/ml/audio/virtual_valle/lib/python3.8/site-packages/torch/nn/modules/transformer.py", line 219, in forward
target_type=src.dtype
AttributeError: 'tuple' object has no attribute 'dtype'

[prepare data] phonemizer: words count mismatch

When I run stage 2 with LibriTTS, the following warning appears. Is this normal? If not, how can I fix it?

2023-02-23 00:24:36,998 WARNING [words_mismatch.py:88] words count mismatch on 200.0% of the lines (2/1)
2023-02-23 00:24:37,002 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-02-23 00:24:37,003 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)
2023-02-23 00:24:37,005 WARNING [words_mismatch.py:88] words count mismatch on 100.0% of the lines (1/1)

image

grad_scale is too small, exiting

Hi!

I use the recent v0.4.0 version to train on the LibriTTS dataset. First I ran:

bash run.sh --stage 4 --stop-stage 4 \
    --num-decoder-layers 8 \
    --decoder_dim 1024 \
    --nhead 8 \
    --max-duration 24 --use-fp16 true \
    --GPUID 1,2,3,4 --world_size 4

where GPUID and world_size are used for the CUDA_VISIBLE_DEVICES and world-size options.

The program could run for several thousand batches, but then reported "grad_scale is too small, exiting".

I tried commenting out this RuntimeError in trainer.py, but it seems the loss still becomes NaN after some batches.

2023-03-06 12:41:16,652 WARNING [trainer.py:680] (2/4) Grad scale is small: 0.0
2023-03-06 12:41:16,652 WARNING [trainer.py:680] (0/4) Grad scale is small: 0.0
2023-03-06 12:41:16,653 INFO [trainer.py:694] (2/4) Epoch 1, batch 8500, train_loss[loss=nan, ArTop10Accuracy=0, NarTop10Accuracy=0.007109, over 1266.00 frames. ], tot_loss[loss=nan, ArTop10Accuracy=0.0009642, NarTop10Accuracy=0.006076, over 265830.71 frames. ], batch size: 2, lr: 3.56e-02, grad_scale: 0.0
2023-03-06 12:41:16,653 INFO [trainer.py:694] (0/4) Epoch 1, batch 8500, train_loss[loss=nan, ArTop10Accuracy=0.001555, NarTop10Accuracy=0.003882, over 1288.00 frames. ], tot_loss[loss=nan, ArTop10Accuracy=0.0009282, NarTop10Accuracy=0.005928, over 265602.21 frames. ], batch size: 2, lr: 3.56e-02, grad_scale: 0.0
2023-03-06 12:41:16,653 WARNING [trainer.py:680] (1/4) Grad scale is small: 0.0
2023-03-06 12:41:16,653 WARNING [trainer.py:680] (3/4) Grad scale is small: 0.0
2023-03-06 12:41:16,653 INFO [trainer.py:694] (1/4) Epoch 1, batch 8500, train_loss[loss=nan, ArTop10Accuracy=0, NarTop10Accuracy=0.002804, over 1070.00 frames. ], tot_loss[loss=nan, ArTop10Accuracy=0.001054, NarTop10Accuracy=0.00589, over 261683.73 frames. ], batch size: 8, lr: 3.56e-02, grad_scale: 0.0
2023-03-06 12:41:16,653 INFO [trainer.py:694] (3/4) Epoch 1, batch 8500, train_loss[loss=nan, ArTop10Accuracy=0.004217, NarTop10Accuracy=0.009025, over 1662.00 frames. ], tot_loss[loss=nan, ArTop10Accuracy=0.00114, NarTop10Accuracy=0.005829, over 263296.87 frames. ], batch size: 2, lr: 3.56e-02, grad_scale: 0.0

Any ideas on how to handle this problem? Thank you for your help.

AttributeError: 'CutSet' object has no attribute 'find'

Traceback (most recent call last):
  File "/workspaces/vall-e/egs/libritts/bin/tokenizer.py", line 204, in <module>
    main()
  File "/workspaces/vall-e/egs/libritts/bin/tokenizer.py", line 146, in main
    cut_set = CutSet.from_manifests(
  File "/root/miniconda3/envs/valle/lib/python3.10/site-packages/lhotse/cut/set.py", line 317, in from_manifests
    return create_cut_set_eager(
  File "/root/miniconda3/envs/valle/lib/python3.10/site-packages/lhotse/cut/set.py", line 2937, in create_cut_set_eager
    supervisions.find(
AttributeError: 'CutSet' object has no attribute 'find'

I'm getting this error when I try to run:
bash prepare.sh --stage -1 --stop-stage 3
