
voicecraft's Introduction

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Paper HuggingFace Colab Replicate YouTube demo Demo page

TL;DR

VoiceCraft is a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.

How to run inference

There are four ways (besides running Gradio in Colab):

  1. More flexible inference beyond the Gradio UI in Google Colab; see QuickStart Colab.
  2. With Docker; see QuickStart Docker.
  3. Without Docker; see Environment setup. You can also run Gradio locally if you choose this option.
  4. As a standalone script that you can easily integrate into other projects; see QuickStart Command Line.

Once you are inside the Docker image or have installed all dependencies, check out inference_tts.ipynb.
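For orientation, here is a rough Python sketch of what the first cells of inference_tts.ipynb do (a hedged sketch only; any path or checkpoint key not shown elsewhere in this README is an assumption, and the notebook remains the reference):

import torch
from models import voicecraft                            # model definitions from this repo
from data.tokenizer import AudioTokenizer, TextTokenizer

voicecraft_name = "giga830M.pth"                          # or giga330M.pth / the enhanced TTS checkpoints
ckpt_fn = f"./pretrained_models/{voicecraft_name}"
encodec_fn = "./pretrained_models/encodec_4cb2048_giga.th"

ckpt = torch.load(ckpt_fn, map_location="cpu")            # assumption: one file holds weights and the phoneme map
# ...build the VoiceCraft model from the checkpoint exactly as the notebook does, then call model.eval()
phn2num = ckpt["phn2num"]                                 # phoneme-to-token mapping saved with the checkpoint
text_tokenizer = TextTokenizer(backend="espeak")          # requires espeak-ng to be installed on the system
audio_tokenizer = AudioTokenizer(signature=encodec_fn)    # also loads the neural codec model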

If you want to do model development such as training or finetuning, I recommend following Environment setup and Training.

News

⭐ 04/22/2024: 330M/830M TTS Enhanced Models are up here, load them through gradio_app.py or inference_tts.ipynb! Replicate demo is up, major thanks to @chenxwh!

⭐ 04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces here! Major thanks to @zuev-stepan, @Sewlell, @pgsoar @Ph0rk0z.

⭐ 04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight. Weights are here. Make sure maximal prompt + generation length <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data). Even stronger models are forthcoming, stay tuned!

⭐ 03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 here!

TODO

  • Codebase upload
  • Environment setup
  • Inference demo for speech editing and TTS
  • Training guidance
  • RealEdit dataset and training manifest
  • Model weights
  • Better guidance on training/finetuning
  • Colab notebooks
  • HuggingFace Spaces demo
  • Command line
  • Improve efficiency

QuickStart Colab

⭐ To try out speech editing or TTS Inference with VoiceCraft, the simplest way is using Google Colab. Instructions to run are on the Colab itself.

  1. To try Speech Editing
  2. To try TTS Inference

QuickStart Command Line

⭐ To use it as a standalone script, check out tts_demo.py and speech_editing_demo.py. Be sure to first set up your environment. Without arguments, they run the standard demo used as an example elsewhere in this repository. You can use the command line arguments to specify unique input audios, target transcripts, and inference hyperparameters. Run the help command for more information: python3 tts_demo.py -h

QuickStart Docker

⭐ To try out TTS inference with VoiceCraft, you can also use Docker. Thanks to @ubergarm and @jayc88 for making this happen.

Tested on Linux and Windows; it should work with any host that has Docker installed.

# 1. clone the repo into a directory on a drive with plenty of free space
git clone [email protected]:jasonppy/VoiceCraft.git
cd VoiceCraft

# 2. assumes you have docker installed with the nvidia container toolkit (windows has this built into the driver)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.13.5/install-guide.html
# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...

# 3. First build the docker image
docker build --tag "voicecraft" .

# 4. Try to start an existing container otherwise create a new one passing in all GPUs
./start-jupyter.sh  # linux
start-jupyter.bat   # windows

# 5. now open a webpage on the host box to the URL shown at the bottom of:
docker logs jupyter

# 6. optionally look inside from another terminal
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update

# 7. confirm video card(s) are visible inside container
nvidia-smi

# 8. Now in browser, open inference_tts.ipynb and work through one cell at a time
echo GOOD LUCK

Environment setup

conda create -n voicecraft python=3.9.16
conda activate voicecraft

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise checkout https://pytorch.org/get-started/previous-versions/#v201
apt-get install ffmpeg # if you don't already have ffmpeg installed
apt-get install espeak-ng # backend for the phonemizer installed below
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
# install MFA for getting forced-alignment, this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# install MFA english dictionary and model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# pip install huggingface_hub
# conda install pocl # the above gives a warning about installing pocl, not sure if it's really needed

# to run ipynb
conda install -n voicecraft ipykernel --no-deps --force-reinstall

If you encounter version issues when running things, check out environment.yml for exact version matching.
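If you would rather recreate the environment directly from that file, something like the following should work (assuming environment.yml sits at the repo root and defines the env name):

conda env create -f environment.yml
conda activate voicecraft   # check the name: field in environment.yml if activation fails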

Inference Examples

Check out inference_speech_editing.ipynb and inference_tts.ipynb.

Gradio

Run in colab

Open in Colab

Run locally

After environment setup, install additional dependencies:

apt-get install -y espeak espeak-data libespeak1 libespeak-dev
apt-get install -y festival*
apt-get install -y build-essential
apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
pip install -r gradio_requirements.txt

Run the Gradio server from the terminal or from gradio_app.ipynb:

python gradio_app.py

It will be ready to use at Gradio's default local URL (http://127.0.0.1:7860 unless configured otherwise).

How to use it

  1. (optionally) Select models
  2. Load models
  3. Transcribe
  4. (optionally) Tweak some parameters
  5. Run
  6. (optionally) Rerun part-by-part in Long TTS mode

Some features

Smart transcript: write only what you want to generate

TTS mode: Zero-shot TTS

Edit mode: Speech editing

Long TTS mode: Easy TTS on long texts

Training

To train a VoiceCraft model, you need to prepare the following:

  1. utterances and their transcripts
  2. the utterances encoded into codes using e.g. EnCodec
  3. the transcripts converted into phoneme sequences, plus a phoneme set (which we name vocab.txt)
  4. a manifest (i.e. metadata)

Steps 1, 2, and 3 are handled in ./data/phonemize_encodec_encode_hf.py, where

  1. Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)
  2. phoneme sequences and EnCodec codes are extracted by the script.

An example run:

conda activate voicecraft
export CUDA_VISIBLE_DEVICES=0
cd ./data
python phonemize_encodec_encode_hf.py \
--dataset_size xs \
--download_to path/to/store_huggingface_downloads \
--save_dir path/to/store_extracted_codes_and_phonemes \
--encodec_model_path path/to/encodec_model \
--mega_batch_size 120 \
--batch_size 32 \
--max_len 30000

where encodec_model_path is available here. This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, with 2048 codes per codebook. Details are described in our paper. If you encounter OOM during extraction, try decreasing batch_size and/or max_len. The extracted codes, phonemes, and vocab.txt will be stored at path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}.

As for manifest, please download train.txt and validation.txt from here, and put them under path/to/store_extracted_codes_and_phonemes/manifest/. Please also download vocab.txt from here if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).
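For concreteness, the file placement described above can be done with commands along these lines (the final location of the pretrained vocab.txt is an assumption; the intent is that it replaces the one produced by extraction so phoneme-to-token matching stays consistent):

mkdir -p path/to/store_extracted_codes_and_phonemes/manifest
mv train.txt validation.txt path/to/store_extracted_codes_and_phonemes/manifest/
# assumption: the downloaded vocab.txt replaces the extracted one
mv vocab.txt path/to/store_extracted_codes_and_phonemes/${dataset_size}/vocab.txt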

Now, you are good to start training!

conda activate voicecraft
cd ./z_scripts
bash e830M.sh

The same procedure applies to preparing your own custom dataset.

Finetuning

You also need to do steps 1-4 as in Training, and I recommend using AdamW for optimization when finetuning a pretrained model, for better stability. Check out the script ./z_scripts/e830M_ft.sh.

If your dataset introduces new phonemes (which is very likely) that don't exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust --text_vocab_size and --text_pad_token so that the former is greater than or equal to your vocab size, and the latter has the same value as --text_vocab_size (i.e. --text_pad_token is always the last token). Also, since the text embedding is now of a different size, make sure you modify the weights-loading part so that it won't crash (you could skip loading text_embedding, or only load the existing part and randomly initialize the new entries).
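As a loose illustration of the vocab merging described above (a sketch only; the assumed vocab.txt format of one "<index> <phoneme>" pair per line should be checked against the file produced by the extraction script):

orig_phns = [line.split()[-1] for line in open("pretrained_vocab.txt")]  # vocab.txt shipped with the giga checkpoint
new_phns = [line.split()[-1] for line in open("my_data_vocab.txt")]      # vocab.txt produced from your dataset
merged = orig_phns + [p for p in new_phns if p not in set(orig_phns)]    # keep the original indices unchanged
with open("vocab.txt", "w") as f:
    for i, p in enumerate(merged):
        f.write(f"{i} {p}\n")
print(len(merged))  # pick --text_vocab_size >= this, and set --text_pad_token equal to --text_vocab_size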

License

The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under the Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that is under different licenses: ./models/codebooks_patterns.py is under the MIT license; ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; and the phonemizer we use is under the GNU GPL v3.0 license.

Acknowledgement

We thank Feiteng for his VALL-E reproduction, and we thank the audiocraft team for open-sourcing EnCodec.

Citation

@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

voicecraft's People

Contributors

approximetal, chenxwh, datitran, derekjhunt, fakerybakery, ganeshkrishnan1, jasonppy, keeo, nielsrogge, pgosar, rmcc3, sewlell, ubergarm, wauplin, yoesak, zuev-stepan


voicecraft's Issues

Few questions about the paper [Encodec; inference speed; model parameters]

Hi @jasonppy , great work and samples, thanks for sharing the code!

The introduction of causal masking for TTS is an elegant approach to contextualization. Bravo!

I'm curious about a few aspects of your work at the moment:

  1. Did you train Encodec as well? To my knowledge its parameters are released too, but looking into your code, it seems that you trained it yourself. Now I wonder what the reason for this might be. A hypothesis: no released parameters for the 16 kHz sampling rate?
  2. When it comes to inference, you mention that you run it multiple times. Can you share the inference speed for, say, a 10-second utterance on the 820M model?
  3. Is there any estimate when model parameters will be released?

Have a good one!
Best, Taras

MFA alignment temp file

I am trying to clone my own voice, and when I use my own file, MFA outputs the default demo text ("but when I approaches....") instead of mine.

some Voice editing problem

I have noticed some testing and demo issues regarding voice editing.
I would like to ask about editing the last part of the text. For example, in https://youtu.be/PJ2qSjycLcw?t=353, after 5:50 there is a problem with synthesis quality at the end of the sentence; rather than masking the bad audio in two parts, I would prefer to edit just the end of the sentence. I found the same thing in your demo "this was george steers the son of a british naval captain and ship modeler who had become an american naval officer and was entrusted with the prestigious role of overseeing the operations at the renowned naval headquarters" when editing the end of the sentence: there are also strange pauses between the last few words at the end of the sentence.

seed - magic number

In the Jupyter inference the seed is set but never used. It looks to me like setting the seed makes no difference to the end result, or am I missing something?
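(For reference, a generic way to make a seed take effect if it is wired into the sampling code; this is not a helper from the repo itself:)

import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # seed every RNG that could influence sampling
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)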

Multi-language model

Hi, thank you for your excellent repo!
Do you have any plan to develop a multi-language model?

gradio port

I did not like having to mess with Jupyter and having to run Whisper separately, so I made a Gradio version. I will submit a pull request eventually; you can try it out here for now.
Note that the conda env is slightly different in my fork:
https://github.com/friendlyFriend4000/VoiceCraft


Finetuning on custom voice

Hi, thanks for your amazing work. Can't wait to try it out.
I am wondering if it's possible to finetune your pretrained model on a custom voice and, if so, if you can upload a notebook to follow.
I am reading the training section but I'm not sure I completely understood how I could finetune a custom voice. It would be nice.

Thank you again.

RuntimeError: espeak not installed on your system

Environment

  • macOS Sonoma 14.3.1
  • M1 Max 64GB

Issue Description
I am attempting to run the inference_tts.ipynb notebook on an Apple Silicon Mac. As part of adapting the code for Apple Silicon, I replaced CUDA references with MPS (or CPU where MPS isn't an option). However, I encountered a runtime error related to espeak not being recognized by the system despite being installed.

Steps taken

  1. Installed espeak via Homebrew using brew install espeak.
  2. Confirmed espeak installation with espeak --version, which outputs 'eSpeak NG text-to-speech: 1.51.1'
  3. Ran the inference_tts.ipynb notebook after making necessary modifications for MPS compatibility.
  4. Encountered the RuntimeError: espeak not installed on your system upon executing Cell 4.

Behavior
Despite espeak being installed (confirmed via command line), a RuntimeError is thrown indicating that espeak is not installed on the system.

Troubleshooting Steps Taken

  • Verified that espeak is accessible via the command line and shows the installed version.
  • Attempted to reinstall espeak through Homebrew.
  • Checked the system's PATH to ensure it includes the directory where espeak is installed.

Full output with error

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 31
     27 model.eval()
     29 phn2num = ckpt['phn2num']
---> 31 text_tokenizer = TextTokenizer(backend="espeak")
     32 audio_tokenizer = AudioTokenizer(signature=encodec_fn) # will also put the neural codec model on gpu
     34 # run the model to get the output

File ~/VoiceCraft_things/VoiceCraft-master/data/tokenizer.py:48, in TextTokenizer.__init__(self, language, backend, separator, preserve_punctuation, punctuation_marks, with_stress, tie, language_switch, words_mismatch)
     36 def __init__(
     37     self,
     38     language="en-us",
   (...)
     46     words_mismatch: WordMismatch = "ignore",
     47 ) -> None:
---> 48     phonemizer = EspeakBackend(
     49         language,
     50         punctuation_marks=punctuation_marks,
     51         preserve_punctuation=preserve_punctuation,
     52         with_stress=with_stress,
     53         tie=tie,
     54         language_switch=language_switch,
     55         words_mismatch=words_mismatch,
     56     )
...
     81 self._logger.info(
     82     'initializing backend %s-%s',
     83     'espeak', '.'.join(str(v) for v in self.version()))

RuntimeError: espeak not installed on your system


I would also really like to do it without e.g. Docker, because I want to try to use MPS for the Apple Silicon GPU.

Would appreciate any guidance on resolving this issue so that espeak is correctly recognized by the system and the notebook can run as intended.
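(One workaround that sometimes helps on macOS, assuming phonemizer honors the PHONEMIZER_ESPEAK_LIBRARY variable and assuming the usual Homebrew library path on Apple Silicon:)

# path is an assumption; locate yours with `brew list espeak` or `brew list espeak-ng`
export PHONEMIZER_ESPEAK_LIBRARY=/opt/homebrew/lib/libespeak-ng.dylib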

Usage instructions.

(really really impressed by the demo, so much further than the best SOTA model I found so far, congrats on the great work).

Running with docker/jupyter.

I followed the Docker/Jupyter instructions to the letter (I'm not at all familiar with Jupyter, but very familiar with Docker).

It went mostly well.

I keep running cells/advancing, again and again, until I get to the bottom.

And then nothing? What's supposed to happen? I don't see any new instructions, no new files, anything, I'm fairly lost.

Running as a script.

The jupyter stuff is great to get to know the project, but (unless I don't understand what jupyter is), it won't really help getting voicecraft integrated into my project / enable me to generate thousands of files / "call" voicecraft programmatically from my nodejs system.

In other words, is there something like:

python3 voicecraft/bin/inference.py --text="Read this text" --model_path="voicecraft/model/file.something" --voice_sample="/tmp/voices/robert.wav" --output="/tmp/sample_voicecraft_output.wav" --device=cpu

What's the equivalent for voicecraft, and how do I get to the point where it'll agree to run? (running inside docker is fine, or outside docker too, just need to get it to run).

I found main.py, and I think the options for the command line are in config.py, but I don't know which options I need and which I don't / I don't know how to use the script. I didn't find an example of how to use it, but I'll keep looking.

Intonation.

I might be getting a bit ahead of myself here since I don't have it running yet, but maybe you know: will intonation/style transfer through? Like if my voice sample has the person whispering, will the output be whispering? Same for shouting, crying, etc. That's really the big thing missing from my system, is there any way to get that to work with voicecraft, do you know?

Thanks a lot in advance!
Cheers.

Does the licence allow Youtube voice over?

Hello there, I was reading the license and I couldn't figure out whether I am allowed or not. My idea is to use the voice for doing voiceovers in YouTube videos. I could argue that the channel is not monetized and that the videos are not like audiobooks, which rely only on the speech.
I raised the issue because, if you don't go through with making the project completely open source (not saying you should), this clarification could be useful for the community.
Thanks for such an amazing project.

Request for a requirements.txt

Ran into several issues with imports failing after following the instructions in the readme. Installation would have been smoother with exact versions of the offending packages.

pytorch version clash

The instructions say to install torch==2.0.1.

But while installing

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft

a different version, pytorch-2.2.2, is forcibly installed over the initially installed version.

Then import torchaudio results in the error

 undefined symbol: _ZN2at4_ops15sum_dim_IntList4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbNS5_8optionalINS5_10ScalarTypeEEE
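(One workaround that has resolved similar clashes, assuming the pinned versions from the Environment setup section are the ones you want: re-install them after audiocraft pulls in its own torch.)

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install torch==2.0.1 torchaudio==2.0.2 --force-reinstall   # re-pin the README versions afterwards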

Error in loading your tuned EnCodec from Huggingface

Hi, @jasonppy, thanks for the model and open-sourcing the code to inspire ML Speech engineers to investigate it!

My question is about loading your pretrained EnCodec model that is stored in HF hub. I installed the env and main packages, downloaded checkpoint from https://huggingface.co/pyp1/VoiceCraft/tree/main and tried to use it. But got the following error:

### phonemization
# load tokenizer
# load the encodec model
from audiocraft.solvers import CompressionSolver
model = CompressionSolver.model_from_checkpoint("/home/jovyan/kda/VoiceCraft/exp/encodec_4cb2048_giga.th")
model = model.cuda()
model = model.eval()

Output:

MissingConfigException Traceback (most recent call last)
Cell In[9], line 5
1 ### phonemization
2 # load tokenizer
3 # load the encodec model
4 from audiocraft.solvers import CompressionSolver
----> 5 model = CompressionSolver.model_from_checkpoint("/home/jovyan/kda/VoiceCraft/exp/encodec_4cb2048_giga.th")
6 model = model.cuda()
7 model = model.eval()

File /home/user/conda/lib/python3.9/site-packages/audiocraft/solvers/compression.py:287, in CompressionSolver.model_from_checkpoint(checkpoint_path, device)
285 logger = logging.getLogger(name)
286 logger.info(f"Loading compression model from checkpoint: {checkpoint_path}")
--> 287 _checkpoint_path = checkpoint.resolve_checkpoint_path(checkpoint_path, use_fsdp=False)
288 assert _checkpoint_path is not None, f"Could not resolve compression model checkpoint path: {checkpoint_path}"
289 state = checkpoint.load_checkpoint(_checkpoint_path)

File /home/user/conda/lib/python3.9/site-packages/audiocraft/utils/checkpoint.py:68, in resolve_checkpoint_path(sig_or_path, name, use_fsdp)
56 def resolve_checkpoint_path(sig_or_path: tp.Union[Path, str], name: tp.Optional[str] = None,
57 use_fsdp: bool = False) -> tp.Optional[Path]:
58 """Resolve a given checkpoint path for a provided dora sig or path.
59
60 Args:
(...)
66 Path, optional: Resolved checkpoint path, if it exists.
67 """
---> 68 from audiocraft import train
69 xps_root = train.main.dora.dir / 'xps'
70 sig_or_path = str(sig_or_path)

File /home/user/conda/lib/python3.9/site-packages/audiocraft/train.py:131
126 logger.info("Changing tmpdir to %s", tmpdir)
127 os.environ['TMPDIR'] = str(tmpdir)
130 @hydra_main(config_path='../config', config_name='config', version_base='1.1')
--> 131 def main(cfg):
132 init_seed_and_system(cfg)
134 # Setup logging both to XP specific folder, and to stderr.

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:308, in hydra_main.._decorator(main)
307 def _decorator(main: MainFun):
--> 308 return HydraMain(main, config_name=config_name, config_path=config_path,
309 **kwargs)

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:161, in HydraMain.init(self, main, config_name, config_path, **kwargs)
158 self.full_config_path = self.full_config_path / config_path
160 self._initialized = False
--> 161 self._base_cfg = self._get_config()
162 self._config_groups = self._get_config_groups()
163 dora = self._get_dora()

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:281, in HydraMain._get_config(self, overrides)
275 """
276 Internal method, returns the config for the given override,
277 but without the dora.sig field filled.
278 """
279 with initialize_config_dir(str(self.full_config_path), job_name=self._job_name,
280 **self.hydra_kwargs):
--> 281 return self._get_config_noinit(overrides)

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:289, in HydraMain._get_config_noinit(self, overrides)
287 cfg = copy.deepcopy(cfg)
288 else:
--> 289 cfg = compose(self.config_name, overrides) # type: ignore
290 return cfg

File /home/user/conda/lib/python3.9/site-packages/hydra/compose.py:38, in compose(config_name, overrides, return_hydra_config, strict)
36 gh = GlobalHydra.instance()
37 assert gh.hydra is not None
---> 38 cfg = gh.hydra.compose_config(
39 config_name=config_name,
40 overrides=overrides,
41 run_mode=RunMode.RUN,
42 from_shell=False,
43 with_log_configuration=False,
44 )
45 assert isinstance(cfg, DictConfig)
47 if not return_hydra_config:

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/hydra.py:594, in Hydra.compose_config(self, config_name, overrides, run_mode, with_log_configuration, from_shell, validate_sweep_overrides)
576 def compose_config(
577 self,
578 config_name: Optional[str],
(...)
583 validate_sweep_overrides: bool = True,
584 ) -> DictConfig:
585 """
586 :param config_name:
587 :param overrides:
(...)
591 :return:
592 """
--> 594 cfg = self.config_loader.load_configuration(
595 config_name=config_name,
596 overrides=overrides,
597 run_mode=run_mode,
598 from_shell=from_shell,
599 validate_sweep_overrides=validate_sweep_overrides,
600 )
601 if with_log_configuration:
602 configure_log(cfg.hydra.hydra_logging, cfg.hydra.verbose)

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:142, in ConfigLoaderImpl.load_configuration(self, config_name, overrides, run_mode, from_shell, validate_sweep_overrides)
133 def load_configuration(
134 self,
135 config_name: Optional[str],
(...)
139 validate_sweep_overrides: bool = True,
140 ) -> DictConfig:
141 try:
--> 142 return self._load_configuration_impl(
143 config_name=config_name,
144 overrides=overrides,
145 run_mode=run_mode,
146 from_shell=from_shell,
147 validate_sweep_overrides=validate_sweep_overrides,
148 )
149 except OmegaConfBaseException as e:
150 raise ConfigCompositionException().with_traceback(sys.exc_info()[2]) from e

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:243, in ConfigLoaderImpl._load_configuration_impl(self, config_name, overrides, run_mode, from_shell, validate_sweep_overrides)
233 def _load_configuration_impl(
234 self,
235 config_name: Optional[str],
(...)
239 validate_sweep_overrides: bool = True,
240 ) -> DictConfig:
241 from hydra import version, version
--> 243 self.ensure_main_config_source_available()
244 parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
245 config_name, overrides
246 )
248 if validate_sweep_overrides:

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:129, in ConfigLoaderImpl.ensure_main_config_source_available(self)
123 else:
124 msg = (
125 "Primary config directory not found.\nCheck that the"
126 f" config directory '{source.path}' exists and readable"
127 )
--> 129 self._missing_config_error(
130 config_name=None, msg=msg, with_search_path=False
131 )

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:102, in ConfigLoaderImpl._missing_config_error(self, config_name, msg, with_search_path)
99 else:
100 return msg
--> 102 raise MissingConfigException(
103 missing_cfg_file=config_name, message=add_search_path()
104 )

MissingConfigException: Primary config directory not found.
Check that the config directory '/home/user/conda/lib/python3.9/site-packages/audiocraft/../config' exists and readable

So do you have any special config files for your EnCodec, or is this an error with the Audiocraft/Hydra package?

Installation on windows native

Hi,

This issue is about an installation solution for installing on Windows,

preferably, if possible at all

without using WSL / docker / conda

Just stock python & pip, maybe a venv, maybe some powershell but preferably pure batch install.

In reference to previous attempts

#28
#29

Is it possible to run this without docker?

I get this error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I blame docker.

Why would I need docker for this? What purpose does it serve?

License

Hi,
Thank you for releasing VoiceCraft! It's super cool and I'm really impressed by the quality. Do you have any plans to open source it by switching to an open source license? (Are the model weights based off of XTTS?)
Thanks!

Supported emotions

How many emotions are supported?

Can we get a list of these?

How do you trigger specific emotional audio? Maybe some prompt engineering?

Generating long speeches

Would there be a way to generate long speeches?

Because right now, it needs to be fed at least 3 seconds of speech each time you want to run inference on something new. And if the desired generation is too long, it hallucinates and ends up producing gibberish.

One way to solve this would be to generate the speech sentence by sentence. One issue with that is that it would still require those 3 seconds of base speech each time. The other is the consistency of the generated speech, as the intonation between sentences would be wildly off.

Does anyone have an idea?
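(A rough sketch of the sentence-by-sentence idea discussed above; run_tts is a hypothetical wrapper around the repo's inference code, not an existing function:)

import re
import torch

def long_tts(text, prompt_wav, run_tts):
    # split on sentence-ending punctuation; each call reuses the same ~3 s prompt
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces = [run_tts(prompt_wav, s) for s in sentences]  # each piece is assumed to be a waveform tensor
    return torch.cat(pieces, dim=-1)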

AttributeError: module 'torch' has no attribute 'compiler' and other various issue

System

Windows 11
NVIDIA MX130
i5-10210U
12GB RAM

Error Code

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], [line 12](vscode-notebook-cell:?execution_count=11&line=12)
      [8](vscode-notebook-cell:?execution_count=11&line=8) prompt_end_frame = int(cut_off_sec * info.sample_rate)
     [11](vscode-notebook-cell:?execution_count=11&line=11) # # load model, tokenizer, and other necessary files
---> [12](vscode-notebook-cell:?execution_count=11&line=12) from models import voicecraft
     [13](vscode-notebook-cell:?execution_count=11&line=13) voicecraft_name="giga830M.pth"
     [14](vscode-notebook-cell:?execution_count=11&line=14) ckpt_fn =f"[./pretrained_models/](https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/NAME/TTS/src/audiocraft/audiocraft/pretrained_models/){voicecraft_name}"

File [c:\Users\PEY3C\TTS\src\audiocraft\audiocraft\models\__init__.py:10](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:10)
      [6](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:6) """
      [7](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:7) Models for EnCodec, AudioGen, MusicGen, as well as the generic LMModel.
      [8](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:8) """
      [9](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:9) # flake8: noqa
---> [10](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:10) from . import builders, loaders
     [11](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:11) from .encodec import (
     [12](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:12)     CompressionModel, EncodecModel, DAC,
     [13](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:13)     HFEncodecModel, HFEncodecCompressionModel)
     [14](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/__init__.py:14) from .audiogen import AudioGen

File [c:\Users\NAME\TTS\src\audiocraft\audiocraft\models\builders.py:14](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:14)
      [7](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:7) """
      [8](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:8) All the functions to build the relevant models and modules
      [9](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:9) from the Hydra config.
     [10](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:10) """
     [12](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:12) import typing as tp
---> [14](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:14) import audiocraft
     [15](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:15) import omegaconf
     [16](file:///C:/Users/NAME/TTS/src/audiocraft/audiocraft/models/builders.py:16) import torch

File [c:\users\name\tts\src\audiocraft\audiocraft\__init__.py:24](file:///C:/users/pey3c/tts/src/audiocraft/audiocraft/__init__.py:24)
      [6](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:6) """
      [7](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:7) AudioCraft is a general framework for training audio generative models.
      [8](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:8) At the moment we provide the training code for:
   (...)
     [20](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:20)     improves the perceived quality and reduces the artifacts coming from adversarial decoders.
     [21](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:21) """
     [23](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:23) # flake8: noqa
---> [24](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:24) from . import data, modules, models
     [26](file:///C:/users/name/tts/src/audiocraft/audiocraft/__init__.py:26) __version__ = '1.0.0'

File [c:\users\name\tts\src\audiocraft\audiocraft\data\__init__.py:10](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/__init__.py:10)
      [6](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/__init__.py:6) """Audio loading and writing support. Datasets for raw audio
      [7](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/__init__.py:7) or also including some metadata."""
      [9](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/__init__.py:9) # flake8: noqa
---> [10](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/__init__.py:10) from . import audio, audio_dataset, info_audio_dataset, music_dataset, sound_dataset

File [c:\users\name\tts\src\audiocraft\audiocraft\data\info_audio_dataset.py:19](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:19)
     [17](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:17) from .audio_dataset import AudioDataset, AudioMeta
     [18](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:18) from ..environment import AudioCraftEnvironment
---> [19](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:19) from ..modules.conditioners import SegmentWithAttributes, ConditioningAttributes
     [22](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:22) logger = logging.getLogger(__name__)
     [25](file:///C:/users/name/tts/src/audiocraft/audiocraft/data/info_audio_dataset.py:25) def _clusterify_meta(meta: AudioMeta) -> AudioMeta:

File [c:\users\name\tts\src\audiocraft\audiocraft\modules\__init__.py:22](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/__init__.py:22)
     [20](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/__init__.py:20) from .lstm import StreamableLSTM
     [21](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/__init__.py:21) from .seanet import SEANetEncoder, SEANetDecoder
---> [22](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/__init__.py:22) from .transformer import StreamingTransformer

File [c:\users\name\tts\src\audiocraft\audiocraft\modules\transformer.py:23](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:23)
     [21](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:21) from torch.nn import functional as F
     [22](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:22) from torch.utils.checkpoint import checkpoint as torch_checkpoint
---> [23](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:23) from xformers import ops
     [25](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:25) from .rope import RotaryEmbedding
     [26](file:///C:/users/name/tts/src/audiocraft/audiocraft/modules/transformer.py:26) from .streaming import StreamingModule

File [c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\__init__.py:12](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:12)
      [9](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:9) import torch
     [11](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:11) from . import _cpp_lib
---> [12](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:12) from .checkpoint import (  # noqa: E402, F401
     [13](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:13)     checkpoint,
     [14](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:14)     get_optimal_checkpoint_policy,
     [15](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:15)     list_operators,
     [16](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:16)     selective_checkpoint_wrapper,
     [17](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:17) )
     [19](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:19) try:
     [20](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/__init__.py:20)     from .version import __version__  # noqa: F401

File [c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\checkpoint.py:464](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:464)
    [460](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:460)         self.counter += 1
    [461](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:461)         return self.optim_output[count] == 1
--> [464](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:464) class SelectiveCheckpointWrapper(ActivationWrapper):
    [465](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:465)     def __init__(self, mod, memory_budget=None, policy_fn=None):
    [466](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:466)         if torch.__version__ < (2, 1):

File [c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\checkpoint.py:481](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:481), in SelectiveCheckpointWrapper()
    [476](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:476)     # TODO: this should be enabled by default in PyTorch
    [477](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:477)     torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = (
    [478](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:478)         True
    [479](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:479)     )
--> [481](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:481) @torch.compiler.disable
    [482](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:482) def _get_policy_fn(self, *args, **kwargs):
    [483](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:483)     if not torch.is_grad_enabled():
    [484](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:484)         # no need to compute a policy as it won't be used
    [485](file:///C:/Users/NAME/miniconda3/envs/voicecraft/lib/site-packages/xformers/checkpoint.py:485)         return []

AttributeError: module 'torch' has no attribute 'compiler'

Description

Man, it drove me to insanity that almost every stage of inference_tts.ipynb has its own errors. I have tried troubleshooting with my knowledge of Python packaging and compatibility issues. Here is what I have counted:

 from data.tokenizer import (
    AudioTokenizer,
    TextTokenizer,
)

The instructions are unclear about where to put inference_tts.ipynb. I supposed it was supposed to be src/audiocraft/audiocraft/inference_tts.ipynb. Also: ImportError: attempted relative import beyond top-level package.

Adding an absolute import like the one below helps prevent this issue, but it raises another one: ModuleNotFoundError: No module named 'AudioTokenizer'

import sys
sys.path.append('C:\\Users\\NAME\\TTS\\src\\audiocraft\\audiocraft\\data')
# # load model, tokenizer, and other necessary files
from models import voicecraft
voicecraft_name="giga830M.pth"
ckpt_fn =f"./pretrained_models/{voicecraft_name}"
encodec_fn = "./pretrained_models/encodec_4cb2048_giga.th"
if not os.path.exists(ckpt_fn):
    os.system(f"wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/{voicecraft_name}\?download\=true")
    os.system(f"mv {voicecraft_name}\?download\=true ./pretrained_models/{voicecraft_name}")
if not os.path.exists(encodec_fn):
    os.system(f"wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th")
    os.system(f"mv encodec_4cb2048_giga.th ./pretrained_models/encodec_4cb2048_giga.th")

from models import voicecraft doesn't seem to be working like it should. Probably the same package issue as Stage 1's AudioTokenizer and TextTokenizer.

This is the one you see in the Error Code above. It is... ridiculous. AttributeError: module 'torch' has no attribute 'compiler' is usually caused by a torch version that does not include torch.compiler (the xformers code above checks for torch >= 2.1).

But hell, my transformers is 4.38.1, xformers is 0.0.25.post1, and my torch is 2.2.2+cu121, which supposedly should have torch.compiler. There may be other causes, and well, I don't have any ideas.

A minor one: what is the deal with apt-get install ffmpeg and apt-get install espeak-ng? They aren't recognized on my system. I think they are supposed to be Linux commands?

Post-script

You may find this entire issue reads like a rant, but no, I didn't mean it like that. Sure, it's a little bit hideous that all of this happened. But all things considered, it is an amazing project that could probably take its place alongside Coqui and Tortoise, especially the zero-shot part. It would be much more popular if someone eventually hooked it up to a webui, like rsxdalv's TTS Generation WebUI.

Of course, I would say we still need the fixes. You can ask me for more context or information if you need it to fix this.

Update 1 : fixing my wording

RealEdit Dataset Release

Hi,

Thank you for sharing this remarkable work.
I am wondering if there are any plans to make the RealEdit dataset publicly available.
I am interested in utilizing the RealEdit dataset for academic research purposes.

train.txt and validation.txt generation from extracted_codes_and_phonemes

Thanks for this amazing work to benefit the speech research community.

Just wondering, are the provided train.txt and validation.txt extracted from the XL split of Gigaspeech? In the manifest files, are the three columns "0 name codec_number"? Could you maybe also provide the script that generates them from the processed feature folder path/to/store_extracted_codes_and_phonemes, in case someone wants to test on a smaller dataset split or on a different dataset? Thank you.

License

Hey! Incredible work and results, and amazing due diligence in the paper - really appreciated, and putting together RealEdit for evaluating results and fairly training and comparing to other SoTA models is so nice to see! Thanks!

Wondering if you are planning on switching over to a more permissive license that would allow the use of your work in commercial practice? :-)

mfa: command not found

This line of code fails:

os.system(f"mfa align -j 1 --output_format csv {temp_folder} english_us_arpa english_us_arpa {align_temp}")

What does 'mfa' stand for?

Thanks
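(mfa is the Montreal Forced Aligner command-line tool; the Environment setup section installs it and its English models:)

conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa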

Validation loss Divergence?

(screenshot, 2024-03-28: loss curves)

Thanks for your great work!
Now I'm training a 100M VoiceCraft using LJSpeech and custom data (maybe 32 hours?).
But I faced an issue with validation loss divergence.

I think the cause is the delay stacking described in your paper, which changes the sequence every epoch. If the train accuracy of all 4 codebooks reaches 1, it is predicted that the validation loss will decrease.

For this reason, I have two questions.

  • Could you explain whether my training is right or not? (loss curve and analysis, etc.)
  • Could you share your train and validation curves?

Best regards

Seung Woo Yu

espeak not working as backend on Windows OS

hi there,

Thanks for open-sourcing this. I have everything installed and building perfectly, but espeak isn't supported on Windows; is there a way to use a different backend for the text tokenizer? I've tried nltk and failed :(

These two lines:

text_tokenizer = TextTokenizer(backend="espeak")
audio_tokenizer = AudioTokenizer(signature=encodec_fn)  # Will also put the neural codec model on GPU

Everything else is working perfectly.

Bash Error with New inference_tts.ipynb

Hello, I was going through the readme, and everything seemed to be working fine, until I got an error on cell 2. I'm not sure if it's an environment error, so apologies if that's the case.

Platform: Windows WSL2
File: inference_tts.ipynb
Code:

# install MFA models and dictionaries if you haven't done so already
!source ~/.bashrc && \
    conda activate voicecraft && \
    mfa model download dictionary english_us_arpa && \
    mfa model download acoustic english_us_arpa

Error: Output:

/bin/bash: line 1: /home/zak/.bashrc: No such file or directory
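(A possible workaround, assuming the voicecraft conda env already exists: skip sourcing ~/.bashrc, which may not exist for a fresh WSL user, and run the downloads directly:)

conda activate voicecraft
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa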

MFA not compatible with hugging face space?

I've been working on creating a hugging face space that uses Voicecraft, but there seems to be an issue where MFA can't be installed via conda since hugging face spaces only allow you to install via apt-get and pip. Have you guys figured out how to work around this issue?

colab demo

Can someone share a colab to test this

performance

On top hardware and with compilation, inference speed is still too slow to be competitive or to support real-time applications; a long sentence can take anywhere from 4 to 10 seconds.

I will say the quality is quite good, and the zero-shot capability is impressive.

metadata-generation-failed during environment setup

During environment setup, after running
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft

I get the following problem:
(screenshot of the error output)

Would you please help?
Thanks

examples are using cpu as default

Hi! Thanks for releasing the weights, I'm having some fun with the model so far!

I wanted to ask if there's a particular reason why the examples are currently set to use CPU instead of CUDA?

Thanks!
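(For reference, a minimal way to move inference to the GPU when one is available; model stands for whatever the notebook has loaded:)

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # move the loaded VoiceCraft model to the chosen device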

train other languages

It would be great to have some tips on how to train different languages... I have datasets of different languages and would be happy to train with those datasets, but I don't know where to start
