152334h / dl-art-school Goto Github PK

View Code? Open in Web Editor NEW

This project forked from neonbjb/dl-art-school

203.0 203.0 74.0 12.13 MB

TorToiSe fine-tuning with DLAS

License: GNU Affero General Public License v3.0

Python 99.69% Batchfile 0.31%

deep-learning machine-learning text-to-speech tortoise tts

dl-art-school's Introduction

`whoami`

I'm Sherman Chann (main), a former infosec/SWE guy now working as a Machine Learning Engineer at ElevenLabs.

AI Pivot

Some time in early 2022 -- around the time stuff like Copilot or Gato or Clippy were published, I realised what the important people had figured long ago: AGI is coming, nothing else matters, drop everything and work on this.

So, I did. Not in the strongest sense, but by the time ChatGPT rolled around, I had made:

Some TalkNet TTS models.
Minor contributions and friends from the Stable Diffusion community
Some basic fastapi wrappers around LLMs for Code generation and writing from huggingface

After that, I

wasted about a month attempting to build a GPT-3-enabled AI tutor.
Made some discord bot wrappers for multimodal LLMs like OpenFlamingo and MiniGPT-4
Involved myself with ML discord/twitter a lot. Eleuther/Nous.
Improved 🐢 TorToiSe-TTS with fast inference && fine-tuning

Many TTS companies reached out to me for my work on Tortoise, and I eventually accepted a few offers. Although I'm deeply interested in the frontier space of LLM development, working for private startups has mostly hindered me from making any public contributions to that space since ~April.

Pre-2022 era

If you're interested in the work I was doing (webdev, CTF, general software engineering, competitive programming, game dev, etc) prior to 2022, you can read a more comprehensive account at my about page.

dl-art-school's People

Contributors

Stargazers

Watchers

dl-art-school's Issues

path to save the model?

Where can I change the path to save the model? I'm training on Colab and I want it to save on Google Drive because when the environment disconnects, I lose all the checkpoints

Seeds and candidates.

A couple of question about seeds and candidates.

From my understanding, Seeds are random, and candidates are ordered from best to worst.

Is there a range of seed numbers? Or can it just be any number?
Is there a way to, say... have the playground spit out a number of candidates where each is a different seed? That would be helpful
when trying to find the right pronunciation or inflection.

Unless there is some other way to request emphasis (like html tags or italics or something.

Or does the Voice Directory do all the work on these things?

find out why triton (in lion) sometimes causes gpu stalling

set triton: true and optimizer: lion in conf to repro

CUDA out of memory

I'm getting following error when starting the training

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 0; 23.99 GiB total capacity; 20.44 GiB already allocated; 0 bytes free; 22.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF Drücken Sie eine beliebige Taste . . .

Training batch size is 188
validation batch size is 48
Training settings, 500
nothing else changed.

Can you add to a trained model?

Once a model is trained, can you add to it, or do you just need to rebuild the dataset with new entries and then retrain the whole thing?

Batch Size for 24 GB VRAM

What batch size do you use on 24 GB of vram?

How to resume training

Hi,
Thank you for your work.

I was able to follow the steps given in Readme.md file and start training the autoregressive.pth. But, my training crashed mid-way due to lack of server space. Therefore, I would like to resume training from the latest epoch.

My question is : - Should I just set the 'resume_state' value to point to the latest training state file
OR
Should I also change the 'pretrain_model_gpt' to point to the latest epoch PTH file rather than the downloaded autoregressive.pth

Please help RecursionError: maximum recursion depth exceeded while calling a Python object

what ever I do I always get this error

Is there anyone out there to give a hand
RecursionError: maximum recursion depth exceeded while calling a Python object

Press any key to continue . . .

Multispeaker dataset

What should a dataset for Multispeaker look like?
Should each speaker have an identifier at the end, for example:

ModuleNotFoundError: No module named 'lion_pytorch'

As of this morning after the update - at the start of the training, it exists with the message - ModuleNotFoundError: No module named 'lion_pytorch'
Double checked, lion_pytorch exists under Python.

Work with tortoise-tts?

This great project. I wonder can I use with normal tortoise or is this require tortoise fast?

Clipped ending or doubled ending

I'm having awesome results with fine-tuning datasets, but I am running into a couple issues:

If I enter more than one sentence of text, the audio of the second line repeats itself. Not a big deal and easily editable but is there a setting/config change that might fix this?
The last word of dialogue is clipped short. The above issue helps because I can delete the repeated phrase, but in cases where it's just one sentence... Is there a setting that can be changed to fix this?

finetune on single speaker

Hi @152334H is there a way to finetune on single speaker but keep the model's zero shot capacity on other speakers as well?

The addition of 'bitsandbytes' may have broke training

Yesterday and this morning I was troubleshooting a training issue, but nonetheless DLAS was working. This afternoon I saw an update in the Windows GUI, so I applied it. Ever since then, I've been running into this issue:

Environment name is set as "DLAS" as per environment.yaml
anaconda3/miniconda3 detected in C:\ProgramData\miniconda3
Starting conda environment "DLAS" from C:\ProgramData\miniconda3
Latest git hash: 43f445d
Disabled distributed training.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C')}
  warn(
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\james\.conda\envs\DLAS did not contain libcudart.so as expected! Searching further paths...
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
  warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
Traceback (most recent call last):
  File "C:\Users\james\Desktop\DL-Art-School\codes\train.py", line 386, in <module>
    trainer.init(args.opt, opt, args.launcher)
  File "C:\Users\james\Desktop\DL-Art-School\codes\train.py", line 38, in init
    maybe_bnb.populate()
  File "C:\Users\james\Desktop\DL-Art-School\codes\maybe_bnb.py", line 15, in populate
    import bitsandbytes as bnb
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
    from .autograd._functions import (
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\autograd\_functions.py", line 5, in <module>
    import bitsandbytes.functional as F
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\functional.py", line 13, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 41, in <module>
    lib = CUDALibrary_Singleton.get_instance().lib
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 37, in get_instance
    cls._instance.initialize()
  File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 31, in initialize
    self.lib = ct.cdll.LoadLibrary(binary_path)
  File "C:\Users\james\.conda\envs\DLAS\lib\ctypes\__init__.py", line 452, in LoadLibrary
    return self._dlltype(name)
  File "C:\Users\james\.conda\envs\DLAS\lib\ctypes\__init__.py", line 364, in __init__
    if '/' in name or '\\' in name:
TypeError: argument of type 'WindowsPath' is not iterable
Press any key to continue . . .

Initially I thought maybe it was CUDA, Miniconda, or Python since I had so many different versions installed and probably broken libraries/packages. I uninstalled everything, started with a clean slate, and I still get this error. The longer I look into it, the more it seems to be related to 'bitsandbytes' given the stack trace and the commit history in the last push showing that it was recently added

Reverting back to a previous commit works:

git checkout 83b901c656447126d5a0877639d394335204e1ac

This is Windows 10, Python 10, CUDA 11.7.

RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\caffe2\serialize\inline_container.cc:450] . PytorchStreamWriter failed writing file data/10: file write failed

Hello!
So I tried it and when I launched it for the first time everything went fine, the results were great.
But when I tried to train some othe voice I got this error

Is there anything I can do?
I'm sorry maybe this is very stupid issue but I'm complete noob at this

ModuleNotFoundError: No module named 'tortoise'

I get that error when trying to create voice/text in the playground, when I click generate nothing happens and that error is in the cli. I've already deleted the tortoise-tts-fast folder so it could be installed again, didn't fix it.

EDIT:

could be related, this is during the install of tortoise

  × Running setup.py install for pesq did not run successfully.
  │ exit code: 1
  ╰─> [25 lines of output]
      C:\Users\Dominik\anaconda3\envs\DLAS\lib\site-packages\setuptools\__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated. Requirements should be satisfied by a PEP 517 installer. If you are using pip, you can try `pip install --use-pep517`.
        dist.fetch_build_eggs(dist.setup_requires)
      running install
      C:\Users\Dominik\anaconda3\envs\DLAS\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\pesq
      copying pesq\_pesq.py -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\__init__.py -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\cypesq.pyx -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\dsp.h -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesq.h -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesqio.h -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesqmain.h -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesqpar.h -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\dsp.c -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesqdsp.c -> build\lib.win-amd64-cpython-310\pesq
      copying pesq\pesqmod.c -> build\lib.win-amd64-cpython-310\pesq
      running build_ext
      skipping 'pesq\cypesq.c' Cython extension (up-to-date)
      building 'cypesq' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pesq

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

English with an accent.

Could you replace the autoregressive.pth file with a custom file (still named autoregressive.pth) that was trained only on speakers with a specific accent (Southern, or German for example)... and then train an accented speaker against that? Or would it break.

The problem is the original custom file would have been trained against the original autoregressive file.

It seems then the processing doesn't "understand" the accent, it just makes it sound British. At best, accents are smoothed out, like an average between American English and whatever accent is applied...trying to get a thick accent on a few voices.

Separating

This is really a great effort and help.

Just a question about training models

Since training generates huge files, and you can only link one (I'm guessing the last one created) can the rest be deleted to save space? I'm new to this, but I'm assuming each subsequent .pth file is "better" than the one before it?

Correctly load similar clips for fine-tuning

PROBLEM

Let's start from the EXAMPLE_gpt.yml file.

There's a parameter named num_conditioning_candidates under datasets.train && datasets.val. It (ostensibly) determines the number of conditioning wav files that are piped to the gpt model during training.

How does it work? In paired_voice_audio_dataset.py, there's a line in __getitem__ that grabs num_conditioning_candidates similar clips from the current dataset:

            cond, cond_is_self = load_similar_clips(self.audiopaths_and_text[index][0], self.conditioning_length, self.sample_rate,
                                      n=self.conditioning_candidates) if self.load_conditioning else (None, False)

Or at least, that's what it's supposed to in theory. In practice, it ignores the value of n passed, because of the lack of a file named similarities.pth:

def load_similar_clips(path, sample_length, sample_rate, n=3, fallback_to_self=True):
    sim_path = os.path.join(os.path.dirname(path), 'similarities.pth')
    candidates = []
    if os.path.exists(sim_path): # obviously ignored
    if len(candidates) == 0: # true
        if fallback_to_self: # true
            candidates = [path] # ONLY 1 CANDIDATE USED
    #...

(i also printed out the values for cond_is_self and verified that it's true right now)

The similarities.pth file(s?) are supposed to be generated by a script titled phase_3_generate_similarities.py. I saw the preparation scripts earlier, but to be real I have no idea how to use them yet.

why is this a problem?

Plausibly could lead to the cheater latents problem on larger datasets. Also probably reduces the zero shot vc capabilities.

Discrepencies between unifiedvoice2 and tortoise-tts

Two things I have discovered so far:

the wav_lengths are supposed to be multiplied by self.mel_length_compression
the things returned on return_latent are supposed to be subscripted with -2, not -1

I might just grab the definition from tortoise-tts instead.

Best Training settings batch size steps etc

Is there somewhere a board discussion where people talk about the "best" training settings, or at least what worked for them.
How many steps are optimal which batch sizes. For example. I've around 40 minutes of audio as sample which I will beforehand run through ozone to get the dataset, but then it's the questing what setting to use for a dataset like that to get a perfect/good outcome.

CUDA OOM Issues

Sorry to be here again.

I have a 3070 8GB

Now my dataset is fine. I keep getting cuda errros. I've identified 3 places in the yml I can edit to reduce batch sizes but even putting it to 1 gets me an error.

I've also tried changing mega_batch_factor: as your notes.

I tried a much smaller dataset of 600 wav files.

I get this :

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ H:\DL-Art-School\codes\train.py:370 in <module> │ │ │ │ 367 │ │ torch.cuda.set_device(torch.distributed.get_rank()) │ │ 368 │ │ │ 369 │ trainer.init(args.opt, opt, args.launcher) │ │ ❱ 370 │ trainer.do_training() │ │ 371 │ │ │ │ H:\DL-Art-School\codes\train.py:325 in do_training │ │ │ │ 322 │ │ │ │ │ 323 │ │ │ _t = time() │ │ 324 │ │ │ for train_data in tq_ldr: │ │ ❱ 325 │ │ │ │ self.do_step(train_data) │ │ 326 │ │ │ 327 │ def create_training_generator(self, index): │ │ 328 │ │ self.logger.info('Start training from epoch: {:d}, iter: {:d}'.format(self.start │ │ │ │ H:\DL-Art-School\codes\train.py:206 in do_step │ │ │ │ 203 │ │ │ print("Update LR: %f" % (time() - _t)) │ │ 204 │ │ _t = time() │ │ 205 │ │ self.model.feed_data(train_data, self.current_step) │ │ ❱ 206 │ │ gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_g │ │ 207 │ │ iteration_rate = (time() - _t) / batch_size │ │ 208 │ │ if self._profile: │ │ 209 │ │ │ print("Model feed + step: %f" % (time() - _t)) │ │ │ │ H:\DL-Art-School\codes\trainer\ExtensibleTrainer.py:302 in optimize_parameters │ │ │ │ 299 │ │ │ new_states = {} │ │ 300 │ │ │ self.batch_size_optimizer.focus(net) │ │ 301 │ │ │ for m in range(self.batch_factor): │ │ ❱ 302 │ │ │ │ ns = step.do_forward_backward(state, m, step_num, train=train_step, no_d │ │ 303 │ │ │ │ # Call into post-backward hooks. │ │ 304 │ │ │ │ for name, net in self.networks.items(): │ │ 305 │ │ │ │ │ if hasattr(net.module, "after_backward"): │ │ │ │ H:\DL-Art-School\codes\trainer\steps.py:214 in do_forward_backward │ │ │ │ 211 │ │ local_state = {} # <-- Will store the entire local state to be passed to inject │ │ 212 │ │ new_state = {} # <-- Will store state values created by this step for returning │ │ 213 │ │ for k, v in state.items(): │ │ ❱ 214 │ │ │ local_state[k] = v[grad_accum_step] │ │ 215 │ │ local_state['train_nets'] = str(self.get_networks_trained()) │ │ 216 │ │ loss_accumulator = self.loss_accumulator if loss_accumulator is None else loss_a │ │ 217 │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ IndexError: list index out of range

it's seemingly doing nothing, any idea as to why?

https://files.catbox.moe/vet45t.png

Add Colab notebook to repository

We should add colab notebook to repository to have history of changes in notebook and so we could add additional fields to colab (we don't have access to Colab notebook which is shared)

I can create PR with improved Colab notebook.

Error no kernel image is available for execution on the device

Hi, when I try to train, it throws an error:

C:\Users\PC\anaconda3\envs\DLAS\lib\site-packages\torch\optim\lr_scheduler.py:138: 
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. 
In PyTorch 1.1.0 and later, you should call them in the opposite order: 
`optimizer.step()` before `lr_scheduler.step()`.  
Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. 
See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Error no kernel image is available for execution on the device at line 167 in file 
D:\ai\tool\bitsandbytes\csrc\ops.cu
Press any key to continue

I thought this may be wrong CUDA version installed but I checked and my 1070 seem to be compatible with with CUDA 8+.

Anyone encountered this error and found a way to fix it?
Thanks.

Triton in Requirements File Doesnt Work

Removing Triton from the end of the 'requirements.laxed_edited.txt' file
and manually installing triton using
pip install -U --pre triton
allows the program to function

special symbols in text

Notable problem of the base tortoise model: it does not handle *emphasis* or special symbols like + or = or & elegantly. Lines like:

8 + 5 = 13
him & his wife
I *really* do not like this

all perform oddly poorly.

I am not sure if fine-tuning could fix this. Personally, I think you would need to at least fine-tune both the CLVP && GPT model to fix this.

xxx_cleaners

Hello,

I want to train another language, after a few hours of training it is average, so I would need to change "xxx_cleaners". I understand that by default it uses "english_cleaners". It is used in several files, where can I change it to make the change effective? Can I add an option to change the cleaners in the "EXAMPLE_gpt.yml" file? Another problem arises because there is no "xxx_cleaners" in the synthesis anywhere, which limits the synthesis due to the lack of symbols from another language.

Got error trying to test the fine-tuned model

Hi, thanks very much for providing this amazing repo!

I have tried a little bit of fine-tuning the model, and ran the tortoise_tts.py script using the command below

./scripts/tortoise_tts.py --preset fast --ar_checkpoint $model_path -o test.wav --text "Hello, how are you?"

I got the error below:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ps/Documents/voice-test/voice/./tortoise-tts-fast/scripts/tortoise_tts.py:240 in <module>      │
│                                                                                                  │
│   237 │   #     app = import_module("app")                                                       │
│   238 │   #     sys.exit(app.main())                                                             │
│   239 │                                                                                          │
│ ❱ 240 │   from tortoise.inference import (                                                       │
│   241 │   │   check_pydub,                                                                       │
│   242 │   │   get_all_voices,                                                                    │
│   243 │   │   get_seed,                                                                          │
│                                                                                                  │
│ /home/ps/Documents/voice-test/voice/tortoise-tts-fast/tortoise/inference.py:167 in <module>          │
│                                                                                                  │
│   164                                                                                            │
│   165 from voicefixer import VoiceFixer                                                          │
│   166                                                                                            │
│ ❱ 167 vfixer = VoiceFixer()                                                                      │
│   168                                                                                            │
│   169                                                                                            │
│   170 def save_gen_with_voicefix(g, fpath, squeeze=True, voicefixer=True):                       │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/base.py:13 in       │
│ __init__                                                                                         │
│                                                                                                  │
│    10 class VoiceFixer(nn.Module):                                                               │
│    11 │   def __init__(self):                                                                    │
│    12 │   │   super(VoiceFixer, self).__init__()                                                 │
│ ❱  13 │   │   self._model = voicefixer_fe(channels=2, sample_rate=44100)                         │
│    14 │   │   # print(os.path.join(os.path.expanduser('~'), ".cache/voicefixer/analysis_module   │
│    15 │   │   self.analysis_module_ckpt = os.path.join(                                          │
│    16 │   │   │   │   │   os.path.expanduser("~"),                                               │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/restorer/model.py:1 │
│ 80 in __init__                                                                                   │
│                                                                                                  │
│   177 │   │   # self.am = AudioMetrics()                                                         │
│   178 │   │   # self.im = ImgMetrics()                                                           │
│   179 │   │                                                                                      │
│ ❱ 180 │   │   self.vocoder = Vocoder(sample_rate=44100)                                          │
│   181 │   │                                                                                      │
│   182 │   │   self.valid = None                                                                  │
│   183 │   │   self.fake = None                                                                   │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/base.py:19  │
│ in __init__                                                                                      │
│                                                                                                  │
│   16 │   │   │   raise RuntimeError("Error 1: The checkpoint for synthesis module / vocoder (    │
│   17 │   │   │   │   │   │   │   │   By default the checkpoint should be download automatical    │
│   18 │   │   │   │   │   │   │   │   But don't worry! Alternatively you can download it direc    │
│ ❱ 19 │   │   self._load_pretrain(Config.ckpt)                                                    │
│   20 │   │   self.weight_torch = Config.get_mel_weight_torch(percent=1.0)[                       │
│   21 │   │   │   None, None, None, ...                                                           │
│   22 │   │   ]                                                                                   │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/base.py:26  │
│ in _load_pretrain                                                                                │
│                                                                                                  │
│   23 │                                                                                           │
│   24 │   def _load_pretrain(self, pth):                                                          │
│   25 │   │   self.model = Generator(Config.cin_channels)                                         │
│ ❱ 26 │   │   checkpoint = load_checkpoint(pth, torch.device("cpu"))                              │
│   27 │   │   load_try(checkpoint["generator"], self.model)                                       │
│   28 │   │   self.model.eval()                                                                   │
│   29 │   │   self.model.remove_weight_norm()                                                     │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/model/util. │
│ py:111 in load_checkpoint                                                                        │
│                                                                                                  │
│   108                                                                                            │
│   109                                                                                            │
│   110 def load_checkpoint(checkpoint_path, device):                                              │
│ ❱ 111 │   checkpoint = torch.load(checkpoint_path, map_location=device)                          │
│   112 │   return checkpoint                                                                      │
│   113                                                                                            │
│   114                                                                                            │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/torch/serialization.py:777 in  │
│ load                                                                                             │
│                                                                                                  │
│    774 │   │   │   # If we want to actually tail call to torch.jit.load, we need to              │
│    775 │   │   │   # reset back to the original position.                                        │
│    776 │   │   │   orig_position = opened_file.tell()                                            │
│ ❱  777 │   │   │   with _open_zipfile_reader(opened_file) as opened_zipfile:                     │
│    778 │   │   │   │   if _is_torchscript_zip(opened_zipfile):                                   │
│    779 │   │   │   │   │   warnings.warn("'torch.load' received a zip file that looks like a To  │
│    780 │   │   │   │   │   │   │   │     " dispatching to 'torch.jit.load' (call 'torch.jit.loa  │
│                                                                                                  │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/torch/serialization.py:282 in  │
│ __init__                                                                                         │
│                                                                                                  │
│    279                                                                                           │
│    280 class _open_zipfile_reader(_opener):                                                      │
│    281 │   def __init__(self, name_or_buffer) -> None:                                           │
│ ❱  282 │   │   super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_bu  │
│    283                                                                                           │
│    284                                                                                           │
│    285 class _open_zipfile_writer_file(_opener):                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Just wondering what was the issue? When I tried the first time I saw the script tried to download some models but I was running tortoise for some time already and the .cache/tortoise/models folder had the models, so I cancelled the running and when I execute the command again, it didn't run on the downloading bit, so maybe because the file was already created but not complete? If that's the case, just wondering where did the script download the additional models? Thanks very much for looking into my issue.

TypeError: 'NoneType' object is not subscriptable

3-02-21 09:05:37.187 - INFO: [epoch: 0, iter: 0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 8.0000e+00 megasamples: 8.0000e-06 iteration_rate: 3.6583e-01 loss_text_ce: 4.2470e+00 loss_mel_ce: 2.9319e+00 loss_gpt_total: 2.9744e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 8.0000e+00 percent_skipped_samples: 1.1111e-01 percent_conditioning_is_self: 8.8889e-01 gpt_conditioning_encoder: 4.6525e+00 gpt_gpt: 4.9838e+00 gpt_heads: 5.3694e+00 23-02-21 09:05:37.187 - INFO: Saving models and training states. 0%| | 0/1 [00:08<?, ?it/s] ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮ │ H:\DL-Art-School\codes\train.py:381 in <module> │ │ │ │ 378 │ │ torch.cuda.set_device(torch.distributed.get_rank()) │ │ 379 │ │ │ 380 │ trainer.init(args.opt, opt, args.launcher) │ │ ❱ 381 │ trainer.do_training() │ │ 382 │ │ │ │ H:\DL-Art-School\codes\train.py:336 in do_training │ │ │ │ 333 │ │ │ │ │ 334 │ │ │ _t = time() │ │ 335 │ │ │ for train_data in tq_ldr: │ │ ❱ 336 │ │ │ │ self.do_step(train_data) │ │ 337 │ │ │ 338 │ def create_training_generator(self, index): │ │ 339 │ │ self.logger.info('Start training from epoch: {:d}, iter: {:d}'.format(self.start │ │ │ │ H:\DL-Art-School\codes\train.py:263 in do_step │ │ │ │ 260 │ │ │ │ │ self.logger.info('Saving models and training states.') │ │ 261 │ │ │ │ else: │ │ 262 │ │ │ │ │ self.logger.info('Saving model.') │ │ ❱ 263 │ │ │ │ if opt['upgrades']['number_of_checkpoints_to_save'] > 0: │ │ 264 │ │ │ │ │ self.logger.info( │ │ 265 │ │ │ │ │ │ f"Leaving only {opt['upgrades']['number_of_checkpoints_to_save'] │ │ 266 │ │ │ │ │ ) │ ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯ TypeError: 'NoneType' object is not subscriptable

Using the latest version of DLAS but seem to get this error on a small dataset. I used the previous version with little issues. Not sure what this means?

Pip subprocess error

Getting an error on Windows when running Startup DLAS. The installer proceeds after and says completed but then I am missing many dependencies when I try to run "Start DLAS". I can get past them by manually installing each, getting into the UI and starting training but I eventually get an error where it cannot import FixedPositionalEmbedding from x_transformers.

Error is:

Collecting bitsandbytes
  Using cached bitsandbytes-0.37.0-py3-none-any.whl (76.3 MB)
Collecting lion-pytorch==0.0.7
  Using cached lion_pytorch-0.0.7-py3-none-any.whl (4.3 kB)

Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.0rc1 Requires-Python >=3.7,<3.10; 1.7.0rc2 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement triton==2.0.0a2 (from versions: none)
ERROR: No matching distribution found for triton==2.0.0a2

failed

CondaEnvException: Pip failed

Fixing Tensorboard...

DVAE backup link

this will definitely be taken down at some point; open an issue with a backup link and i'll replace it

https://huggingface.co/Gatozu35/tortoise-tts/resolve/main/dvae.pth

Please remove .idea folders from repository

Hi,
it's nice that you are using Pycharm (I'm using it too) but your setting are in conflict with mine and every time I want to create some change (like now for yesterday merged changes related with disabling states generation and my changes related with saving only X last checkpoints and x last states) I have to remove your settings, add my own and reload Pycharm virtual environment settings, or I need to fix conflicts. It's very annoying.

It would be nice to remove .idea/ folders from repository (there are two - in main folder and in codes subfolder) and add it to file .gitignore so every one of us could use their own settings (and this folder shouldn't be in repository at all because it's not related with project, so you know... :) ).

I can create pull request for if it it helps.

Thank you for your help :)

Can't train dataset , when I press "start training" it says this

What do I do to fix the error? I can anaconda as admin each time before starting the env, what do I do

hi i get this error message.. can anyone pls help?

Train other models in the pipeline

Apart from the GPT model (which has been implemented), there are 4 other models in TorToiSe that could be fine-tuned:

the VQVAE, which learns how to encode the training data,
CLVP, which determines how closely a spoken line matches speech tokens,
the diffuser, which learns how to decompress speech latents into spectrograms,
Unvinet, the vocoder, which convers spectrograms to sound.

IMO, the diffusion model + vocoder are obvious targets. Vocoders are often fine-tuned in other tts pipelines, and the diffusion model serves roughly the same purpose...

...but, the diffusion model is the only other model that takes the conditioning latents into account. I suspect that fine-tuning both the autoregressive & diffuser models on a single speaker would lead to a kind of 'mode collapse' (bear with this inaccurate phrasing), where the conditioning latents fail to affect the output speech substantially. Ideally, some form of mixed speaker training would account for this, but I'm not sure how to accomplish that yet.

Training the VQVAE could be good for datasets that are emotional, and substantially different from the normal LJSpeech+libretts+commonvoice+voxpopuli+... pile of monotonic speech. But I think it would necessitate a parallel training of the GPT model + the CLVP model as well, to account for the change in tokens outputted.

I also think that keeping the CLVP model untrained could be a good idea to retain the power of conditioning latents. Fine-tuning it on a single voice would adjust it to see that specific speaker as more likely than other speakers.

Faster training

As a sister project of tortoise-tts-fast, it would be great if the performance of training code could be improved as well.

Things to investigate:

any compatibilities between inference speed code vs training code
viability of fp16
are injectors a bottleneck, or are they faster than training?
more ideas

figure out how to enable validation dataset

Title

Figure out the best training hyperparameters

The numbers written in ./experiments/EXAMPLE_gpt.yml were picked completely at random! It is very likely the numbers can be better, so long as people are willing to test and see what works.

Please post results here if you change any of the parameters, even if it completely fails!

DLAS throwing a torch._six error.

I installed the UI using the .setup DLAS.bat script and things were going well...followed along with the video tutorial and created a dataset, but when I click Start Training, it gives me a traceback error:

Traceback (most recent call last):
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 12, in
from data.data_sampler import DistIterSampler
File "C:\Users\oldgu\DL-Art-School\codes\data_init_.py", line 6, in
from utils.util import opt_get
File "C:\Users\oldgu\DL-Art-School\codes\utils\util.py", line 25, in
from torch._six import inf
ModuleNotFoundError: No module named 'torch._six'

A quick look around and it looks like that module is deprecated? Is this a pytorch 2.0 issue or a me issue? Do I need to downgrade pytorch? I'm afraid it would break the whole thing.
I'm running Windows 11 with Miniconda 3

The process cannot access the file because it is being used by another process

Getting this error when I start training:

Disabled distributed training.
Path already exists. Rename it to [C:\Users\Yuri\DL-Art-School\experiments_archived_230428-154734]
Traceback (most recent call last):
File "C:\Users\Yuri\DL-Art-School\codes\train.py", line 380, in
trainer.init(args.opt, opt, args.launcher)
File "C:\Users\Yuri\DL-Art-School\codes\train.py", line 51, in init
util.mkdir_and_rename(
File "C:\Users\Yuri\DL-Art-School\codes\utils\util.py", line 112, in mkdir_and_rename
os.rename(path, new_name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Yuri\DL-Art-School\experiments\' -> 'C:\Users\Yuri\DL-Art-School\experiments\_archived_230428-154734'
Press any key to continue . . .

Tried rolling back out of BitsandBytes and a handful of other things, but nothing working.

text_ce and mel_ce loss

it seems when i train the model, the training loss_text_ce and loss_mel_ce_loss as well as validation loss val_loss_text_ce is trending down as expected but the other validation loss val_loss_mel_ce is trending up.

The training set and validation are quite similar. The data set is small though about 100 wavs each with length at most 8 seconds.

Any idea what might cause this ? Below is my experiment with val_text_ce loss and val_mel_ce loss.

Thanks,

Friendly training process (COLAB NOTEBOOK)

Right now, the process of setting up config ymls and datasets and so on is fairly complicated.

It would be a lot easier for the general public to use if,

there was a simple UI to adjust model configs
you could use colab to do training

I am not familiar with colab notebooks (hard nvim user), so I cannot help much with this.

Possible way to avoide English accent while training on other languages

Hello!

Thanks for great work! I was trying to finetune a model using non English datasets (Russian, etc.). The resulting voice is really good, but I keep getting the result with super strong English accent even after long training. Are there any possible ways to reduce the accent (or ideally get rid of it)?
I guess that the problem is because of the fine-tuning process using English model..

Very detailed tutorial up-to-date

I made a pull request too please accept if possible : #71

Master Deep Voice Cloning in Minutes: Unleash Your Vocal Superpowers! Free and Locally on Your PC

This tutorial is based on

Ozen Toolkit for data preprocessing
DLAS for Training
Tortoise TTS Fast for speech synthesis

Divide by Zero error

Any idea why this is happening?

===============================================
CUDA SETUP: Loading binary C:\Users\oldgu\miniconda3\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
23-03-22 20:25:03.393 - INFO: name: Test
model: extensibletrainer
scale: 1
gpu_ids: [0]
start_step: -1
checkpointing_enabled: True
fp16: False
use_8bit: True
wandb: False
use_tb_logger: True
datasets:[
train:[
name: Test
n_workers: 8
batch_size: 1
mode: paired_voice_audio
path: C:/Users/oldgu/ozen-toolkit/output/Vincent_AGraveyardofGhostTales.wav_2023_03_22-20_05\train.txt
fetcher_mode: ['lj']
phase: train
max_wav_length: 255995
max_text_length: 200
sample_rate: 22050
load_conditioning: True
num_conditioning_candidates: 2
conditioning_length: 44000
use_bpe_tokenizer: True
load_aligned_codes: False
data_type: img
]
val:[
name: Test
n_workers: 1
batch_size: 1
mode: paired_voice_audio
path: C:/Users/oldgu/ozen-toolkit/output/Vincent_AGraveyardofGhostTales.wav_2023_03_22-20_05\valid.txt
fetcher_mode: ['lj']
phase: val
max_wav_length: 255995
max_text_length: 200
sample_rate: 22050
load_conditioning: True
num_conditioning_candidates: 2
conditioning_length: 44000
use_bpe_tokenizer: True
load_aligned_codes: False
data_type: img
]
]
steps:[
gpt_train:[
training: gpt
loss_log_buffer: 500
optimizer: adamw
optimizer_params:[
lr: 1e-05
triton: False
weight_decay: 0.01
beta1: 0.9
beta2: 0.96
]
clip_grad_eps: 4
injectors:[
paired_to_mel:[
type: torch_mel_spectrogram
mel_norm_file: ../experiments/clips_mel_norms.pth
in: wav
out: paired_mel
]
paired_cond_to_mel:[
type: for_each
subtype: torch_mel_spectrogram
mel_norm_file: ../experiments/clips_mel_norms.pth
in: conditioning
out: paired_conditioning_mel
]
to_codes:[
type: discrete_token
in: paired_mel
out: paired_mel_codes
dvae_config: ../experiments/train_diffusion_vocoder_22k_level.yml
]
paired_fwd_text:[
type: generator
generator: gpt
in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
out: ['loss_text_ce', 'loss_mel_ce', 'logits']
]
]
losses:[
text_ce:[
type: direct
weight: 0.01
key: loss_text_ce
]
mel_ce:[
type: direct
weight: 1
key: loss_mel_ce
]
]
]
]
networks:[
gpt:[
type: generator
which_model_G: unified_voice2
kwargs:[
layers: 30
model_dim: 1024
heads: 16
max_text_tokens: 402
max_mel_tokens: 604
max_conditioning_inputs: 2
mel_length_compression: 1024
number_text_tokens: 256
number_mel_codes: 8194
start_mel_token: 8192
stop_mel_token: 8193
start_text_token: 255
train_solo_embeddings: False
use_mel_codes_as_input: True
checkpointing: True
]
]
]
path:[
pretrain_model_gpt: ../experiments/autoregressive.pth
strict_load: True
root: C:\Users\oldgu\DL-Art-School
experiments_root: C:\Users\oldgu\DL-Art-School\experiments\Test
models: C:\Users\oldgu\DL-Art-School\experiments\Test\models
training_state: C:\Users\oldgu\DL-Art-School\experiments\Test\training_state
log: C:\Users\oldgu\DL-Art-School\experiments\Test
val_images: C:\Users\oldgu\DL-Art-School\experiments\Test\val_images
]
train:[
niter: 200
warmup_iter: -1
mega_batch_factor: 1
val_freq: 500
default_lr_scheme: MultiStepLR
gen_lr_steps: [100, 200, 280, 360]
lr_gamma: 0.5
ema_enabled: False
manual_seed: 1337
]
eval:[
output_state: gen
injectors:[
gen_inj_eval:[
type: generator
generator: generator
in: hq
out: ['gen', 'codebook_commitment_loss']
]
]
]
logger:[
print_freq: 10
save_checkpoint_freq: 10
visuals: ['gen', 'mel']
visual_debug_rate: 500
is_mel_spectrogram: True
disable_state_saving: False
]
upgrades:[
number_of_checkpoints_to_save: 0
number_of_states_to_save: 0
]
is_train: True
dist: False

23-03-22 20:25:03.538 - INFO: Random seed: 1337
Traceback (most recent call last):
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 398, in
trainer.init(args.opt, opt, args.launcher)
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 121, in init
self.total_epochs = int(math.ceil(total_iters / train_size))
ZeroDivisionError: division by zero
Press any key to continue . . .

Dataset path is not resolving in Windows

The dataset path is not resolving; however, I can retrain on the older sets. Double checked the dataset, it has valid and train files and wavs folder.

Error setting up training

I would like to check this is the correct format for the path to wavs/transcripts :

datasets:
  train:
    name: harry
    n_workers: 8 # idk what this does
    batch_size: 32 # This leads to ~16GB of vram usage on my 3090.
    mode: paired_voice_audio
    path: H:/DL-Art-School/experiments/harry.zip
    fetcher_mode: ['lj'] # CHANGEME if your dataset isn't in LJSpeech format
    phase: train
    max_wav_length: 255995
    max_text_length: 200
    sample_rate: 22050
    load_conditioning: True
    num_conditioning_candidates: 2
    conditioning_length: 44000
    use_bpe_tokenizer: True
    load_aligned_codes: False
  val:
    name: harry
    n_workers: 1
    batch_size: 32 # this could be higher probably
    mode: paired_voice_audio
    path: H:/DL-Art-School/experiments/harry.txt

Should it point towards the zip file and text file directly like above?

I check getting this error after running python3 train.py -opt ../experiments/EXAMPLE_gpt.yml :

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ H:\DL-Art-School\codes\train.py:369 in <module>                                                  │
│                                                                                                  │
│   366 │   │   trainer.rank = torch.distributed.get_rank()                                        │
│   367 │   │   torch.cuda.set_device(torch.distributed.get_rank())                                │
│   368 │                                                                                          │
│ ❱ 369 │   trainer.init(args.opt, opt, args.launcher)                                             │
│   370 │   trainer.do_training()                                                                  │
│   371                                                                                            │
│                                                                                                  │
│ H:\DL-Art-School\codes\train.py:130 in init                                                      │
│                                                                                                  │
│   127 │   │   │   │   │   self.logger.info('Total epochs needed: {:d} for iters {:,d}'.format(   │
│   128 │   │   │   │   │   │   self.total_epochs, total_iters))                                   │
│   129 │   │   │   elif phase == 'val':                                                           │
│ ❱ 130 │   │   │   │   self.val_set, collate_fn = create_dataset(dataset_opt, return_collate=Tr   │
│   131 │   │   │   │   self.val_loader = create_dataloader(self.val_set, dataset_opt, opt, None   │
│   132 │   │   │   │   if self.rank <= 0:                                                         │
│   133 │   │   │   │   │   self.logger.info('Number of val images in [{:s}]: {:d}'.format(        │
│                                                                                                  │
│ H:\DL-Art-School\codes\data\__init__.py:107 in create_dataset                                    │
│                                                                                                  │
│   104 │   │   │   collate = C()                                                                  │
│   105 │   else:                                                                                  │
│   106 │   │   raise NotImplementedError('Dataset [{:s}] is not recognized.'.format(mode))        │
│ ❱ 107 │   dataset = D(dataset_opt)                                                               │
│   108 │                                                                                          │
│   109 │   if return_collate:                                                                     │
│   110 │   │   return dataset, collate                                                            │
│                                                                                                  │
│ H:\DL-Art-School\codes\data\audio\paired_voice_audio_dataset.py:156 in __init__                  │
│                                                                                                  │
│   153 │   │   │   │   fetcher_fn = load_voxpopuli                                                │
│   154 │   │   │   else:                                                                          │
│   155 │   │   │   │   raise NotImplementedError()                                                │
│ ❱ 156 │   │   │   self.audiopaths_and_text.extend(fetcher_fn(p, type))                           │
│   157 │   │   self.text_cleaners = hparams.text_cleaners                                         │
│   158 │   │   self.sample_rate = hparams.sample_rate                                             │
│   159 │   │   random.seed(hparams.seed)                                                          │
│                                                                                                  │
│ H:\DL-Art-School\codes\models\audio\tts\tacotron2\taco_utils.py:31 in                            │
│ load_filepaths_and_text_type                                                                     │
│                                                                                                  │
│   28                                                                                             │
│   29 def load_filepaths_and_text_type(filename, type, split="|"):                                │
│   30 │   with open(filename, encoding='utf-8') as f:                                             │
│ ❱ 31 │   │   filepaths_and_text = [list(line.strip().split(split)) + [type] for line in f]       │
│   32 │   │   base = os.path.dirname(filename)                                                    │
│   33 │   │   for j in range(len(filepaths_and_text)):                                            │
│   34 │   │   │   filepaths_and_text[j][0] = os.path.join(base, filepaths_and_text[j][0])         │
│                                                                                                  │
│ H:\DL-Art-School\codes\models\audio\tts\tacotron2\taco_utils.py:31 in <listcomp>                 │
│                                                                                                  │
│   28                                                                                             │
│   29 def load_filepaths_and_text_type(filename, type, split="|"):                                │
│   30 │   with open(filename, encoding='utf-8') as f:                                             │
│ ❱ 31 │   │   filepaths_and_text = [list(line.strip().split(split)) + [type] for line in f]       │
│   32 │   │   base = os.path.dirname(filename)                                                    │
│   33 │   │   for j in range(len(filepaths_and_text)):                                            │
│   34 │   │   │   filepaths_and_text[j][0] = os.path.join(base, filepaths_and_text[j][0])         │
│                                                                                                  │
│ C:\Users\Ali\anaconda3\envs\tts-fast\lib\codecs.py:322 in decode                                 │
│                                                                                                  │
│    319 │   def decode(self, input, final=False):                                                 │
│    320 │   │   # decode input (taking the buffer into account)                                   │
│    321 │   │   data = self.buffer + input                                                        │
│ ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)                │
│    323 │   │   # keep undecoded input until the next call                                        │
│    324 │   │   self.buffer = data[consumed:]                                                     │
│    325 │   │   return result                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 10: invalid start byte

The dataset is from a former TalkNet dataset which is setup as LJ speech and so I can't understand the issue here the text file is in UTF8 - Any ideas?

Ok thank that. Isnt the check point flag just point to a pth file? Could I not just rename trained model to autoregressive.pth?

          Ok thank that. Isnt the check point flag just point to a pth file? Could I not just rename trained model to autoregressive.pth?

Originally posted by @gmantwo in #47 (comment)