152334h / dl-art-school
This project forked from neonbjb/dl-art-school
TorToiSe fine-tuning with DLAS
License: GNU Affero General Public License v3.0
I would like to check that this is the correct format for the paths to the wavs/transcripts:
datasets:
  train:
    name: harry
    n_workers: 8 # idk what this does
    batch_size: 32 # This leads to ~16GB of vram usage on my 3090.
    mode: paired_voice_audio
    path: H:/DL-Art-School/experiments/harry.zip
    fetcher_mode: ['lj'] # CHANGEME if your dataset isn't in LJSpeech format
    phase: train
    max_wav_length: 255995
    max_text_length: 200
    sample_rate: 22050
    load_conditioning: True
    num_conditioning_candidates: 2
    conditioning_length: 44000
    use_bpe_tokenizer: True
    load_aligned_codes: False
  val:
    name: harry
    n_workers: 1
    batch_size: 32 # this could be higher probably
    mode: paired_voice_audio
    path: H:/DL-Art-School/experiments/harry.txt
Should it point towards the zip file and text file directly like above?
I am getting this error after running python3 train.py -opt ../experiments/EXAMPLE_gpt.yml:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ H:\DL-Art-School\codes\train.py:369 in <module> │
│ │
│ 366 │ │ trainer.rank = torch.distributed.get_rank() │
│ 367 │ │ torch.cuda.set_device(torch.distributed.get_rank()) │
│ 368 │ │
│ ❱ 369 │ trainer.init(args.opt, opt, args.launcher) │
│ 370 │ trainer.do_training() │
│ 371 │
│ │
│ H:\DL-Art-School\codes\train.py:130 in init │
│ │
│ 127 │ │ │ │ │ self.logger.info('Total epochs needed: {:d} for iters {:,d}'.format( │
│ 128 │ │ │ │ │ │ self.total_epochs, total_iters)) │
│ 129 │ │ │ elif phase == 'val': │
│ ❱ 130 │ │ │ │ self.val_set, collate_fn = create_dataset(dataset_opt, return_collate=Tr │
│ 131 │ │ │ │ self.val_loader = create_dataloader(self.val_set, dataset_opt, opt, None │
│ 132 │ │ │ │ if self.rank <= 0: │
│ 133 │ │ │ │ │ self.logger.info('Number of val images in [{:s}]: {:d}'.format( │
│ │
│ H:\DL-Art-School\codes\data\__init__.py:107 in create_dataset │
│ │
│ 104 │ │ │ collate = C() │
│ 105 │ else: │
│ 106 │ │ raise NotImplementedError('Dataset [{:s}] is not recognized.'.format(mode)) │
│ ❱ 107 │ dataset = D(dataset_opt) │
│ 108 │ │
│ 109 │ if return_collate: │
│ 110 │ │ return dataset, collate │
│ │
│ H:\DL-Art-School\codes\data\audio\paired_voice_audio_dataset.py:156 in __init__ │
│ │
│ 153 │ │ │ │ fetcher_fn = load_voxpopuli │
│ 154 │ │ │ else: │
│ 155 │ │ │ │ raise NotImplementedError() │
│ ❱ 156 │ │ │ self.audiopaths_and_text.extend(fetcher_fn(p, type)) │
│ 157 │ │ self.text_cleaners = hparams.text_cleaners │
│ 158 │ │ self.sample_rate = hparams.sample_rate │
│ 159 │ │ random.seed(hparams.seed) │
│ │
│ H:\DL-Art-School\codes\models\audio\tts\tacotron2\taco_utils.py:31 in │
│ load_filepaths_and_text_type │
│ │
│ 28 │
│ 29 def load_filepaths_and_text_type(filename, type, split="|"): │
│ 30 │ with open(filename, encoding='utf-8') as f: │
│ ❱ 31 │ │ filepaths_and_text = [list(line.strip().split(split)) + [type] for line in f] │
│ 32 │ │ base = os.path.dirname(filename) │
│ 33 │ │ for j in range(len(filepaths_and_text)): │
│ 34 │ │ │ filepaths_and_text[j][0] = os.path.join(base, filepaths_and_text[j][0]) │
│ │
│ H:\DL-Art-School\codes\models\audio\tts\tacotron2\taco_utils.py:31 in <listcomp> │
│ │
│ 28 │
│ 29 def load_filepaths_and_text_type(filename, type, split="|"): │
│ 30 │ with open(filename, encoding='utf-8') as f: │
│ ❱ 31 │ │ filepaths_and_text = [list(line.strip().split(split)) + [type] for line in f] │
│ 32 │ │ base = os.path.dirname(filename) │
│ 33 │ │ for j in range(len(filepaths_and_text)): │
│ 34 │ │ │ filepaths_and_text[j][0] = os.path.join(base, filepaths_and_text[j][0]) │
│ │
│ C:\Users\Ali\anaconda3\envs\tts-fast\lib\codecs.py:322 in decode │
│ │
│ 319 │ def decode(self, input, final=False): │
│ 320 │ │ # decode input (taking the buffer into account) │
│ 321 │ │ data = self.buffer + input │
│ ❱ 322 │ │ (result, consumed) = self._buffer_decode(data, self.errors, final) │
│ 323 │ │ # keep undecoded input until the next call │
│ 324 │ │ self.buffer = data[consumed:] │
│ 325 │ │ return result │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 10: invalid start byte
The dataset comes from a former TalkNet dataset, which is set up in LJSpeech format, so I can't understand the issue here: the text file is in UTF-8. Any ideas?
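One way to narrow this down: the traceback shows load_filepaths_and_text_type opening the configured path as UTF-8 text, so any binary file handed to it (a .zip archive, for example) would fail in this way. The sketch below is a hypothetical helper and not part of DLAS; the names first_invalid_utf8_offset and describe_file are made up for illustration. It reports whether a file looks like a zip archive, or where its first undecodable byte sits.

```python
from pathlib import Path


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start


def describe_file(path):
    """Hypothetical check: classify a file before handing it to a text loader."""
    data = Path(path).read_bytes()
    # Zip archives start with the two magic bytes "PK".
    if data[:2] == b"PK":
        return "looks like a zip archive, not a text file"
    offset = first_invalid_utf8_offset(data)
    if offset is None:
        return "valid UTF-8"
    return f"invalid UTF-8 at byte offset {offset} (0x{data[offset]:02x})"
```

Running describe_file on both the train and val paths from the config should show which of the two the loader cannot read as text.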
Title
Two things I have discovered so far:
wav_lengths are supposed to be multiplied by self.mel_length_compression
return_latent are supposed to be subscripted with -2, not -1
I might just grab the definition from tortoise-tts instead.
3-02-21 09:05:37.187 - INFO: [epoch: 0, iter: 0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 8.0000e+00 megasamples: 8.0000e-06 iteration_rate: 3.6583e-01 loss_text_ce: 4.2470e+00 loss_mel_ce: 2.9319e+00 loss_gpt_total: 2.9744e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 8.0000e+00 percent_skipped_samples: 1.1111e-01 percent_conditioning_is_self: 8.8889e-01 gpt_conditioning_encoder: 4.6525e+00 gpt_gpt: 4.9838e+00 gpt_heads: 5.3694e+00
23-02-21 09:05:37.187 - INFO: Saving models and training states.
0%| | 0/1 [00:08<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ H:\DL-Art-School\codes\train.py:381 in <module> │
│ │
│ 378 │ │ torch.cuda.set_device(torch.distributed.get_rank()) │
│ 379 │ │
│ 380 │ trainer.init(args.opt, opt, args.launcher) │
│ ❱ 381 │ trainer.do_training() │
│ 382 │
│ │
│ H:\DL-Art-School\codes\train.py:336 in do_training │
│ │
│ 333 │ │ │ │
│ 334 │ │ │ _t = time() │
│ 335 │ │ │ for train_data in tq_ldr: │
│ ❱ 336 │ │ │ │ self.do_step(train_data) │
│ 337 │ │
│ 338 │ def create_training_generator(self, index): │
│ 339 │ │ self.logger.info('Start training from epoch: {:d}, iter: {:d}'.format(self.start │
│ │
│ H:\DL-Art-School\codes\train.py:263 in do_step │
│ │
│ 260 │ │ │ │ │ self.logger.info('Saving models and training states.') │
│ 261 │ │ │ │ else: │
│ 262 │ │ │ │ │ self.logger.info('Saving model.') │
│ ❱ 263 │ │ │ │ if opt['upgrades']['number_of_checkpoints_to_save'] > 0: │
│ 264 │ │ │ │ │ self.logger.info( │
│ 265 │ │ │ │ │ │ f"Leaving only {opt['upgrades']['number_of_checkpoints_to_save'] │
│ 266 │ │ │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: 'NoneType' object is not subscriptable
I'm using the latest version of DLAS but get this error on a small dataset. The previous version worked with few issues. What does this mean?
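A possible reading of the traceback: the failing line evaluates opt['upgrades']['number_of_checkpoints_to_save'], and the TypeError suggests that opt['upgrades'] came back as None, which would fit a config file written for the previous version that has no upgrades section. That is an assumption, not something the log confirms. The config dump in a later report on this page does show an upgrades section containing number_of_checkpoints_to_save and number_of_states_to_save, so adding those keys to the YAML may be enough. The sketch below shows a defensive lookup with a default; get_nested is a hypothetical helper, not a DLAS function.

```python
def get_nested(opt, keys, default=None):
    """Walk nested dicts, returning `default` when any level is missing or None."""
    current = opt
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
        if current is None:
            return default
    return current


# A config without an `upgrades` section falls back to 0 instead of raising.
old_style_opt = {"name": "harry"}
keep = get_nested(old_style_opt, ["upgrades", "number_of_checkpoints_to_save"], 0)
print(keep)
```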
This is a great project. Can I use it with normal tortoise, or does it require tortoise-fast?
Apart from the GPT model (which has been implemented), there are 4 other models in TorToiSe that could be fine-tuned:
IMO, the diffusion model + vocoder are obvious targets. Vocoders are often fine-tuned in other tts pipelines, and the diffusion model serves roughly the same purpose...
...but, the diffusion model is the only other model that takes the conditioning latents into account. I suspect that fine-tuning both the autoregressive & diffuser models on a single speaker would lead to a kind of 'mode collapse' (bear with this inaccurate phrasing), where the conditioning latents fail to affect the output speech substantially. Ideally, some form of mixed speaker training would account for this, but I'm not sure how to accomplish that yet.
Training the VQVAE could be good for datasets that are emotional, and substantially different from the normal LJSpeech+LibriTTS+commonvoice+voxpopuli+... pile of monotonic speech. But I think it would necessitate a parallel training of the GPT model + the CLVP model as well, to account for the change in tokens outputted.
I also think that keeping the CLVP model untrained could be a good idea to retain the power of conditioning latents. Fine-tuning it on a single voice would adjust it to see that specific speaker as more likely than other speakers.
What batch size do you use on 24 GB of vram?
Hi,
Thank you for your work.
I was able to follow the steps given in Readme.md file and start training the autoregressive.pth. But, my training crashed mid-way due to lack of server space. Therefore, I would like to resume training from the latest epoch.
My question is: should I just set the 'resume_state' value to point to the latest training state file,
OR
should I also change 'pretrain_model_gpt' to point to the latest epoch PTH file rather than the downloaded autoregressive.pth?
Any idea why this is happening?
===============================================
CUDA SETUP: Loading binary C:\Users\oldgu\miniconda3\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
23-03-22 20:25:03.393 - INFO: name: Test
model: extensibletrainer
scale: 1
gpu_ids: [0]
start_step: -1
checkpointing_enabled: True
fp16: False
use_8bit: True
wandb: False
use_tb_logger: True
datasets:[
train:[
name: Test
n_workers: 8
batch_size: 1
mode: paired_voice_audio
path: C:/Users/oldgu/ozen-toolkit/output/Vincent_AGraveyardofGhostTales.wav_2023_03_22-20_05\train.txt
fetcher_mode: ['lj']
phase: train
max_wav_length: 255995
max_text_length: 200
sample_rate: 22050
load_conditioning: True
num_conditioning_candidates: 2
conditioning_length: 44000
use_bpe_tokenizer: True
load_aligned_codes: False
data_type: img
]
val:[
name: Test
n_workers: 1
batch_size: 1
mode: paired_voice_audio
path: C:/Users/oldgu/ozen-toolkit/output/Vincent_AGraveyardofGhostTales.wav_2023_03_22-20_05\valid.txt
fetcher_mode: ['lj']
phase: val
max_wav_length: 255995
max_text_length: 200
sample_rate: 22050
load_conditioning: True
num_conditioning_candidates: 2
conditioning_length: 44000
use_bpe_tokenizer: True
load_aligned_codes: False
data_type: img
]
]
steps:[
gpt_train:[
training: gpt
loss_log_buffer: 500
optimizer: adamw
optimizer_params:[
lr: 1e-05
triton: False
weight_decay: 0.01
beta1: 0.9
beta2: 0.96
]
clip_grad_eps: 4
injectors:[
paired_to_mel:[
type: torch_mel_spectrogram
mel_norm_file: ../experiments/clips_mel_norms.pth
in: wav
out: paired_mel
]
paired_cond_to_mel:[
type: for_each
subtype: torch_mel_spectrogram
mel_norm_file: ../experiments/clips_mel_norms.pth
in: conditioning
out: paired_conditioning_mel
]
to_codes:[
type: discrete_token
in: paired_mel
out: paired_mel_codes
dvae_config: ../experiments/train_diffusion_vocoder_22k_level.yml
]
paired_fwd_text:[
type: generator
generator: gpt
in: ['paired_conditioning_mel', 'padded_text', 'text_lengths', 'paired_mel_codes', 'wav_lengths']
out: ['loss_text_ce', 'loss_mel_ce', 'logits']
]
]
losses:[
text_ce:[
type: direct
weight: 0.01
key: loss_text_ce
]
mel_ce:[
type: direct
weight: 1
key: loss_mel_ce
]
]
]
]
networks:[
gpt:[
type: generator
which_model_G: unified_voice2
kwargs:[
layers: 30
model_dim: 1024
heads: 16
max_text_tokens: 402
max_mel_tokens: 604
max_conditioning_inputs: 2
mel_length_compression: 1024
number_text_tokens: 256
number_mel_codes: 8194
start_mel_token: 8192
stop_mel_token: 8193
start_text_token: 255
train_solo_embeddings: False
use_mel_codes_as_input: True
checkpointing: True
]
]
]
path:[
pretrain_model_gpt: ../experiments/autoregressive.pth
strict_load: True
root: C:\Users\oldgu\DL-Art-School
experiments_root: C:\Users\oldgu\DL-Art-School\experiments\Test
models: C:\Users\oldgu\DL-Art-School\experiments\Test\models
training_state: C:\Users\oldgu\DL-Art-School\experiments\Test\training_state
log: C:\Users\oldgu\DL-Art-School\experiments\Test
val_images: C:\Users\oldgu\DL-Art-School\experiments\Test\val_images
]
train:[
niter: 200
warmup_iter: -1
mega_batch_factor: 1
val_freq: 500
default_lr_scheme: MultiStepLR
gen_lr_steps: [100, 200, 280, 360]
lr_gamma: 0.5
ema_enabled: False
manual_seed: 1337
]
eval:[
output_state: gen
injectors:[
gen_inj_eval:[
type: generator
generator: generator
in: hq
out: ['gen', 'codebook_commitment_loss']
]
]
]
logger:[
print_freq: 10
save_checkpoint_freq: 10
visuals: ['gen', 'mel']
visual_debug_rate: 500
is_mel_spectrogram: True
disable_state_saving: False
]
upgrades:[
number_of_checkpoints_to_save: 0
number_of_states_to_save: 0
]
is_train: True
dist: False
23-03-22 20:25:03.538 - INFO: Random seed: 1337
Traceback (most recent call last):
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 398, in <module>
trainer.init(args.opt, opt, args.launcher)
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 121, in init
self.total_epochs = int(math.ceil(total_iters / train_size))
ZeroDivisionError: division by zero
Press any key to continue . . .
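The failing line divides total_iters by train_size, so train_size must have been 0 here. How DLAS computes train_size is not shown in the log; the sketch below assumes it is the number of batches per epoch, ceil(samples / batch_size), in which case it can only be 0 when the training set loaded zero samples (for example an empty train.txt, or every clip being filtered out). The function name and the error message are hypothetical.

```python
import math


def total_epochs(total_iters, num_samples, batch_size):
    """Assumed arithmetic: epochs needed to reach `total_iters` iterations."""
    # Assumption: one epoch is ceil(num_samples / batch_size) iterations.
    train_size = math.ceil(num_samples / batch_size)
    if train_size == 0:
        raise ValueError("training set is empty; check the path and its contents")
    return math.ceil(total_iters / train_size)


# With niter: 200 and batch_size: 1 from the dump, 50 samples would need 4 epochs.
print(total_epochs(200, 50, 1))
```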
I'm getting the following error when starting the training:
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.55 GiB (GPU 0; 23.99 GiB total capacity; 20.44 GiB already allocated; 0 bytes free; 22.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Press any key to continue . . .
Training batch size is 188
validation batch size is 48
Training settings, 500
nothing else changed.
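The error message itself points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. The sketch below shows one way to set that environment variable from Python; it has to happen before CUDA is initialized, and the value 128 is an arbitrary illustration rather than a recommendation. This only addresses fragmentation. An earlier config on this page notes that batch_size 32 used about 16GB on a 3090, so a batch size of 188 may simply not fit in 24 GB, and lowering it is probably the first thing to try.

```python
import os


def configure_allocator(max_split_size_mb=128):
    """Set the allocator option named in the error message; call before using CUDA."""
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


print(configure_allocator())
```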
I installed the UI using the .setup DLAS.bat script and things were going well...followed along with the video tutorial and created a dataset, but when I click Start Training, it gives me a traceback error:
Traceback (most recent call last):
File "C:\Users\oldgu\DL-Art-School\codes\train.py", line 12, in <module>
from data.data_sampler import DistIterSampler
File "C:\Users\oldgu\DL-Art-School\codes\data\__init__.py", line 6, in <module>
from utils.util import opt_get
File "C:\Users\oldgu\DL-Art-School\codes\utils\util.py", line 25, in <module>
from torch._six import inf
ModuleNotFoundError: No module named 'torch._six'
A quick look around and it looks like that module is deprecated? Is this a pytorch 2.0 issue or a me issue? Do I need to downgrade pytorch? I'm afraid it would break the whole thing.
I'm running Windows 11 with Miniconda 3
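For what it's worth, torch._six is a private module that newer PyTorch releases no longer ship, so the import in utils/util.py fails there. A minimal sketch of a fallback import, assuming that inf is the only name needed from that module (which is all the traceback shows):

```python
try:
    # Works on older PyTorch versions that still have the private module.
    from torch._six import inf
except ImportError:
    # ModuleNotFoundError is a subclass of ImportError; math.inf is the same float value.
    from math import inf

print(inf > 1e308)
```

Whether other parts of the code base depend on removed PyTorch internals is not known from this report, so pinning an older PyTorch version may still be the safer route.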
Since training generates huge files, and you can only link one (I'm guessing the last one created) can the rest be deleted to save space? I'm new to this, but I'm assuming each subsequent .pth file is "better" than the one before it?
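If older checkpoints do turn out to be disposable, a small script can list which files would go. The sketch below is a hypothetical helper, not part of DLAS: it sorts the .pth files in a folder by modification time and returns everything except the newest few, without deleting anything. It assumes the newest file is the one to keep, which the question itself is unsure about, so check the validation results before removing files.

```python
from pathlib import Path


def checkpoints_to_prune(models_dir, keep_last=2, pattern="*.pth"):
    """Return checkpoint paths older than the newest `keep_last`, oldest first."""
    if keep_last <= 0:
        # Refuse to propose deleting everything.
        return []
    files = sorted(Path(models_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[:-keep_last]
```

Usage would look like `for path in checkpoints_to_prune("experiments/Test/models"): print(path)`, followed by a manual review before any `path.unlink()` call.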
Hi, when I try to train, it throws an error:
C:\Users\PC\anaconda3\envs\DLAS\lib\site-packages\torch\optim\lr_scheduler.py:138:
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`.
In PyTorch 1.1.0 and later, you should call them in the opposite order:
`optimizer.step()` before `lr_scheduler.step()`.
Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.
See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Error no kernel image is available for execution on the device at line 167 in file
D:\ai\tool\bitsandbytes\csrc\ops.cu
Press any key to continue
I thought the wrong CUDA version might be installed, but I checked and my 1070 seems to be compatible with CUDA 8+.
Anyone encountered this error and found a way to fix it?
Thanks.
A couple of questions about seeds and candidates.
From my understanding, seeds are random, and candidates are ordered from best to worst.
Is there a range of seed numbers? Or can it just be any number?
Is there a way to, say, have the playground spit out a number of candidates where each uses a different seed? That would be helpful when trying to find the right pronunciation or inflection.
Unless there is some other way to request emphasis (like HTML tags or italics or something).
Or does the Voice Directory do all the work on these things?
Hi,
It's nice that you are using PyCharm (I'm using it too), but your settings conflict with mine. Every time I want to make a change (like now, for the changes merged yesterday that disable state generation, and my changes that save only the last X checkpoints and last X states), I have to remove your settings, add my own, and reload the PyCharm virtual environment settings, or else fix conflicts. It's very annoying.
It would be nice to remove the .idea/ folders from the repository (there are two: one in the main folder and one in the codes subfolder) and add them to .gitignore so each of us can use our own settings (this folder shouldn't be in the repository at all because it's not related to the project, so you know... :) ).
I can create a pull request for it if it helps.
Thank you for your help :)
As of this morning, after the update, training exits at the start with the message: ModuleNotFoundError: No module named 'lion_pytorch'
Double checked, lion_pytorch exists under Python.
As a sister project of tortoise-tts-fast, it would be great if the performance of training code could be improved as well.
Things to investigate:
Could you replace the autoregressive.pth file with a custom file (still named autoregressive.pth) that was trained only on speakers with a specific accent (Southern, or German for example)... and then train an accented speaker against that? Or would it break.
The problem is the original custom file would have been trained against the original autoregressive file.
It seems then the processing doesn't "understand" the accent, it just makes it sound British. At best, accents are smoothed out, like an average between American English and whatever accent is applied...trying to get a thick accent on a few voices.
Whatever I do, I always get this error.
Can anyone give me a hand?
RecursionError: maximum recursion depth exceeded while calling a Python object
Press any key to continue . . .
Removing Triton from the end of the 'requirements.laxed_edited.txt' file
and manually installing triton using
pip install -U --pre triton
allows the program to function
OK, thanks for that. Doesn't the checkpoint flag just point to a .pth file? Could I not just rename the trained model to autoregressive.pth?
Originally posted by @gmantwo in #47 (comment)
Where can I change the path to save the model? I'm training on Colab and I want it to save on Google Drive because when the environment disconnects, I lose all the checkpoints
this will definitely be taken down at some point; open an issue with a backup link and i'll replace it
https://huggingface.co/Gatozu35/tortoise-tts/resolve/main/dvae.pth
Getting this error when I start training:
Disabled distributed training.
Path already exists. Rename it to [C:\Users\Yuri\DL-Art-School\experiments_archived_230428-154734]
Traceback (most recent call last):
File "C:\Users\Yuri\DL-Art-School\codes\train.py", line 380, in <module>
trainer.init(args.opt, opt, args.launcher)
File "C:\Users\Yuri\DL-Art-School\codes\train.py", line 51, in init
util.mkdir_and_rename(
File "C:\Users\Yuri\DL-Art-School\codes\utils\util.py", line 112, in mkdir_and_rename
os.rename(path, new_name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Yuri\DL-Art-School\experiments\' -> 'C:\Users\Yuri\DL-Art-School\experiments\_archived_230428-154734'
Press any key to continue . . .
I tried rolling back out of BitsandBytes and a handful of other things, but nothing is working.
Yesterday and this morning I was troubleshooting a training issue, but nonetheless DLAS was working. This afternoon I saw an update in the Windows GUI, so I applied it. Ever since then, I've been running into this issue:
Environment name is set as "DLAS" as per environment.yaml
anaconda3/miniconda3 detected in C:\ProgramData\miniconda3
Starting conda environment "DLAS" from C:\ProgramData\miniconda3
Latest git hash: 43f445d
Disabled distributed training.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('C')}
warn(
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:93: UserWarning: C:\Users\james\.conda\envs\DLAS did not contain libcudart.so as expected! Searching further paths...
warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cuda_setup\paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {WindowsPath('/usr/local/cuda/lib64')}
warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so...
Traceback (most recent call last):
File "C:\Users\james\Desktop\DL-Art-School\codes\train.py", line 386, in <module>
trainer.init(args.opt, opt, args.launcher)
File "C:\Users\james\Desktop\DL-Art-School\codes\train.py", line 38, in init
maybe_bnb.populate()
File "C:\Users\james\Desktop\DL-Art-School\codes\maybe_bnb.py", line 15, in populate
import bitsandbytes as bnb
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\__init__.py", line 6, in <module>
from .autograd._functions import (
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\autograd\_functions.py", line 5, in <module>
import bitsandbytes.functional as F
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\functional.py", line 13, in <module>
from .cextension import COMPILED_WITH_CUDA, lib
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 41, in <module>
lib = CUDALibrary_Singleton.get_instance().lib
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 37, in get_instance
cls._instance.initialize()
File "C:\Users\james\.conda\envs\DLAS\lib\site-packages\bitsandbytes\cextension.py", line 31, in initialize
self.lib = ct.cdll.LoadLibrary(binary_path)
File "C:\Users\james\.conda\envs\DLAS\lib\ctypes\__init__.py", line 452, in LoadLibrary
return self._dlltype(name)
File "C:\Users\james\.conda\envs\DLAS\lib\ctypes\__init__.py", line 364, in __init__
if '/' in name or '\\' in name:
TypeError: argument of type 'WindowsPath' is not iterable
Press any key to continue . . .
Initially I thought maybe it was CUDA, Miniconda, or Python since I had so many different versions installed and probably broken libraries/packages. I uninstalled everything, started with a clean slate, and I still get this error. The longer I look into it, the more it seems to be related to 'bitsandbytes' given the stack trace and the commit history in the last push showing that it was recently added
Reverting back to a previous commit works:
git checkout 83b901c656447126d5a0877639d394335204e1ac
This is Windows 10, Python 10, CUDA 11.7.
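The last frames of the traceback show ctypes running `'/' in name` on a WindowsPath object, which fails because path objects do not support the `in` operator. The sketch below reproduces that behaviour with PureWindowsPath (so it runs on any OS) and shows that converting the path to a string avoids the error. The path is a shortened, hypothetical stand-in, and whether wrapping binary_path in str() is the right fix inside bitsandbytes is an assumption.

```python
from pathlib import PureWindowsPath


def has_path_separator(name):
    # Same membership test that appears in the ctypes frame of the traceback.
    return "/" in name or "\\" in name


def check(name):
    """Run the membership test and report a TypeError instead of raising it."""
    try:
        return has_path_separator(name)
    except TypeError as err:
        return f"TypeError: {err}"


# Hypothetical library location, shortened for the example.
binary_path = PureWindowsPath(r"C:\envs\DLAS\bitsandbytes\libbitsandbytes_cpu.so")

print(check(binary_path))       # path object: the membership test fails
print(check(str(binary_path)))  # plain string: the test works
```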
Set triton: true and optimizer: lion in conf to repro.
What should a dataset for Multispeaker look like?
Should each speaker have an identifier at the end, for example:
wavs/1.wav|transcription.
or
wavs/1.wav|transcription.|1
or
wavs/1.wav|transcription.|speaker_name
Hello,
I want to train another language, after a few hours of training it is average, so I would need to change "xxx_cleaners". I understand that by default it uses "english_cleaners". It is used in several files, where can I change it to make the change effective? Can I add an option to change the cleaners in the "EXAMPLE_gpt.yml" file? Another problem arises because there is no "xxx_cleaners" in the synthesis anywhere, which limits the synthesis due to the lack of symbols from another language.
I'm having awesome results with fine-tuning datasets, but I am running into a couple issues:
Hi, thanks very much for providing this amazing repo!
I have tried a little bit of fine-tuning the model, and ran the tortoise_tts.py script using the command below
./scripts/tortoise_tts.py --preset fast --ar_checkpoint $model_path -o test.wav --text "Hello, how are you?"
I got the error below:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ps/Documents/voice-test/voice/./tortoise-tts-fast/scripts/tortoise_tts.py:240 in <module> │
│ │
│ 237 │ # app = import_module("app") │
│ 238 │ # sys.exit(app.main()) │
│ 239 │ │
│ ❱ 240 │ from tortoise.inference import ( │
│ 241 │ │ check_pydub, │
│ 242 │ │ get_all_voices, │
│ 243 │ │ get_seed, │
│ │
│ /home/ps/Documents/voice-test/voice/tortoise-tts-fast/tortoise/inference.py:167 in <module> │
│ │
│ 164 │
│ 165 from voicefixer import VoiceFixer │
│ 166 │
│ ❱ 167 vfixer = VoiceFixer() │
│ 168 │
│ 169 │
│ 170 def save_gen_with_voicefix(g, fpath, squeeze=True, voicefixer=True): │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/base.py:13 in │
│ __init__ │
│ │
│ 10 class VoiceFixer(nn.Module): │
│ 11 │ def __init__(self): │
│ 12 │ │ super(VoiceFixer, self).__init__() │
│ ❱ 13 │ │ self._model = voicefixer_fe(channels=2, sample_rate=44100) │
│ 14 │ │ # print(os.path.join(os.path.expanduser('~'), ".cache/voicefixer/analysis_module │
│ 15 │ │ self.analysis_module_ckpt = os.path.join( │
│ 16 │ │ │ │ │ os.path.expanduser("~"), │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/restorer/model.py:1 │
│ 80 in __init__ │
│ │
│ 177 │ │ # self.am = AudioMetrics() │
│ 178 │ │ # self.im = ImgMetrics() │
│ 179 │ │ │
│ ❱ 180 │ │ self.vocoder = Vocoder(sample_rate=44100) │
│ 181 │ │ │
│ 182 │ │ self.valid = None │
│ 183 │ │ self.fake = None │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/base.py:19 │
│ in __init__ │
│ │
│ 16 │ │ │ raise RuntimeError("Error 1: The checkpoint for synthesis module / vocoder ( │
│ 17 │ │ │ │ │ │ │ │ By default the checkpoint should be download automatical │
│ 18 │ │ │ │ │ │ │ │ But don't worry! Alternatively you can download it direc │
│ ❱ 19 │ │ self._load_pretrain(Config.ckpt) │
│ 20 │ │ self.weight_torch = Config.get_mel_weight_torch(percent=1.0)[ │
│ 21 │ │ │ None, None, None, ... │
│ 22 │ │ ] │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/base.py:26 │
│ in _load_pretrain │
│ │
│ 23 │ │
│ 24 │ def _load_pretrain(self, pth): │
│ 25 │ │ self.model = Generator(Config.cin_channels) │
│ ❱ 26 │ │ checkpoint = load_checkpoint(pth, torch.device("cpu")) │
│ 27 │ │ load_try(checkpoint["generator"], self.model) │
│ 28 │ │ self.model.eval() │
│ 29 │ │ self.model.remove_weight_norm() │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/voicefixer/vocoder/model/util. │
│ py:111 in load_checkpoint │
│ │
│ 108 │
│ 109 │
│ 110 def load_checkpoint(checkpoint_path, device): │
│ ❱ 111 │ checkpoint = torch.load(checkpoint_path, map_location=device) │
│ 112 │ return checkpoint │
│ 113 │
│ 114 │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/torch/serialization.py:777 in │
│ load │
│ │
│ 774 │ │ │ # If we want to actually tail call to torch.jit.load, we need to │
│ 775 │ │ │ # reset back to the original position. │
│ 776 │ │ │ orig_position = opened_file.tell() │
│ ❱ 777 │ │ │ with _open_zipfile_reader(opened_file) as opened_zipfile: │
│ 778 │ │ │ │ if _is_torchscript_zip(opened_zipfile): │
│ 779 │ │ │ │ │ warnings.warn("'torch.load' received a zip file that looks like a To │
│ 780 │ │ │ │ │ │ │ │ " dispatching to 'torch.jit.load' (call 'torch.jit.loa │
│ │
│ /home/ps/Documents/voice-test/envs/voice/lib/python3.10/site-packages/torch/serialization.py:282 in │
│ __init__ │
│ │
│ 279 │
│ 280 class _open_zipfile_reader(_opener): │
│ 281 │ def __init__(self, name_or_buffer) -> None: │
│ ❱ 282 │ │ super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_bu │
│ 283 │
│ 284 │
│ 285 class _open_zipfile_writer_file(_opener): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Just wondering what the issue was. The first time I tried, I saw the script trying to download some models, but I had already been running tortoise for some time and the .cache/tortoise/models folder had the models, so I cancelled the run. When I executed the command again, it skipped the download step, so maybe the file had already been created but was incomplete? If that's the case, where did the script download the additional models to? Thanks very much for looking into my issue.
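An interrupted download would fit the error: checkpoints saved in PyTorch's zip-based format need an intact end-of-archive record, and "failed finding central directory" is what a truncated file tends to produce. The sketch below is a hypothetical helper that scans a folder for checkpoint files that are not valid zip archives. The commented-out line in the VoiceFixer frame hints at a cache under ~/.cache/voicefixer, but that location is an inference from the traceback, not something confirmed here. Checkpoints saved in the older non-zip format would also be flagged, so treat the result as a list of candidates to re-download, not as proof of corruption.

```python
import zipfile
from pathlib import Path


def find_suspect_checkpoints(cache_dir, suffixes=(".pth", ".pt", ".ckpt")):
    """List checkpoint-like files under `cache_dir` that are not valid zip archives."""
    suspects = []
    for path in sorted(Path(cache_dir).rglob("*")):
        if path.is_file() and path.suffix in suffixes and not zipfile.is_zipfile(path):
            suspects.append(path)
    return suspects
```

Deleting a flagged file and rerunning the script should trigger a fresh download, assuming the library re-fetches missing files as the first run suggested.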
Notable problem of the base tortoise model: it does not handle *emphasis* or special symbols like + or = or & elegantly. Lines like:
8 + 5 = 13
him & his wife
I *really* do not like this
all perform oddly poorly.
I am not sure if fine-tuning could fix this. Personally, I think you would need to at least fine-tune both the CLVP && GPT model to fix this.
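Short of fine-tuning, one workaround is to rewrite such symbols into words before the text reaches the model. The sketch below is a hypothetical preprocessing step, not part of TorToiSe or DLAS; the symbol table is an assumption and only covers the three examples above. It also drops the asterisks, so the emphasis itself is lost and only the odd rendering of the marker is avoided.

```python
import re

# Hypothetical mapping; extend as needed for other symbols.
SYMBOL_WORDS = {"+": "plus", "=": "equals", "&": "and"}


def spell_out_symbols(text):
    """Replace a few symbols with words and strip *emphasis* markers."""
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # Collapse the extra spaces introduced by the replacements.
    return re.sub(r"\s+", " ", text).strip()


print(spell_out_symbols("8 + 5 = 13"))                   # 8 plus 5 equals 13
print(spell_out_symbols("him & his wife"))               # him and his wife
print(spell_out_symbols("I *really* do not like this"))  # I really do not like this
```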
Hi @152334H is there a way to finetune on single speaker but keep the model's zero shot capacity on other speakers as well?
Right now, the process of setting up config ymls and datasets and so on is fairly complicated.
It would be a lot easier for the general public to use if,
I am not familiar with colab notebooks (hard nvim user), so I cannot help much with this.
We should add the Colab notebook to the repository so that we have a history of changes to the notebook and can add additional fields to it (we don't have access to the shared Colab notebook).
I can create a PR with an improved Colab notebook.
I get that error when trying to create voice/text in the playground, when I click generate nothing happens and that error is in the cli. I've already deleted the tortoise-tts-fast folder so it could be installed again, didn't fix it.
EDIT:
could be related, this is during the install of tortoise
× Running setup.py install for pesq did not run successfully.
│ exit code: 1
╰─> [25 lines of output]
C:\Users\Dominik\anaconda3\envs\DLAS\lib\site-packages\setuptools\__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated. Requirements should be satisfied by a PEP 517 installer. If you are using pip, you can try `pip install --use-pep517`.
dist.fetch_build_eggs(dist.setup_requires)
running install
C:\Users\Dominik\anaconda3\envs\DLAS\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-310
creating build\lib.win-amd64-cpython-310\pesq
copying pesq\_pesq.py -> build\lib.win-amd64-cpython-310\pesq
copying pesq\__init__.py -> build\lib.win-amd64-cpython-310\pesq
copying pesq\cypesq.pyx -> build\lib.win-amd64-cpython-310\pesq
copying pesq\dsp.h -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesq.h -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesqio.h -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesqmain.h -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesqpar.h -> build\lib.win-amd64-cpython-310\pesq
copying pesq\dsp.c -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesqdsp.c -> build\lib.win-amd64-cpython-310\pesq
copying pesq\pesqmod.c -> build\lib.win-amd64-cpython-310\pesq
running build_ext
skipping 'pesq\cypesq.c' Cython extension (up-to-date)
building 'cypesq' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> pesq
note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
Once a model is trained, can you add to it, or do you just need to rebuild the dataset with new entries and then retrain the whole thing?
It seems that when I train the model, the training losses loss_text_ce and loss_mel_ce_loss, as well as the validation loss val_loss_text_ce, trend down as expected, but the other validation loss, val_loss_mel_ce, trends up.
The training and validation sets are quite similar. The dataset is small, though: about 100 wavs, each at most 8 seconds long.
Any idea what might cause this? Below is my experiment with val_text_ce loss and val_mel_ce loss.
Thanks,
Getting an error on Windows when running Startup DLAS. The installer proceeds afterwards and says it completed, but then I am missing many dependencies when I try to run "Start DLAS". I can get past them by manually installing each one, getting into the UI, and starting training, but I eventually get an error where it cannot import FixedPositionalEmbedding from x_transformers.
Error is:
Collecting bitsandbytes
Using cached bitsandbytes-0.37.0-py3-none-any.whl (76.3 MB)
Collecting lion-pytorch==0.0.7
Using cached lion_pytorch-0.0.7-py3-none-any.whl (4.3 kB)
Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.0rc1 Requires-Python >=3.7,<3.10; 1.7.0rc2 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement triton==2.0.0a2 (from versions: none)
ERROR: No matching distribution found for triton==2.0.0a2
failed
CondaEnvException: Pip failed
Fixing Tensorboard...
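The first ERROR line in the log above says that some candidate versions were ignored because they declare Requires-Python >=3.7,<3.10. A minimal sketch of that range check, so the running interpreter can be compared against it, is below. The function name in_supported_range is hypothetical, and the triton==2.0.0a2 failure ("from versions: none") is a separate error that this check does not explain.

```python
import sys


def in_supported_range(version, minimum=(3, 7), upper_bound=(3, 10)):
    """Return True if (major, minor) satisfies >=3.7,<3.10 (the range from the log)."""
    return minimum <= tuple(version[:2]) < upper_bound


# Report the interpreter that is currently running this script.
print(sys.version_info[:2], in_supported_range(sys.version_info))
```

If this prints False inside the DLAS environment, pip will skip every release that declares that range.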
This is really a great effort and help.
Let's start from the EXAMPLE_gpt.yml file.
There's a parameter named num_conditioning_candidates under datasets.train && datasets.val. It (ostensibly) determines the number of conditioning wav files that are piped to the gpt model during training.
How does it work? In paired_voice_audio_dataset.py, there's a line in __getitem__ that grabs num_conditioning_candidates similar clips from the current dataset:
cond, cond_is_self = load_similar_clips(self.audiopaths_and_text[index][0], self.conditioning_length, self.sample_rate,
n=self.conditioning_candidates) if self.load_conditioning else (None, False)
Or at least, that's what it's supposed to do in theory. In practice, it ignores the value of n passed, because of the lack of a file named similarities.pth:
def load_similar_clips(path, sample_length, sample_rate, n=3, fallback_to_self=True):
    sim_path = os.path.join(os.path.dirname(path), 'similarities.pth')
    candidates = []
    if os.path.exists(sim_path):  # false here (no similarities.pth), so this branch is skipped
        ...  # body elided in this excerpt
    if len(candidates) == 0:  # true
        if fallback_to_self:  # true
            candidates = [path]  # ONLY 1 CANDIDATE USED
    #...
(I also printed out the values for cond_is_self and verified that it's true right now.)
The similarities.pth file(s?) are supposed to be generated by a script titled phase_3_generate_similarities.py. I saw the preparation scripts earlier, but to be real I have no idea how to use them yet.
This could plausibly lead to the cheater latents problem on larger datasets. It also probably reduces the zero-shot VC capabilities.
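The fallback described above can be reproduced in isolation. The sketch below is a simplified stand-in, not the DLAS function: candidate_pool is a hypothetical name, it returns only the pool of candidate paths, and it does not model the audio loading that follows the `#...` in the excerpt. It assumes, as in the excerpt, that the pool stays empty unless similarities.pth exists next to the clip.

```python
import os
import tempfile


def candidate_pool(path, n=3, fallback_to_self=True):
    """Hypothetical stand-in: return (candidate paths, cond_is_self) for one clip."""
    sim_path = os.path.join(os.path.dirname(path), "similarities.pth")
    candidates = []
    if os.path.exists(sim_path):
        # The real function would fill `candidates` from this file.
        # That logic is elided in the excerpt, so it is not reproduced here.
        pass
    if len(candidates) == 0 and fallback_to_self:
        candidates = [path]  # the clip ends up conditioning on itself
    cond_is_self = candidates == [path]
    return candidates, cond_is_self


with tempfile.TemporaryDirectory() as d:
    wav = os.path.join(d, "clip_0001.wav")  # hypothetical file name; never opened
    for n in (1, 2, 5):
        pool, cond_is_self = candidate_pool(wav, n=n)
        print(n, len(pool), cond_is_self)
```

With no similarities.pth in the directory, the pool has a single entry for every value of n, which matches the cond_is_self observation above.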
Is there a discussion board somewhere where people talk about the "best" training settings, or at least what worked for them?
For example: how many steps are optimal, and with which batch sizes? I have around 40 minutes of audio as a sample, which I will first run through ozone to get the dataset, but then the question is what settings to use for a dataset like that to get a perfect/good outcome.
Hello!
Thanks for the great work! I was trying to fine-tune a model using non-English datasets (Russian, etc.). The resulting voice is really good, but I keep getting results with a super strong English accent, even after long training. Are there any possible ways to reduce the accent (or ideally get rid of it)?
I guess the problem comes from fine-tuning on top of the English model.
The numbers written in ./experiments/EXAMPLE_gpt.yml
were picked completely at random! It is very likely the numbers can be better, so long as people are willing to test and see what works.
Please post results here if you change any of the parameters, even if it completely fails!
Sorry to be here again.
I have a 3070 8GB
Now my dataset is fine. I keep getting CUDA errors. I've identified 3 places in the yml I can edit to reduce batch sizes, but even setting them to 1 gets me an error.
I've also tried changing mega_batch_factor: as your notes suggest.
I tried a much smaller dataset of 600 wav files.
I get this:
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ H:\DL-Art-School\codes\train.py:370 in <module> │
│ │
│ 367 │ │ torch.cuda.set_device(torch.distributed.get_rank()) │
│ 368 │ │
│ 369 │ trainer.init(args.opt, opt, args.launcher) │
│ ❱ 370 │ trainer.do_training() │
│ 371 │
│ │
│ H:\DL-Art-School\codes\train.py:325 in do_training │
│ │
│ 322 │ │ │ │
│ 323 │ │ │ _t = time() │
│ 324 │ │ │ for train_data in tq_ldr: │
│ ❱ 325 │ │ │ │ self.do_step(train_data) │
│ 326 │ │
│ 327 │ def create_training_generator(self, index): │
│ 328 │ │ self.logger.info('Start training from epoch: {:d}, iter: {:d}'.format(self.start │
│ │
│ H:\DL-Art-School\codes\train.py:206 in do_step │
│ │
│ 203 │ │ │ print("Update LR: %f" % (time() - _t)) │
│ 204 │ │ _t = time() │
│ 205 │ │ self.model.feed_data(train_data, self.current_step) │
│ ❱ 206 │ │ gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_g │
│ 207 │ │ iteration_rate = (time() - _t) / batch_size │
│ 208 │ │ if self._profile: │
│ 209 │ │ │ print("Model feed + step: %f" % (time() - _t)) │
│ │
│ H:\DL-Art-School\codes\trainer\ExtensibleTrainer.py:302 in optimize_parameters │
│ │
│ 299 │ │ │ new_states = {} │
│ 300 │ │ │ self.batch_size_optimizer.focus(net) │
│ 301 │ │ │ for m in range(self.batch_factor): │
│ ❱ 302 │ │ │ │ ns = step.do_forward_backward(state, m, step_num, train=train_step, no_d │
│ 303 │ │ │ │ # Call into post-backward hooks. │
│ 304 │ │ │ │ for name, net in self.networks.items(): │
│ 305 │ │ │ │ │ if hasattr(net.module, "after_backward"): │
│ │
│ H:\DL-Art-School\codes\trainer\steps.py:214 in do_forward_backward │
│ │
│ 211 │ │ local_state = {} # <-- Will store the entire local state to be passed to inject │
│ 212 │ │ new_state = {} # <-- Will store state values created by this step for returning │
│ 213 │ │ for k, v in state.items(): │
│ ❱ 214 │ │ │ local_state[k] = v[grad_accum_step] │
│ 215 │ │ local_state['train_nets'] = str(self.get_networks_trained()) │
│ 216 │ │ loss_accumulator = self.loss_accumulator if loss_accumulator is None else loss_a │
│ 217 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range
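One plausible reading of the traceback above: the failing line indexes a per-key list with grad_accum_step inside a loop over self.batch_factor, so the list must hold at least that many entries. The sketch below assumes (this is not confirmed by the traceback) that each batch is split into mega_batch_factor chunks with torch.chunk-like behaviour, where a batch smaller than the factor yields fewer chunks. The helper names split_into_chunks and run_accumulation_steps are hypothetical, and plain lists stand in for tensors.

```python
def split_into_chunks(batch, num_chunks):
    """Split a batch into at most num_chunks contiguous pieces of ceil(len/num_chunks) items."""
    size = -(-len(batch) // num_chunks)  # ceiling division
    return [batch[i:i + size] for i in range(0, len(batch), size)]


def run_accumulation_steps(batch, mega_batch_factor):
    """Look up one chunk per accumulation step, mirroring `v[grad_accum_step]`."""
    chunks = split_into_chunks(batch, mega_batch_factor)
    return [chunks[step] for step in range(mega_batch_factor)]


# 8 samples split 4 ways: every accumulation step finds its chunk.
print(len(run_accumulation_steps(list(range(8)), 4)))

# 1 sample split 2 ways: only one chunk exists, so step 1 fails.
try:
    run_accumulation_steps([0], 2)
except IndexError as err:
    print("IndexError:", err)
```

Under that assumption, lowering batch_size to 1 while mega_batch_factor stays above 1 would trigger this error rather than avoid it, so the two values have to be reduced together.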
I made a pull request too; please accept it if possible: #71
Master Deep Voice Cloning in Minutes: Unleash Your Vocal Superpowers! Free and Locally on Your PC
This tutorial is based on
Ozen Toolkit for data preprocessing
DLAS for Training
Tortoise TTS Fast for speech synthesis