victorchall / everydream-trainer

This project forked from kanewallmann/dreambooth-stable-diffusion

506.0 506.0 41.0 50.9 MB

General fine tuning for Stable Diffusion

License: MIT License

Shell 0.06% Python 10.46% Jupyter Notebook 89.48%

everydream-trainer's Issues

ckpt file not saving when training has finished.

When training finishes and the trainer tries to save the ckpt file, it tends to use all system RAM and the file is never saved.

Windows 10 | 16 Core AMD | 32G RAM | 3090

`Training halted. Summoning checkpoint as last.ckpt
Training complete. max_steps or max_epochs reached, or we blew up.

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 604, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 380, in save
return
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 259, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\caffe2\serialize\inline_container.cc:319] . unexpected pos 9926808960 vs 9926808856

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 754, in <module>
trainer.fit(model, data)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1236, in _run
results = self._run_stage()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1323, in _run_stage
return self._run_train()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\loops\base.py", line 205, in run
self.on_advance_end()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 294, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1636, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 308, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 379, in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 651, in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 702, in _update_best_and_save
self._save_checkpoint(trainer, filepath)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 384, in _save_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 2467, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 445, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 418, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\plugins\io\torch_plugin.py", line 54, in save_checkpoint
atomic_save(checkpoint, path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\utilities\cloud_io.py", line 67, in atomic_save
torch.save(checkpoint, bytesbuffer)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 381, in save
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 225, in __exit__
self.file_like.flush()
ValueError: I/O operation on closed file.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 756, in <module>
melk()
File "main.py", line 733, in melk
trainer.save_checkpoint(ckpt_path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 2467, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 445, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 418, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\plugins\io\torch_plugin.py", line 54, in save_checkpoint
atomic_save(checkpoint, path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\utilities\cloud_io.py", line 67, in atomic_save
torch.save(checkpoint, bytesbuffer)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 381, in save
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 225, in __exit__
self.file_like.flush()
ValueError: I/O operation on closed file.
`
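The trace shows Lightning's `atomic_save` calling `torch.save(checkpoint, bytesbuffer)`, so the checkpoint is serialized into a buffer before it is written to disk. The sketch below only does the arithmetic on the byte offset reported in the `RuntimeError`. The `copies` parameter is an assumption (one buffer plus one copy made when the buffer is written out), not something the log confirms, and the helper names are made up for illustration.

```python
# Byte offset reported by the failed write in the traceback above.
REPORTED_POS_BYTES = 9_926_808_960
INSTALLED_RAM_GIB = 32  # from the issue: 32G RAM


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def estimated_peak_gib(ckpt_bytes: int, copies: int = 2) -> float:
    """Assumption: `copies` full in-memory copies of the serialized data exist at peak."""
    return to_gib(ckpt_bytes) * copies


ckpt_gib = to_gib(REPORTED_POS_BYTES)
peak_gib = estimated_peak_gib(REPORTED_POS_BYTES)
print(f"serialized before failure: {ckpt_gib:.2f} GiB")
print(f"estimated buffer peak: {peak_gib:.2f} GiB of {INSTALLED_RAM_GIB} GiB installed")
```

This estimate excludes the model, optimizer state, and dataloader workers that are still resident when the save starts, which is one plausible reason a 32 GB machine runs out.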

'Trainer' object has no attribute 'strategy'

The following error occurs:

Training: 0it [00:00, ?it/s]Training halted. Summoning checkpoint as last.ckpt
Training complete. max_steps or max_epochs reached, or we blew up.

Traceback (most recent call last):
File "main.py", line 754, in <module>
trainer.fit(model, data)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1199, in _run
self._dispatch()
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1289, in run_stage
return self._run_train()
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1319, in _run_train
self.fit_loop.run()
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\loops\base.py", line 145, in run
self.advance(*args, **kwargs)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\loops\base.py", line 140, in run
self.on_run_start(*args, **kwargs)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 137, in on_run_start
self.trainer.call_hook("on_train_epoch_start")
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "C:\Users\tkgix.conda\envs\everydream\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 88, in on_train_epoch_start
callback.on_train_epoch_start(self, self.lightning_module)
File "E:\AI_Tools_EveryDream-trainer\EveryDream-trainer\main.py", line 461, in on_train_epoch_start
torch.cuda.reset_peak_memory_stats(trainer.strategy.root_device.index)
AttributeError: 'Trainer' object has no attribute 'strategy'
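The same traceback passes through `self.training_type_plugin.start_training(self)`, which suggests the installed pytorch_lightning is an older release that exposes `training_type_plugin` rather than `strategy`. Installing the version the project pins is probably the cleaner fix. As a sketch only, a lookup that tolerates both attribute names could look like this; `get_root_device_index` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a real `Trainer`.

```python
from types import SimpleNamespace


def get_root_device_index(trainer):
    """Return root_device.index from whichever attribute the trainer exposes."""
    plugin = getattr(trainer, "strategy", None)
    if plugin is None:
        # Older Lightning releases used this name instead of `strategy`.
        plugin = getattr(trainer, "training_type_plugin", None)
    if plugin is None:
        raise AttributeError(
            "trainer exposes neither 'strategy' nor 'training_type_plugin'"
        )
    return plugin.root_device.index


# Stand-ins for the two attribute layouts (not real Trainer objects).
old_style = SimpleNamespace(
    training_type_plugin=SimpleNamespace(root_device=SimpleNamespace(index=0))
)
new_style = SimpleNamespace(
    strategy=SimpleNamespace(root_device=SimpleNamespace(index=1))
)
print(get_root_device_index(old_style), get_root_device_index(new_style))
```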

AttributeError: image

I put all images and their associated txt files (each records the prompt string) into the "input" folder, and during training I get the following error. I'm not sure why it failed to delete some images.

Epoch 0:   2%| | 33/1456 [01:09<50:09,  2.11s/it, loss=0.0916, v_num=0, train/loTraining halted. Summoning checkpoint as last.ckpt
Training complete. max_steps or max_epochs reached, or we blew up.

Traceback (most recent call last):
  File "/workspace/everydream-trainer/main.py", line 754, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 171, in advance
    batch = next(data_fetcher)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 259, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 273, in _fetch_next_batch
    batch = next(iterator)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 558, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 570, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "/opt/conda/lib/python3.9/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
AttributeError: Caught AttributeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/workspace/everydream-trainer/main.py", line 193, in __getitem__
    return self.data[idx]
  File "/workspace/everydream-trainer/ldm/data/every_dream.py", line 70, in __getitem__
    del self.image_train_items[j].image
AttributeError: image
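`AttributeError: image` is what Python 3.9 raises when `del obj.image` runs on an object that has no `image` attribute, so the failure is about an in-memory attribute rather than image files on disk. Why the attribute is already gone is not visible in the log. A guarded delete, sketched with a hypothetical stand-in class rather than the project's own item type:

```python
class ImageTrainItem:
    """Hypothetical stand-in for the items held in image_train_items."""

    def __init__(self, caption):
        self.caption = caption


def release_image(item):
    """Drop the cached image if present; return True if something was removed."""
    if hasattr(item, "image"):
        del item.image
        return True
    return False


item = ImageTrainItem("a photo")
item.image = object()          # pretend an image was loaded
first = release_image(item)    # attribute exists, so it is removed
second = release_image(item)   # already gone, no exception this time
print(first, second)
```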

Please add option 'Do not resize'

Thank you for your hard work.

Please allow people who don't need AutoScaling to turn it off.

RuntimeError on 3090Ti

Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]Training halted. Summoning checkpoint as last.ckpt
Training complete. max_steps or max_epochs reached, or we blew up.

Traceback (most recent call last):
File "main.py", line 754, in <module>
trainer.fit(model, data)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
output = self._evaluation_step(**kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jumble/EveryDream-trainer/ldm/models/diffusion/ddpm.py", line 368, in validation_step
_, loss_dict_no_ema = self.shared_step(batch)
File "/home/jumble/EveryDream-trainer/ldm/models/diffusion/ddpm.py", line 905, in shared_step
loss = self(x, c)
File "/home/jumble/anaconda3/envs/everydream/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/jumble/EveryDream-trainer/ldm/models/diffusion/ddpm.py", line 935, in forward
return self.p_losses(x, c, t, *args, **kwargs)
File "/home/jumble/EveryDream-trainer/ldm/models/diffusion/ddpm.py", line 1086, in p_losses
logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

I'm confused about why there is a device issue here: torch.cuda.is_available() returns True, and I'm not sure where the CPU is being used. Thanks for any help!
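The message says the indexed tensor (`self.logvar`) lives on the CPU while the index tensor `t` does not; `torch.cuda.is_available()` only reports that a GPU can be used, not where each tensor sits. One way to make the lookup device-agnostic is to move the indices to the tensor's device before indexing. This is a sketch with a hypothetical helper name, not the project's actual fix:

```python
import torch


def lookup_logvar(logvar: torch.Tensor, t: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Index `logvar` with `t` regardless of which device each one is on."""
    # Move the indices to wherever `logvar` lives, index, then move the result.
    return logvar[t.to(logvar.device)].to(device)


logvar = torch.zeros(1000)  # CPU tensor, as in the reported error
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = torch.randint(0, 1000, (4,), device=device)
result = lookup_logvar(logvar, t, device)
print(tuple(result.shape), result.device.type)
```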

Running out of Memory on a 3090

I have a model and want to train an art style with 8000 pictures. Since 8k is a bit much, I decided to split the training into pieces of 2000. I don't care much about overtraining as long as the model recognizes the token or art style from the prompts afterwards.

Any batch size larger than 1 gives me:

RuntimeError: CUDA out of memory. Tried to allocate 9.99 GiB (GPU 0; 24.00 GiB total capacity; 5.07 GiB already allocated; 10.71 GiB free; 10.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Why would I need to change the memory pages?
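`max_split_size_mb` is an option of PyTorch's CUDA caching allocator rather than an operating-system paging setting, and it is passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable. A minimal sketch of setting it from Python; the value 128 is an arbitrary illustration and `set_alloc_conf` is a hypothetical helper:

```python
import os


def set_alloc_conf(max_split_size_mb: int) -> str:
    """Set the allocator option named in the error message and return the value used."""
    value = f"max_split_size_mb:{max_split_size_mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value


# Run this before the process makes its first CUDA allocation;
# doing it before `import torch` is the cautious option.
applied = set_alloc_conf(128)  # 128 is an illustrative value, not a recommendation
print(applied)
```

Note that the reported failure is a single 9.99 GiB request with 10.71 GiB free, so this setting may not be enough on its own; a smaller batch size or resolution may still be needed.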

Running fine and training works but very slow saving of checkpoints

Hi, as the title states, everything works as expected, but I am running micro.yaml with the provided Ted Bennet test files, and as soon as an epoch finishes it tries to save. Saving takes ages compared to the training: I only got through two epochs, and that took around two hours. The log directory is on the same SSD, so I was wondering what the culprit could be. Or is this expected behavior? Cheers

Sample generated images are always identical

While I'm training the model, the generated sample images are always identical (at the pixel level) at every step.
It seems that they are generated with the same weights instead of those of the updated model.

Runpod notebook: 'str2optimizer8bit_blockwise' is not defined

Everything was working a few days ago. Today, running the train cell of Train_JupyterLab.ipynb causes this error:

NameError: name 'str2optimizer8bit_blockwise' is not defined

I'm guessing that Runpod changed something in their 2.1 container.

Invalid load key

The log is attached.

(everydream) C:\Users\tomwe\ed>python main.py --base configs/stable-diffusion/v1-finetune_everydream.yaml -t --actual_resume C:\Users\tomwe\ed\models\v1-5-pruned.safetensors -n MyProjectName --data_root training_samples\MyProject
Global seed set to 23
Running on GPUs 0,
Loading model from C:\Users\tomwe\ed\models\v1-5-pruned.safetensors
Traceback (most recent call last):
File "main.py", line 585, in <module>
model = load_model_from_config(config, opt.actual_resume)
File "main.py", line 29, in load_model_from_config
pl_sd = torch.load(ckpt, map_location="cpu")
File "C:\Users\tomwe\miniconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "C:\Users\tomwe\miniconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\xca'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 783, in <module>
if trainer.global_rank == 0:
NameError: name 'trainer' is not defined

(everydream) C:\Users\tomwe\ed>
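The log shows `torch.load` being called on a `.safetensors` file. That format is not a pickle, which is consistent with the `invalid load key` error. A sketch of choosing the loader by file extension follows; `pick_loader` and `load_weights` are hypothetical helpers, and the `safetensors` package is assumed to be installed for the first branch.

```python
def pick_loader(path: str) -> str:
    """Decide which loader a weights file needs, based on its extension."""
    return "safetensors" if path.lower().endswith(".safetensors") else "torch"


def load_weights(path: str):
    """Load weights onto the CPU with the loader that matches the file type."""
    if pick_loader(path) == "safetensors":
        # Imported lazily so the helper still works when the package is absent
        # and only .ckpt files are used.
        from safetensors.torch import load_file
        return load_file(path, device="cpu")
    import torch
    return torch.load(path, map_location="cpu")


print(pick_loader(r"C:\Users\tomwe\ed\models\v1-5-pruned.safetensors"))
print(pick_loader("models/v1-5-pruned.ckpt"))
```

`load_file` returns a flat dictionary of tensors, whereas a `.ckpt` usually nests the weights under a `state_dict` key, so the calling code may need to handle both layouts. Converting the file to `.ckpt` beforehand is the other option.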

Does the Micro mode support multiple aspect ratios?

The question is the same as the title. I ask because I noticed that Ted's dataset is composed entirely of square images.

error when trying to train a model

I used the micromodels guide and trained a model twice. The second attempt was very good.
Now I am trying to run the same command again, to train with another 1.5 variation, and I get these errors.
Even the same command and base model I used before give the same error.
Any idea why?

What needs to be done to support 2.0

Hi,

Firstly, thank you very much for the useful repo. I was trying to fine-tune with stable diffusion 2.0 and got the following error:

RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
        size mismatch for model.diffusion_model.input_blocks.1.1.proj_in.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape
in current model is torch.Size([320, 320, 1, 1]).
        size mismatch for model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1024]) fr
om checkpoint, the shape in current model is torch.Size([320, 768]).

Tried with 512-base-ema.ckpt

What needs to be done so that the trainer can support v2? Some pointers would be awesome so I could create a pull request :)
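The two mismatches point at two architectural differences: the checkpoint uses a linear projection in its transformer blocks (a 2-D `proj_in.weight`) and a 1024-wide text context, while the current config builds a 1x1 convolution and a 768-wide context. In Stability's published v2 configs these correspond, as far as I know, to `use_linear_in_transformer: True` and `context_dim: 1024`, along with a different text encoder. The sketch below reads those two facts off the shapes quoted in the error; `describe_unet` is a hypothetical helper and the shape dictionaries are copied from the message, not loaded from a file.

```python
PROJ_IN = "model.diffusion_model.input_blocks.1.1.proj_in.weight"
TO_K = "model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight"


def describe_unet(shapes: dict) -> dict:
    """Infer two config-relevant properties from parameter shapes."""
    return {
        # A 2-D weight means a linear layer; 4-D means a 1x1 convolution.
        "use_linear_in_transformer": len(shapes[PROJ_IN]) == 2,
        # The second dimension of to_k is the width of the text embedding.
        "context_dim": shapes[TO_K][1],
    }


# Shapes taken from the error message above.
checkpoint_shapes = {PROJ_IN: (320, 320), TO_K: (320, 1024)}
model_shapes = {PROJ_IN: (320, 320, 1, 1), TO_K: (320, 768)}

print("checkpoint:", describe_unet(checkpoint_shapes))
print("current model:", describe_unet(model_shapes))
```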

Module Not Found Error

This is my first time trying to use EveryDream, on Colab Pro with a High-RAM T4 machine.
I am unable to get past the first step.

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 98>()
96 Python_version = get_ipython().getoutput('python --version')
97 import torch
---> 98 import torchvision
99 import xformers
100

1 frames
/usr/local/lib/python3.10/dist-packages/torchvision/_meta_registrations.py in
2
3 import torch
----> 4 import torch._custom_ops
5 import torch.library
6

ModuleNotFoundError: No module named 'torch._custom_ops'
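Here torchvision imports `torch._custom_ops`, which the installed torch does not provide. That usually points to a torchvision build newer than the torch it is paired with, although the log alone does not prove it. A quick way to see what is actually installed, using only the standard library (`installed_version` is a hypothetical helper):

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages imported by the failing notebook cell.
for name in ("torch", "torchvision", "xformers"):
    print(name, installed_version(name))
```

Comparing the printed pair against the torch/torchvision compatibility table should show whether one of them needs to be reinstalled.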

Running but no checkpoints saved

Sometimes this runs beautifully on my PC. Other times, it seems to run but doesn't save anything to the checkpoints directory.

How to train a model for 2 or more people?

The guide you have (MICROMODELS.MD) works fine when I use a base model (SD1.5, for example) and one person.

I use a script like this:

#!/bin/bash
#

BASE_MODEL=/.../SD1.5.ckpt
PROMPT=sks
PROMPT_TRAINING_DIR=/.../Pictures/sks/clean

python main.py \
    --base configs/stable-diffusion/v1-finetune_micro.yaml \
    -t \
    --actual_resume $BASE_MODEL \
    -n $PROMPT \
    --gpus 0, \
    --data_root $PROMPT_TRAINING_DIR

where sks is the person I want, and /sks/clean contains good 512x512 face photos of that person.

The problem is that if I take the output model, use it as BASE_MODEL, and train on another person, the results are weird.
It kind of knows the first person, but the second is a mix of the two!

Another issue I saw: if I start from an existing model trained by someone else (say, https://huggingface.co/wavymulder/Analog-Diffusion), the results are not good at all. I cannot get the sks person to look anything like the real one.
I'm wondering what the issue is.
When a model is retrained with new photos, does it change so much that I cannot use it as a base for another one?
Or am I missing something?

Allow pruning script to prune with float32 instead of float16

Currently your pruning script reduces the model size to 2GB at the cost of converting it to half precision. Please add an option to keep full precision.
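A rough sketch of what such an option could look like, assuming pruning means keeping only the `state_dict` and optionally casting floating-point tensors to half precision. This is not the project's actual script: `prune_checkpoint` and the `half` flag are hypothetical, and `FakeTensor` is a stand-in so the example runs without PyTorch (real tensors offer the same `is_floating_point()` and `half()` methods).

```python
class FakeTensor:
    """Stand-in mimicking the two tensor methods the sketch relies on."""

    def __init__(self, dtype: str):
        self.dtype = dtype

    def is_floating_point(self) -> bool:
        return self.dtype.startswith("float")

    def half(self) -> "FakeTensor":
        return FakeTensor("float16")


def prune_checkpoint(ckpt: dict, half: bool = True) -> dict:
    """Keep only the weights; cast floating-point tensors to fp16 when `half` is set."""
    state = ckpt.get("state_dict", ckpt)
    pruned = {}
    for name, tensor in state.items():
        if half and tensor.is_floating_point():
            tensor = tensor.half()
        pruned[name] = tensor
    return {"state_dict": pruned}


ckpt = {
    "state_dict": {"w": FakeTensor("float32"), "step": FakeTensor("int64")},
    "optimizer_states": ["large", "and", "not", "needed", "for", "inference"],
}
full = prune_checkpoint(ckpt, half=False)
small = prune_checkpoint(ckpt, half=True)
print(full["state_dict"]["w"].dtype, small["state_dict"]["w"].dtype)
```

With `half=False` the file would shrink only by the amount of dropped training state, so the result would be larger than the 2GB mentioned above.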
