
text-to-video-finetuning's People

Contributors

bfasenfest, bruefire, camenduru, exponentialml, jacobyuan7, jcbrouwer, kabachuha, maximepeabody, one-shot-finish, samran-elahi, sergiobr, zideliu


text-to-video-finetuning's Issues

About VideoLDM

Do you have any knowledge of VideoLDM, and is it possible to integrate its algorithms to further enhance the capabilities of current models, such as generating longer videos?

Similar implementation to Nvidia VideoLDM?

Is there any possible way to replicate Nvidia's implementation, which uses SD / DreamBooth models as a base for a txt2vid model?
https://research.nvidia.com/labs/toronto-ai/VideoLDM/

I saw this unofficial implementation, but I'm not sure where it's headed:
https://github.com/srpkdyy/VideoLDM

Is there no way to use the ModelScope or ZeroScope models and somehow merge them together, or do some training or fine-tuning on top of a DreamBooth model?

How to test my model

Hello, after training the model, how do I test it by giving text as input? Please help me with this.
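
For reference, a finetuned diffusers-format model can be tested either with the repo's inference.py (see the command further down this page) or directly with diffusers. A minimal sketch, assuming your training run saved a diffusers-format pipeline to a hypothetical ./outputs/my_finetuned_model folder:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the saved pipeline and generate a short clip from a text prompt.
pipe = DiffusionPipeline.from_pretrained(
    "./outputs/my_finetuned_model",  # hypothetical path to your trained pipeline
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
video_frames = pipe("your test prompt", num_inference_steps=25).frames
export_to_video(video_frames, "./test.mp4")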

Refer to the official release of Diffusers?

It seems the models finetuned for Diffusers target the latest beta version rather than the latest official release, so they error out when loaded with the official version of Diffusers. Could they be changed to refer to official releases instead?

Generates wrong model_index.json

This model_index.json was generated for a finetuned model; the unet entry is null:

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    null,
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

For comparison, the one from text-to-video-ms-1.7b is correct:

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
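
Until the saving bug is fixed, a hedged workaround is to patch the generated file by hand so the unet entry names its library, matching the correct file above. A minimal sketch (the path is hypothetical; point it at your finetuned pipeline's model_index.json):

import json

path = "./outputs/my_finetuned_model/model_index.json"  # hypothetical output folder

with open(path) as f:
    index = json.load(f)

# The broken file has "unet": [null, "UNet3DConditionModel"]; restore the library name.
index["unet"] = ["diffusers", "UNet3DConditionModel"]

with open(path, "w") as f:
    json.dump(index, f, indent=2)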

Default model seems to output only noise or greenscreen

After several unsuccessful attempts at fine-tuning where the output was a still frame of noise or a green field, I followed the instructions and skipped ahead to inference to test the base model. It behaved the same way.

Am I not pointing to the model directory correctly?

!cd /content/Text-To-Video-Finetuning && python inference.py --model /content/Text-To-Video-Finetuning/models/model_scope_diffusers --prompt "cat in a space suit"

LoRA inference problem

When trying to run inference using the --lora_path parameter, I get:

LoRA rank 64 is too large. setting to: 4
list index out of range
Couldn't inject LoRA's due to an error.

0%|          | 0/50 [00:00<?, ?it/s]
0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 194, in <module>
videos = inference(**args)
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 141, in inference
videos = pipeline(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py", line 646, in __call__
noise_pred = self.unet(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/models/unet_3d_condition.py", line 399, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/utils/lora.py", line 60, in forward
+ self.dropout(self.lora_up(self.selector(self.lora_down(input))))
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x320 and 1280x16)

I'm running this on Colab.
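
The mismatch (6x320 vs. 1280x16) suggests the rank used at inference does not match the rank the LoRA was trained with; note the "LoRA rank 64 is too large. setting to: 4" line above it. A hedged sketch built from the calls that appear in inference.py's tracebacks on this page; the import paths, model path, and rank are assumptions to replace with your own values:

from inference import initialize_pipeline      # repo module, per the traceback paths
from utils.lora import inject_inferable_lora   # assumed to live in utils/lora.py

# initialize_pipeline(model, device, xformers, sdp) and
# inject_inferable_lora(pipeline, lora_path, r=...) mirror the calls shown in
# the tracebacks; r must equal the rank the LoRA was actually trained with
# (16 here is a guess, not a recommendation).
pipeline = initialize_pipeline("./models/model_scope_diffusers", "cuda", False, True)
inject_inferable_lora(pipeline, "./loras/my_lora.pt", r=16)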

Model inference for version 2

After fine-tuning with version 2 is complete, how do I perform model inference? For version 1 it is done as follows:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

my_trained_model_path = "./trained_model_path/"
pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames

out_file = "./my_video.mp4"
video_path = export_to_video(video_frames, out_file)

Watermark removal

I fine-tuned the model via single-video fine-tuning, but the watermark is still there. I would like to know what fine-tuning setup can remove the watermark.

Many thanks

training video

I want to train my own video model; please give me some guidance.

How long should each clip be? How many frames per video? How many videos are needed?
After I have finished training, how do I call the model in the webui?

Colab?

I am a lazy person.
Has anyone managed to run the finetuning on Colab?

multi-gpu training

will it be difficult to modify the code to support multi-gpu training?

About step_loss in version 2

During training with version 2, the step loss easily becomes NaN, even if the learning rate is lowered. Have you encountered this issue before?

VRAM optimization?

I want to train with n_sample_frames: 100, using 100 videos.

I'm using a 3090 Ti, and the maximum n_sample_frames I can fit is 24.

Last question: can Text-To-Video-Finetuning recreate this video (same number of frames, maintaining the camera motion)?

(attached: video.mp4)

How does n_sample_frames work?

So, for example, if I set it to 4, will it finetune using only 4 frames of the video, no matter the video's length?

TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

I cloned this model: git clone https://huggingface.co/camenduru/potat1

The command:

python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
(venv) F:\potat text to video>python check.py
CUDA is available on your system.
CUDA device count: 2
CUDA device name: NVIDIA GeForce RTX 3090 Ti


(venv) F:\potat text to video>cd Text-To-Video-Finetuning

(venv) F:\potat text to video\Text-To-Video-Finetuning>python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
Traceback (most recent call last):
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 194, in <module>
    videos = inference(**args)
  File "F:\potat text to video\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 122, in inference
    pipeline = initialize_pipeline(model, device, xformers, sdp)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 24, in initialize_pipeline
    unet.disable_gradient_checkpointing()
  File "F:\potat text to video\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 216, in disable_gradient_checkpointing
    self.apply(partial(self._set_gradient_checkpointing, value=False))
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 884, in apply
    module.apply(fn)
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 885, in apply
    fn(self)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

(venv) F:\potat text to video\Text-To-Video-Finetuning>

Folder for models?

In which folder should I put the models from git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b?
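
Any folder works as long as --model points at it. A hedged Python alternative to git clone that mirrors the ./models/model_scope_diffusers path used by the inference command elsewhere on this page:

from huggingface_hub import snapshot_download

# Download the diffusers-format weights into the folder that inference.py's
# --model argument will point at.
snapshot_download(
    repo_id="damo-vilab/text-to-video-ms-1.7b",
    local_dir="./models/model_scope_diffusers",
)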

The video doesn't move

After finetuning, the output video doesn't move; it just stays still. It looks good, but there is no movement.

First GPU occupies more VRAM in distributed training

Suggested change (link):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load(self.cached_data_list[index], map_location=device)
Otherwise, in multi-GPU distributed training, the first GPU may occupy excessive VRAM compared to the other GPUs.
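
A slightly fuller sketch of the same idea (the cache path below is hypothetical): in distributed training, each process should map the load onto its own local device rather than the device the tensors were saved from.

import os
import torch

# Without an explicit map_location, torch.load restores tensors to the device
# they were saved from (often cuda:0), which piles VRAM onto the first GPU.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load("cached_latents/000000.pt", map_location=device)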

Error with inference

I get this using both my finetuned model and the original 1.7b model:

Traceback (most recent call last):
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 192, in <module>
    videos = inference(**args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 120, in inference
    pipeline = initialize_pipeline(model, device, xformers, sdp)
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 33, in initialize_pipeline
    unet._set_gradient_checkpointing(value=False)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() missing 1 required positional argument: 'module'
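
This traceback and the one in the previous issue are two sides of the same signature mismatch between the repo's custom UNet3DConditionModel and the installed diffusers version: recent diffusers defines the hook as _set_gradient_checkpointing(self, module, value=False) and applies it to every submodule. A minimal sketch of a compatible override, assuming the repo's custom model class (not its exact code):

import torch.nn as nn

class UNet3DConditionModel(nn.Module):
    # Sketch only: the real model defines many more layers and methods.

    def _set_gradient_checkpointing(self, module, value=False):
        # diffusers invokes this via self.apply(partial(..., value=...)), i.e.
        # once per submodule, so the hook must accept the module argument.
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value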


Custom resolution?

Is there a possibility of using a custom resolution for training/inference?

Links to weights?

How about sharing text2video/fine-tuned weights here?

The two working weights I have found so far are these:
damo-vilab/text-to-video-ms-1.7b
strangeman3107/animov-0.1.1

Accelerator 'function' object has no attribute '__func__'

Is there a specific version of accelerate that will work? I recently had to reinstall my requirements, and what worked before doesn't work anymore. I think accelerate changed something on their end that causes this error message. I am using a fresh install at the moment, and everything works up until saving the first checkpoint.

Configuration saved in ./outputs\train_2023-07-02T07-13-31\checkpoint-100\model_index.json
Traceback (most recent call last):
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 994, in <module>
    main(**OmegaConf.load(args.config))
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 899, in main
    save_pipe(
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 506, in save_pipe
    unet, text_encoder = accelerator.prepare(unet, text_encoder)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1182, in prepare
    result = tuple(
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1308, in prepare_model
    model.forward = MethodType(torch.cuda.amp.autocast(dtype=torch.float16)(model.forward.__func__), model)
AttributeError: 'function' object has no attribute '__func__'
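
The failing line in save_pipe calls accelerator.prepare() a second time on models that were already prepared; with mixed_precision="fp16", prepare() replaces model.forward with a plain autocast-wrapping function, which has no __func__ the second time around. A hedged sketch of the workaround, not the repo's exact code: unwrap instead of re-preparing.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
model = accelerator.prepare(model)  # first prepare: fine

# Rather than calling accelerator.prepare(...) again when saving a checkpoint,
# recover the original nn.Module that prepare() wrapped and save that.
original = accelerator.unwrap_model(model)
torch.save(original.state_dict(), "checkpoint.pt")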

Finetune training error: "UnboundLocalError: local variable 'use_offset_noise' referenced before assignment"

After I commented out the get_logger code, training runs with the output below, but hits another error.
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
{'variance_type', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
LoRA rank 16 is too large. setting to: 4
LoRA rank 16 is too large. setting to: 4
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Caching Latents.: 100%|██████████| 38/38 [00:14<00:00, 2.68it/s]
Steps: 0%| | 0/10000 [00:00<?, ?it/s]2628 params have been unfrozen for training.
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 908, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 810, in finetune_unet
use_offset_noise = use_offset_noise and not rescale_schedule
UnboundLocalError: local variable 'use_offset_noise' referenced before assignment
Steps: 0%| | 0/10000 [00:55<?, ?it/s]
Is there more guidance on finetune training? Many thanks.
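
The error itself is a Python scoping pitfall: assigning to use_offset_noise inside finetune_unet makes the name local to the function, so the read on the right-hand side fails before the assignment completes. A self-contained demonstration of the pitfall and the fix (names mirror the traceback; values are placeholders):

use_offset_noise = True
rescale_schedule = False

def finetune_unet_broken():
    # UnboundLocalError: the assignment below makes use_offset_noise local to
    # this function, so the read on the right-hand side sees an unbound name.
    use_offset_noise = use_offset_noise and not rescale_schedule
    return use_offset_noise

def finetune_unet_fixed():
    # Binding the result to a fresh name leaves the outer flag untouched.
    offset_noise = use_offset_noise and not rescale_schedule
    return offset_noise

print(finetune_unet_fixed())  # True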

> will it be difficult to modify the code to support multi-gpu training?

Thanks, I will try it.

> will it be difficult to modify the code to support multi-gpu training?

I've never tried multiple GPU training, but you may be able to do it naively with accelerate.

accelerate config

You should be prompted to configure your setup, including multiple GPU training.
Then it should be as simple as:

accelerate launch train.py --config ./configs/my_config_hq.yaml

Let us know how it goes if you decide to try! If it doesn't I could try to implement it, but I don't have multiple GPUs and would probably need to rent out a server to do so.

Originally posted by @ExponentialML in #14 (comment)

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

After running train.py I get this RuntimeError:

C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
03/24/2023 08:52:48 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
03/24/2023 08:52:50 - INFO - __main__ - ***** Running training *****
03/24/2023 08:52:50 - INFO - __main__ - Num examples = 1
03/24/2023 08:52:50 - INFO - __main__ - Num Epochs = 1200
03/24/2023 08:52:50 - INFO - __main__ - Instantaneous batch size per device = 1
03/24/2023 08:52:50 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
03/24/2023 08:52:50 - INFO - __main__ - Gradient Accumulation steps = 1
03/24/2023 08:52:50 - INFO - __main__ - Total optimization steps = 1200
Steps: 0%| | 0/1200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 498, in <module>
    main(**OmegaConf.load(args.config))
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 394, in main
    loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 339, in finetune_unet
    latents = tensor_to_vae_latent(pixel_values, vae)
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 157, in tensor_to_vae_latent
    latents = vae.encode(t).latent_dist.sample()
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\autoencoder_kl.py", line 164, in encode
    h = self.encoder(x)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\vae.py", line 109, in forward
    sample = self.conv_in(sample)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
Steps: 0%| | 0/1200 [00:00<?, ?it/s]
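
The log shows Device: cpu together with Mixed precision type: fp16, and half-precision convolutions are simply not implemented on CPU, which is exactly what the error says. A hedged pre-flight check, assuming you intended to train on GPU:

import torch

# fp16 conv kernels only exist on GPU. Either make CUDA visible to PyTorch, or
# switch mixed precision to "no" (fp32) in the accelerate/training config for CPU runs.
if not torch.cuda.is_available():
    raise SystemExit("CUDA not available: install a CUDA build of PyTorch, "
                     "or disable fp16 mixed precision for CPU training.")
print("Training on:", torch.cuda.get_device_name(0))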

error of "from accelerate.logging import get_logger"

Hi, I get the following error when I run finetune training:
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 617, in main
create_logging(logging, logger, accelerator)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 64, in create_logging
logger.info(accelerator.state, main_process_only=False)
AttributeError: 'str' object has no attribute 'info'

I have tried different versions of accelerate, but cannot solve this error.
Here is the relevant part of my pip list:
Package Version


accelerate 0.20.3
tensorboard 2.10.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tokenizers 0.13.3
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
transformers 4.30.2
triton 2.0.0

TypeError: get_logger() got an unexpected keyword argument 'log_level'

Hi, I am trying to run your script but it always shows me this error.
Another thing is that it's not possible for me to install triton. It's like the repo doesn't exist anymore.

Error caught was: No module named 'triton'
Traceback (most recent call last):
  File "C:\Users\Life\Text-To-Video-Finetuning\train.py", line 43, in <module>
    logger = get_logger(__name__, log_level="INFO")
TypeError: get_logger() got an unexpected keyword argument 'log_level'
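
Both this and the previous issue point to an accelerate version mismatch: older accelerate's get_logger() does not accept a log_level argument, while train.py assumes a newer release. A hedged compatibility shim for train.py, if upgrading accelerate is not an option:

import logging
from accelerate.logging import get_logger

try:
    # Newer accelerate accepts log_level directly.
    logger = get_logger(__name__, log_level="INFO")
except TypeError:
    # Older accelerate: create the logger without it, then set the level on the
    # underlying stdlib logger that get_logger wraps.
    logger = get_logger(__name__)
    logging.getLogger(__name__).setLevel(logging.INFO)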

Question: regarding fine-tuning with images

Is there a rough count of how many images are needed to train a concept when not using a video? I know for LoRA it can be as few as 9-10, but for DreamBooth it's usually 2-3x that amount.

Is this the new version of the model?

A new version of ModelScope was released recently; it was trained for a month longer and can generate better videos. Is this repo using the new model or the old one?

Feature request

Thank you for making this. It seems to work, and I have a model.

I wanted to ask if there is:

  1. a link to a repository we can use to generate videos with our new diffusion models, or a small example of how to do it with Python;
  2. a way to specify the frame rate of the sample videos. Everything seems to sample at 6-8 fps, so the default 24 fps videos play too fast to really see what the sample looks like (see the sketch after this list);
  3. if we use a JSON file, do we also need to specify the video folder, or do the JSON's hyperlinks take care of that?

Thank you!
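
On point 2, a hedged sketch for writing sampled frames at an explicit frame rate instead of the 24 fps default; it assumes frames are HxWx3 uint8 arrays (as the diffusers pipeline returns) and that the imageio-ffmpeg backend is installed for .mp4 output:

import imageio
import numpy as np

# Stand-in frames so the sketch runs on its own; in practice, pass the
# video_frames returned by the pipeline in the version-1 snippet above.
video_frames = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(16)]
imageio.mimsave("sample_8fps.mp4", video_frames, fps=8)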

webui LoRA might be causing errors in checkpoint models

Some weights of the model checkpoint were not used when initializing UNet3DConditionModel:
This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Has anyone else had similar issues? I believe it has to do with the LoRA training, because I only notice this behavior on models created while also training the new webui LoRA. The most recent model did not use the LoRAs and had no such issues.

NameError: name 'glob' is not defined

After I run the script with train_config.yaml I get the error below:

2023-04-09 13:40:38.702636: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 13:40:40 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Traceback (most recent call last):
  File "/content/Text-To-Video-Finetuning/train.py", line 915, in <module>
    main(**OmegaConf.load(args.config))
  File "/content/Text-To-Video-Finetuning/train.py", line 582, in main
    train_datasets = get_train_dataset(dataset_types, train_data, tokenizer)
  File "/content/Text-To-Video-Finetuning/train.py", line 86, in get_train_dataset
    train_datasets.append(DataSet(**train_data, tokenizer=tokenizer))
  File "/content/Text-To-Video-Finetuning/utils/dataset.py", line 487, in __init__
    self.video_files = glob(f"{path}/*.mp4")
NameError: name 'glob' is not defined
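
The fix is a one-line import change in utils/dataset.py: the code calls glob(...) as a bare function, so the function itself must be imported, not the module. A minimal runnable sketch:

from glob import glob  # `import glob` alone would require calling glob.glob(...)

video_files = glob("./my_videos/*.mp4")  # hypothetical folder, for illustration
print(video_files)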
