
text-to-video-finetuning's People

Contributors

bfasenfest, bruefire, camenduru, exponentialml, jacobyuan7, jcbrouwer, kabachuha, maximepeabody, one-shot-finish, samran-elahi, sergiobr, zideliu


text-to-video-finetuning's Issues

About VideoLDM

Do you have any knowledge of VideoLDM, and is it possible to integrate its algorithms to further enhance the capabilities of current models, such as generating longer videos?

Similar implementation to Nvidia VideoLDM?

Is there any possible way to replicate Nvidia's implementation, which uses SD / DreamBooth models as a base for a txt2vid model?
https://research.nvidia.com/labs/toronto-ai/VideoLDM/

I saw this unofficial implementation, but I'm not sure where it's headed:
https://github.com/srpkdyy/VideoLDM

Is there no way to use the ModelScope or ZeroScope models and somehow merge them together, or do some training or fine-tuning on top of a DreamBooth model?

How to test my model

Hello, after training the model, how do I test it by giving text as input? Please help me with this.
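
For reference, a finetuned diffusers-format model can be tested either with the repo's inference.py (see the command further down this page) or directly with diffusers. A minimal sketch, assuming your training run saved a diffusers-format pipeline to a hypothetical ./outputs/my_finetuned_model folder:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the saved pipeline and generate a short clip from a text prompt.
pipe = DiffusionPipeline.from_pretrained(
    "./outputs/my_finetuned_model",  # hypothetical path to your trained pipeline
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
video_frames = pipe("your test prompt", num_inference_steps=25).frames
export_to_video(video_frames, "./test.mp4")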

Refer to the official release of Diffusers?

It seems the models finetuned for Diffusers target the latest beta version rather than the latest official release, so they error out when loaded with the official version of Diffusers. Could they be changed to refer to official releases instead?

Generates wrong model_index.json

This model_index.json was generated for a finetuned model; the unet entry is null:

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    null,
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

For comparison, the one from text-to-video-ms-1.7b is correct:

{
  "_class_name": "TextToVideoSDPipeline",
  "_diffusers_version": "0.15.0.dev0",
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet3DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
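
Until the saving bug is fixed, a hedged workaround is to patch the generated file by hand so the unet entry names its library, matching the correct file above. A minimal sketch (the path is hypothetical; point it at your finetuned pipeline's model_index.json):

import json

path = "./outputs/my_finetuned_model/model_index.json"  # hypothetical output folder

with open(path) as f:
    index = json.load(f)

# The broken file has "unet": [null, "UNet3DConditionModel"]; restore the library name.
index["unet"] = ["diffusers", "UNet3DConditionModel"]

with open(path, "w") as f:
    json.dump(index, f, indent=2)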

Default model seems to output only noise or greenscreen

After several unsuccessful attempts at fine-tuning where the output was a still frame of noise or a green field, I followed the instructions and skipped ahead to inference to test the base model. It behaved the same way.

Am I not pointing to the model directory correctly?

!cd /content/Text-To-Video-Finetuning && python inference.py --model /content/Text-To-Video-Finetuning/models/model_scope_diffusers --prompt "cat in a space suit"

LoRA inference problem

When trying to run inference using the --lora_path parameter, I get:

LoRA rank 64 is too large. setting to: 4
list index out of range
Couldn't inject LoRA's due to an error.

0%|          | 0/50 [00:00<?, ?it/s]
0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 194, in <module>
videos = inference(**args)
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 141, in inference
videos = pipeline(
File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py", line 646, in __call__
noise_pred = self.unet(
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/models/unet_3d_condition.py", line 399, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 192, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/Text-To-Video-Finetuning/utils/lora.py", line 60, in forward
+ self.dropout(self.lora_up(self.selector(self.lora_down(input))))
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (6x320 and 1280x16)

I'm running this on Colab.
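
The mismatch (6x320 vs. 1280x16) suggests the rank used at inference does not match the rank the LoRA was trained with; note the "LoRA rank 64 is too large. setting to: 4" line above it. A hedged sketch built from the calls that appear in inference.py's tracebacks on this page; the import paths, model path, and rank are assumptions to replace with your own values:

from inference import initialize_pipeline      # repo module, per the traceback paths
from utils.lora import inject_inferable_lora   # assumed to live in utils/lora.py

# initialize_pipeline(model, device, xformers, sdp) and
# inject_inferable_lora(pipeline, lora_path, r=...) mirror the calls shown in
# the tracebacks; r must equal the rank the LoRA was actually trained with
# (16 here is a guess, not a recommendation).
pipeline = initialize_pipeline("./models/model_scope_diffusers", "cuda", False, True)
inject_inferable_lora(pipeline, "./loras/my_lora.pt", r=16)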

Model inference for version 2

After fine-tuning with version 2 is complete, how do I perform model inference? For version 1 it is done as follows:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

my_trained_model_path = "./trained_model_path/"
pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames

out_file = "./my_video.mp4"
video_path = export_to_video(video_frames, out_file)

Watermark removal

I fine-tuned the model via single-video fine-tuning, but the watermark is still there. I would like to know what fine-tuning setup can remove the watermark.

Many thanks

training video

I want to train my own video model; please give me some guidance.

How long should each clip be? How many frames per video? How many videos are needed?
After I have finished training, how do I call the model in the webui?

Colab?

I am a lazy person.
Has anyone managed to run the finetuning on Colab?

multi-gpu training

will it be difficult to modify the code to support multi-gpu training?

About step_loss in version 2

During training with version 2, the step loss easily becomes NaN, even if the learning rate is lowered. Have you encountered this issue before?

VRAM optimization?

I want to train with n_sample_frames: 100, using 100 videos.

I'm using a 3090 Ti, and the maximum n_sample_frames I can fit is 24.

Last question: can Text-To-Video-Finetuning recreate this video (same number of frames, maintaining the camera motion)?

(attached: video.mp4)

How does n_sample_frames work?

So, for example, if I set it to 4, will it finetune using only 4 frames of the video, no matter the video's length?

TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

I cloned this model: git clone https://huggingface.co/camenduru/potat1

The command:

python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
(venv) F:\potat text to video>python check.py
CUDA is available on your system.
CUDA device count: 2
CUDA device name: NVIDIA GeForce RTX 3090 Ti


(venv) F:\potat text to video>cd Text-To-Video-Finetuning

(venv) F:\potat text to video\Text-To-Video-Finetuning>python inference.py -m "F:\potat text to video\potat1" -p "fast moving fancy sports car" -W 1024 -H 576 -o "F:\potat text to video" -d cuda -x -s 50 -g 23 -f 24 -T 48
Traceback (most recent call last):
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 194, in <module>
    videos = inference(**args)
  File "F:\potat text to video\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 122, in inference
    pipeline = initialize_pipeline(model, device, xformers, sdp)
  File "F:\potat text to video\Text-To-Video-Finetuning\inference.py", line 24, in initialize_pipeline
    unet.disable_gradient_checkpointing()
  File "F:\potat text to video\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 216, in disable_gradient_checkpointing
    self.apply(partial(self._set_gradient_checkpointing, value=False))
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 884, in apply
    module.apply(fn)
  File "F:\potat text to video\venv\lib\site-packages\torch\nn\modules\module.py", line 885, in apply
    fn(self)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() got multiple values for argument 'value'

(venv) F:\potat text to video\Text-To-Video-Finetuning>

Folder for models?

In which folder should I put the models from git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b?
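
Any folder works as long as --model points at it. A hedged Python alternative to git clone that mirrors the ./models/model_scope_diffusers path used by the inference command elsewhere on this page:

from huggingface_hub import snapshot_download

# Download the diffusers-format weights into the folder that inference.py's
# --model argument will point at.
snapshot_download(
    repo_id="damo-vilab/text-to-video-ms-1.7b",
    local_dir="./models/model_scope_diffusers",
)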

The video doesn't move

After finetuning, the output video doesn't move; it just stays still. It looks good, but there is no movement.

First GPU occupies more VRAM in distributed training

Suggested change (link):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load(self.cached_data_list[index], map_location=device)
Otherwise, in multi-GPU distributed training, the first GPU may occupy excessive VRAM compared to the other GPUs.
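
A slightly fuller sketch of the same idea (the cache path below is hypothetical): in distributed training, each process should map the load onto its own local device rather than the device the tensors were saved from.

import os
import torch

# Without an explicit map_location, torch.load restores tensors to the device
# they were saved from (often cuda:0), which piles VRAM onto the first GPU.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
cached_latent = torch.load("cached_latents/000000.pt", map_location=device)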

Error with inference

I get this using both my finetuned model and the original 1.7b model:

Traceback (most recent call last):
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 192, in <module>
    videos = inference(**args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 120, in inference
    pipeline = initialize_pipeline(model, device, xformers, sdp)
  File "/content/drive/MyDrive/Text-To-Video-Finetuning/inference.py", line 33, in initialize_pipeline
    unet._set_gradient_checkpointing(value=False)
TypeError: UNet3DConditionModel._set_gradient_checkpointing() missing 1 required positional argument: 'module'
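
This traceback and the one in the previous issue are two sides of the same signature mismatch between the repo's custom UNet3DConditionModel and the installed diffusers version: recent diffusers defines the hook as _set_gradient_checkpointing(self, module, value=False) and applies it to every submodule. A minimal sketch of a compatible override, assuming the repo's custom model class (not its exact code):

import torch.nn as nn

class UNet3DConditionModel(nn.Module):
    # Sketch only: the real model defines many more layers and methods.

    def _set_gradient_checkpointing(self, module, value=False):
        # diffusers invokes this via self.apply(partial(..., value=...)), i.e.
        # once per submodule, so the hook must accept the module argument.
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing = value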


Custom resolution?

Is there a possibility of using a custom resolution for training/inference?

Links to weights?

How about sharing text2video/fine-tuned weights here?

The two working weights I have found so far are these:
damo-vilab/text-to-video-ms-1.7b
strangeman3107/animov-0.1.1

Accelerator 'function' object has no attribute '__func__'

Is there a specific version of accelerate that will work? I recently had to reinstall my requirements, and what worked before doesn't work anymore. I think accelerate changed something on their end that causes this error message. I am using a fresh install at the moment, and everything works up until saving the first checkpoint.

Configuration saved in ./outputs\train_2023-07-02T07-13-31\checkpoint-100\model_index.json
Traceback (most recent call last):
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 994, in <module>
    main(**OmegaConf.load(args.config))
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 899, in main
    save_pipe(
  File "F:\AI\Text-to-Video-Finetuning\train.py", line 506, in save_pipe
    unet, text_encoder = accelerator.prepare(unet, text_encoder)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1182, in prepare
    result = tuple(
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1183, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1022, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\niver\anaconda3\envs\text2video-finetune\lib\site-packages\accelerate\accelerator.py", line 1308, in prepare_model
    model.forward = MethodType(torch.cuda.amp.autocast(dtype=torch.float16)(model.forward.__func__), model)
AttributeError: 'function' object has no attribute '__func__'
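
The failing line in save_pipe calls accelerator.prepare() a second time on models that were already prepared; with mixed_precision="fp16", prepare() replaces model.forward with a plain autocast-wrapping function, which has no __func__ the second time around. A hedged sketch of the workaround, not the repo's exact code: unwrap instead of re-preparing.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
model = accelerator.prepare(model)  # first prepare: fine

# Rather than calling accelerator.prepare(...) again when saving a checkpoint,
# recover the original nn.Module that prepare() wrapped and save that.
original = accelerator.unwrap_model(model)
torch.save(original.state_dict(), "checkpoint.pt")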

Finetune training error: "UnboundLocalError: local variable 'use_offset_noise' referenced before assignment"

After I commented out the get_logger code, training runs with the output below, but hits another error.
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
{'variance_type', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
LoRA rank 16 is too large. setting to: 4
LoRA rank 16 is too large. setting to: 4
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Caching Latents.: 100%|██████████| 38/38 [00:14<00:00, 2.68it/s]
Steps: 0%| | 0/10000 [00:00<?, ?it/s]2628 params have been unfrozen for training.
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 908, in main
loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 810, in finetune_unet
use_offset_noise = use_offset_noise and not rescale_schedule
UnboundLocalError: local variable 'use_offset_noise' referenced before assignment
Steps: 0%| | 0/10000 [00:55<?, ?it/s]
Is there more guidance on finetune training? Many thanks.
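
The error itself is a Python scoping pitfall: assigning to use_offset_noise inside finetune_unet makes the name local to the function, so the read on the right-hand side fails before the assignment completes. A self-contained demonstration of the pitfall and the fix (names mirror the traceback; values are placeholders):

use_offset_noise = True
rescale_schedule = False

def finetune_unet_broken():
    # UnboundLocalError: the assignment below makes use_offset_noise local to
    # this function, so the read on the right-hand side sees an unbound name.
    use_offset_noise = use_offset_noise and not rescale_schedule
    return use_offset_noise

def finetune_unet_fixed():
    # Binding the result to a fresh name leaves the outer flag untouched.
    offset_noise = use_offset_noise and not rescale_schedule
    return offset_noise

print(finetune_unet_fixed())  # True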

> will it be difficult to modify the code to support multi-gpu training?

Thanks, I will try it.

> will it be difficult to modify the code to support multi-gpu training?

I've never tried multiple GPU training, but you may be able to do it naively with accelerate.

accelerate config

You should be prompted to configure your setup, including multiple GPU training.
Then it should be as simple as:

accelerate launch train.py --config ./configs/my_config_hq.yaml

Let us know how it goes if you decide to try! If it doesn't I could try to implement it, but I don't have multiple GPUs and would probably need to rent out a server to do so.

Originally posted by @ExponentialML in #14 (comment)

RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'

After running train.py I get this RuntimeError:

C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
03/24/2023 08:52:48 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
03/24/2023 08:52:50 - INFO - __main__ - ***** Running training *****
03/24/2023 08:52:50 - INFO - __main__ - Num examples = 1
03/24/2023 08:52:50 - INFO - __main__ - Num Epochs = 1200
03/24/2023 08:52:50 - INFO - __main__ - Instantaneous batch size per device = 1
03/24/2023 08:52:50 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
03/24/2023 08:52:50 - INFO - __main__ - Gradient Accumulation steps = 1
03/24/2023 08:52:50 - INFO - __main__ - Total optimization steps = 1200
Steps: 0%| | 0/1200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 498, in <module>
    main(**OmegaConf.load(args.config))
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 394, in main
    loss, latents = finetune_unet(batch, train_encoder=train_text_encoder)
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 339, in finetune_unet
    latents = tensor_to_vae_latent(pixel_values, vae)
  File "D:\Art\Text-To-Video-Finetuning\train.py", line 157, in tensor_to_vae_latent
    latents = vae.encode(t).latent_dist.sample()
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\autoencoder_kl.py", line 164, in encode
    h = self.encoder(x)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\diffusers\models\vae.py", line 109, in forward
    sample = self.conv_in(sample)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\User1\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'
Steps: 0%| | 0/1200 [00:00<?, ?it/s]
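
The log shows Device: cpu together with Mixed precision type: fp16, and half-precision convolutions are simply not implemented on CPU, which is exactly what the error says. A hedged pre-flight check, assuming you intended to train on GPU:

import torch

# fp16 conv kernels only exist on GPU. Either make CUDA visible to PyTorch, or
# switch mixed precision to "no" (fp32) in the accelerate/training config for CPU runs.
if not torch.cuda.is_available():
    raise SystemExit("CUDA not available: install a CUDA build of PyTorch, "
                     "or disable fp16 mixed precision for CPU training.")
print("Training on:", torch.cuda.get_device_name(0))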

error of "from accelerate.logging import get_logger"

Hi, I get the following error when I run finetune training:
(aigc) cwh8szh@SZH-C-006RW:/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning$ CUDA_VISIBLE_DEVICES=1 python train.py --config ./configs/v2/train_config_caixukun.yaml
Traceback (most recent call last):
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 1035, in
main(**OmegaConf.load(args.config))
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 617, in main
create_logging(logging, logger, accelerator)
File "/mnt/workspace/cwh8szh/aigc/Text-To-Video-Finetuning/train.py", line 64, in create_logging
logger.info(accelerator.state, main_process_only=False)
AttributeError: 'str' object has no attribute 'info'

I have tried different versions of accelerate, but cannot solve this error.
Here is the relevant part of my pip list:
Package Version


accelerate 0.20.3
tensorboard 2.10.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tokenizers 0.13.3
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
transformers 4.30.2
triton 2.0.0

TypeError: get_logger() got an unexpected keyword argument 'log_level'

Hi, I am trying to run your script but it always shows me this error.
Another thing is that it's not possible for me to install triton. It's like the repo doesn't exist anymore.

Error caught was: No module named 'triton'
Traceback (most recent call last):
  File "C:\Users\Life\Text-To-Video-Finetuning\train.py", line 43, in <module>
    logger = get_logger(__name__, log_level="INFO")
TypeError: get_logger() got an unexpected keyword argument 'log_level'
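
Both this and the previous issue point to an accelerate version mismatch: older accelerate's get_logger() does not accept a log_level argument, while train.py assumes a newer release. A hedged compatibility shim for train.py, if upgrading accelerate is not an option:

import logging
from accelerate.logging import get_logger

try:
    # Newer accelerate accepts log_level directly.
    logger = get_logger(__name__, log_level="INFO")
except TypeError:
    # Older accelerate: create the logger without it, then set the level on the
    # underlying stdlib logger that get_logger wraps.
    logger = get_logger(__name__)
    logging.getLogger(__name__).setLevel(logging.INFO)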

Question: regarding fine-tuning with images

Is there a rough count of how many images are needed to train a concept when not using a video? I know for LoRA it can be as few as 9-10, but for DreamBooth it's usually 2-3x that amount.

Is this the new version of the model?

A new version of ModelScope was released recently; it was trained for a month longer and can generate better videos. Is this repo using the new model or the old one?

Feature request

Thank you for making this. It seems to work, and I have a model.

I wanted to ask if there is:

  1. a link to a repository we can use to generate videos with our new diffusion models, or a small example of how to do it with Python;
  2. a way to specify the frame rate of the sample videos. Everything seems to sample at 6-8 fps, so the default 24 fps videos play too fast to really see what the sample looks like (see the sketch after this list);
  3. if we use a JSON file, do we also need to specify the video folder, or do the JSON's hyperlinks take care of that?

Thank you!
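
On point 2, a hedged sketch for writing sampled frames at an explicit frame rate instead of the 24 fps default; it assumes frames are HxWx3 uint8 arrays (as the diffusers pipeline returns) and that the imageio-ffmpeg backend is installed for .mp4 output:

import imageio
import numpy as np

# Stand-in frames so the sketch runs on its own; in practice, pass the
# video_frames returned by the pipeline in the version-1 snippet above.
video_frames = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(16)]
imageio.mimsave("sample_8fps.mp4", video_frames, fps=8)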

webui LoRA might be causing errors in checkpoint models

Some weights of the model checkpoint were not used when initializing UNet3DConditionModel:
This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Has anyone else had similar issues? I believe it has to do with the LoRA training, because I only notice this behavior on models created while also training the new webui LoRA. The most recent model did not use the LoRAs and had no such issues.

NameError: name 'glob' is not defined

After I run the script with train_config.yaml I get the error below:

2023-04-09 13:40:38.702636: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 13:40:40 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Non-existant JSON path. Skipping.
Traceback (most recent call last):
  File "/content/Text-To-Video-Finetuning/train.py", line 915, in <module>
    main(**OmegaConf.load(args.config))
  File "/content/Text-To-Video-Finetuning/train.py", line 582, in main
    train_datasets = get_train_dataset(dataset_types, train_data, tokenizer)
  File "/content/Text-To-Video-Finetuning/train.py", line 86, in get_train_dataset
    train_datasets.append(DataSet(**train_data, tokenizer=tokenizer))
  File "/content/Text-To-Video-Finetuning/utils/dataset.py", line 487, in __init__
    self.video_files = glob(f"{path}/*.mp4")
NameError: name 'glob' is not defined
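
The fix is a one-line import change in utils/dataset.py: the code calls glob(...) as a bare function, so the function itself must be imported, not the module. A minimal runnable sketch:

from glob import glob  # `import glob` alone would require calling glob.glob(...)

video_files = glob("./my_videos/*.mp4")  # hypothetical folder, for illustration
print(video_files)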
