Comments (27)

ExponentialML avatar ExponentialML commented on May 26, 2024 3

@seel-channel Glad to see the great improvement! 8~16 frames is more than enough context for most cases, so training with <= 16 GB of VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.

Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.

I'll keep this issue open, as I think there's possibly a bit more room to play with here for 12 GB VRAM users.

XmYx avatar XmYx commented on May 26, 2024 2

Hey, I'm currently running some experiments with training this model. Using torch 2, a fresh xformers, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384 x 256 inputs (20x 2 sec), while having modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. But I'll update to the latest now and try some higher-res samples.
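For reference, a minimal sketch of that offload pattern, assuming diffusers-style text_encoder / vae objects (the helper names here are illustrative, not the repo's actual training code):

```python
import torch

device = "cuda"

def encode_prompt(text_encoder, input_ids):
    # Move the text encoder to the GPU only for this forward pass.
    text_encoder.to(device)
    with torch.no_grad():
        hidden_states = text_encoder(input_ids.to(device))[0]
    text_encoder.to("cpu")   # free VRAM once the embeddings are cached
    torch.cuda.empty_cache()
    return hidden_states

def encode_frames(vae, pixel_values):
    # Same idea for the VAE during frame preparation.
    vae.to(device)
    with torch.no_grad():
        latents = vae.encode(pixel_values.to(device)).latent_dist.sample() * 0.18215
    vae.to("cpu")
    torch.cuda.empty_cache()
    return latents
```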

ExponentialML avatar ExponentialML commented on May 26, 2024 2

@sergiobr & @seel-channel I've updated the repository with gradient checkpointing support.

Can you pull the latest update and see if there's an improvement after adding gradient_checkpointing: True to your configs? You should see a drop in training speed but an improvement in VRAM usage.
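For anyone curious, this is roughly how such a flag is wired up in diffusers-based trainers; a sketch under that assumption, not necessarily this repo's exact code:

```python
def maybe_enable_gradient_checkpointing(config: dict, unet, text_encoder) -> None:
    """Hypothetical helper: apply `gradient_checkpointing: True` from the config."""
    if config.get("gradient_checkpointing", False):
        # diffusers UNets expose this switch; activations are recomputed in the
        # backward pass, trading training speed for lower VRAM usage.
        unet.enable_gradient_checkpointing()
        if config.get("train_text_encoder", False):
            # transformers models use a slightly different method name.
            text_encoder.gradient_checkpointing_enable()
```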

seel-channel avatar seel-channel commented on May 26, 2024 2

It was tested on a 3090 Ti with 24 GB of VRAM.

gradient_checkpointing   24 frames            50 frames              75 frames             100 frames
True                     6 hours (13.6 GB)    10.27 hours (20.3 GB)  12.6 hours (23.4 GB)  out of VRAM
False                    5 hours (23.7 GB)    out of VRAM            out of VRAM           out of VRAM

sergiobr avatar sergiobr commented on May 26, 2024 1

@ExponentialML Thank you very much for the detailed explanation.
I'm very curious about the inner workings of neural networks, and I want to dig deeper and understand them better.

seel-channel avatar seel-channel commented on May 26, 2024 1

Every 256x256 frame costs about 0.5 GB of VRAM. As a base, the model needs ~14 GB of VRAM (a rough estimator sketch follows after the config below).

Frames (count)   VRAM (GB)
26               23.7
25               23.2
24               22.7
01               14.1
trainable_modules:
- "down_blocks.2.attentions"
- "down_blocks.2.temp"
- "up_blocks.2.attentions"
- "up_blocks.2.temp"

maximepeabody avatar maximepeabody commented on May 26, 2024 1

@seel-channel Glad to see the great improvement! 8~16 frames is more than enough context for most cases, so training with <= 16 GB of VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.

Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.

I'll keep this issue open, as I think there's possibly a bit more room to play with here for 12 GB VRAM users.

I'm trying to finetune with 16 frames, using the Tumblr TGIF dataset; I'm hoping to get rid of that tacky "shutterstock" watermark!

What's the best config at the moment when using an A100? Thanks btw, this is great work! I'll make a PR for the TGIF GIF dataset once I've got it working properly.

ExponentialML avatar ExponentialML commented on May 26, 2024 1

@sergiobr Please comment on the PR so any conversions are easy to track, thanks!

I recommend just using the video dataset with captions for now, using "folder" only for quicker testing (you can test on small datasets of 5 videos or so).

Also, as @seel-channel said, you can just set n_sample_frames to "1". I may separate the parameters, as they behave differently for each dataset.

ExponentialML avatar ExponentialML commented on May 26, 2024

Hey! As it currently stands, finetuning with this many sample frames is unfeasible.

Many SOTA implementations (Video Diffusion, Make-A-Video, etc.) sample roughly ~16 frames. This is one of the reasons why the current models are trained at a resolution of 256x256, or use upscaling networks (like Imagen) to generate at 64x64 and then upscale as needed.

I would recommend using either xformers or PyTorch 2.0 to see if you can squeeze out a bit more performance while keeping the sample frames around 4-8. As the model was already trained on a fairly large domain, you should be fine using a low sample frame count, and the model will pick up on the temporal coherency.
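For reference, a sketch of how either option is usually switched on for a diffusers UNet (assumes a recent diffusers version; illustrative, not this repo's exact code):

```python
import torch

def enable_fast_attention(unet) -> None:
    """Hypothetical helper: prefer xformers, otherwise fall back to PyTorch 2.0 SDPA."""
    try:
        # Requires the xformers package to be installed.
        unet.enable_xformers_memory_efficient_attention()
        return
    except Exception as err:
        print(f"xformers not available ({err}), trying PyTorch 2.0 SDPA instead")

    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        # Recent diffusers versions ship an SDPA-based attention processor.
        from diffusers.models.attention_processor import AttnProcessor2_0
        unet.set_attn_processor(AttnProcessor2_0())
```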

As for the video example you've posted, I don't see why it wouldn't be able to. Maybe you could try and report back?

Hope that helps!

seel-channel avatar seel-channel commented on May 26, 2024

Does higher resolution need more VRAM?

seel-channel avatar seel-channel commented on May 26, 2024

After updating to Torch 2.0, I can't increase n_sample_frames higher than 24.

--config

pretrained_model_path: "\\weights\\text-to-video-ms-1.7b" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
output_dir: "weights"
train_text_encoder: False

train_data:
    json_path: "train_data.json"
    preprocessed: True
    width: 256      
    height: 256
    sample_start_idx: 0
    sample_frame_rate: 1 # Process an image every `sample_frame_rate` frames
    n_sample_frames: 24
    use_random_start_idx: False
    shuffle_frames: False
    vid_data_key: "video_path"

    single_video_path: ""
    single_video_prompt: ""

validation_data:
    prompt: ""
    sample_preview: True
    num_frames: 100
    width: 256
    height: 256
    num_inference_steps: 20
    guidance_scale: 9

learning_rate: 5e-6 
adam_weight_decay: 1e-2
train_batch_size: 1
max_train_steps: 50000
checkpointing_steps: 5000
validation_steps: 5000
trainable_modules:
- "attn1"
- "attn2"

seed: 64
mixed_precision: "fp16"
use_8bit_adam: False # This seems to be incompatible at the moment. 
enable_xformers_memory_efficient_attention: True

seel-channel avatar seel-channel commented on May 26, 2024

Hey, I'm currently running some experiments with training this model. Using torch 2, a fresh xformers, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384 x 256 inputs (20x 2 sec), while having modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. But I'll update to the latest now and try some higher-res samples.

Can you share your --config file?

sergiobr avatar sergiobr commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB.
But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB.
That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly.
I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up.
I'll try today with more videos.
How can I use the GPU at maximum speed?

ExponentialML avatar ExponentialML commented on May 26, 2024

@sergiobr Training is going to be a bit slower due to the extra temporal dimension.

Look at it this way. Before, we had this:

(2D UNet Latents): b c h w
Where b == batch, c == channels, h == height, w == width.

Now we have:

(3D UNet Latents): b c f h w

Where it's the same as above, but now we have the frame information as f.

With these two in mind, we now also have a temporal transformer for processing the temporal information (relations across time), so an added attention layer and convolution layer (which consists of four 3D convolution passes, the last acting as an identity).

These two layers alone increase not only the amount of memory, but also the computation time. Remember that each transformer has two attention layers: one for self attention, and another for cross attention (relating image data to text).

If you increase the resolution and use two frames, that's the equivalent of running a batch size of 2 in terms of memory usage. If you increase the batch size and keep the frames the same, you are effectively processing b * f items, increasing VRAM usage. The more frames you add, the slower training will be, as there's more information to process.
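To make the b * f point concrete, here's a rough sketch (using einops; not this repo's actual code) of how spatial layers typically fold the frame axis into the batch axis, while temporal layers treat each spatial location as a sequence over frames:

```python
import torch
from einops import rearrange

b, c, f, h, w = 1, 4, 8, 32, 32          # illustrative latent shape: b c f h w
latents = torch.randn(b, c, f, h, w)

# Spatial attention / convolutions see every frame as an extra batch item,
# so memory scales with b * f just like a larger image batch would.
spatial_in = rearrange(latents, "b c f h w -> (b f) c h w")   # (8, 4, 32, 32)

# Temporal attention treats each pixel position as a length-f sequence.
temporal_in = rearrange(latents, "b c f h w -> (b h w) f c")  # (1024, 8, 4)

print(spatial_in.shape, temporal_in.shape)
```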

If you want to reduce memory usage, you could try finetuning only the second-to-last layers of the encoder and decoder blocks, like so:

trainable_modules:
  - "down_blocks.2.attentions"
  - "down_blocks.2.temp"
  - "up_blocks.2.attentions"
  - "up_blocks.2.temp"

I'm still working on memory optimizations (no guarantees, but I'm making progress), but speed gains would come through mini batching, preprocessing data (resizing videos to match the input size), or xformers / scaled dot product attention through Torch 2.0.

seel-channel avatar seel-channel commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB. But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB. That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly. I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try today with more videos. How can I use the GPU at maximum speed?

@sergiobr, which service did you use?

seel-channel avatar seel-channel commented on May 26, 2024

Is it a good idea to train the text encoder?

ExponentialML avatar ExponentialML commented on May 26, 2024

Is it a good idea to train the text encoder?

I've tried it, but wasn't able to get good results like with the image models.

seel-channel avatar seel-channel commented on May 26, 2024

I've tried it, but wasn't able to get good results like with the image models.

Does it improve the quality or not? Another question: what is offset noise?

sergiobr avatar sergiobr commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB. But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB. That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly. I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try today with more videos. How can I use the GPU at maximum speed?

@sergiobr, Which service did you use?

I do use GCP

ExponentialML avatar ExponentialML commented on May 26, 2024

@seel-channel

I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.

You can actually train the model on single frames / images if you only allow forward passes through the spatial layers (the ones found in traditional Stable Diffusion) and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.

Even if you freeze them, the data still passes through the temporal layers and the model won't train properly. Skipping them instead allows for finetuning on about 13 GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.
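Conceptually, skipping the temporal pass (rather than merely freezing its weights) looks something like this; a hypothetical block sketch, not the repository's actual module:

```python
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Hypothetical sketch: the spatial path always runs, the temporal path can be bypassed."""

    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial = spatial
        self.temporal = temporal

    def forward(self, x, skip_temporal: bool = False):
        x = self.spatial(x)
        if skip_temporal:
            # Single-frame / image training: bypass the temporal layers entirely.
            # This differs from freezing them, where the data would still be
            # transformed by the (frozen) temporal weights.
            return x
        return self.temporal(x)
```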

Once I test it thoroughly I'll update the repository.

sergiobr avatar sergiobr commented on May 26, 2024

@seel-channel

I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.

You can actually train the model on single frames / images if you only allow forward passes through the spatial layers (the ones found in traditional Stable Diffusion) and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.

Even if you freeze them, the data still passes through the temporal layers and the model won't train properly. Skipping them instead allows for finetuning on about 13 GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.

Once I test it thoroughly I'll update the repository.

That's great!

ExponentialML avatar ExponentialML commented on May 26, 2024

If you guys want to test the next version release (which includes image finetuning, text_encoder finetuning, and VRAM optimizations), you can do so in #26.

sergiobr avatar sergiobr commented on May 26, 2024

Great. I'll test it.
Thank you.

sergiobr avatar sergiobr commented on May 26, 2024

@ExponentialML
About this

# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.
  sample_frame_rate: 30

Is there a reason why it can't be standardized?
If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?
Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

seel-channel avatar seel-channel commented on May 26, 2024

@ExponentialML

About this


# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.

 sample_frame_rate: 30

Is there a reason why it can't be standardized?

If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?

Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

If you want to use all the frames you should set sample_frame_rate: 1, not zero. Also, you need to describe every frame that will be used in the frames_data.JSON. It really is every frame: you need to describe all of your frames.
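Roughly, the frame-skip behaviour for the 'json' / 'single_video' modes amounts to something like the sketch below (illustrative, not the dataset class's real code), which is also why 1, not 0, means "keep every frame":

```python
def pick_frame_indices(total_frames: int, n_sample_frames: int,
                       sample_frame_rate: int, start_idx: int = 0) -> list[int]:
    """Take every `sample_frame_rate`-th frame starting at `start_idx`."""
    # A step of 0 would be invalid (range() raises), so 1 means "no skipping".
    indices = list(range(start_idx, total_frames, sample_frame_rate))
    return indices[:n_sample_frames]

print(pick_frame_indices(total_frames=120, n_sample_frames=8, sample_frame_rate=30))
# -> [0, 30, 60, 90]   (only 4 frames available at this skip)
print(pick_frame_indices(total_frames=120, n_sample_frames=8, sample_frame_rate=1))
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```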

sergiobr avatar sergiobr commented on May 26, 2024

@ExponentialML
About this


# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.

 sample_frame_rate: 30

Is there a reason why it can't be standardized?
If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?
Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

If you want to use all the frames you should set sample_frame_rate: 1, not zero. Also, you need to describe every frame that will be used in the frames_data.JSON. It really is every frame: you need to describe all of your frames.

Thank you @seel-channel
I think I got it now.
Sometimes I have errors and can't start training, with messages about tensor sizes. I also used to cut the videos to the same number of frames; I don't know if that's needed as well.

ExponentialML avatar ExponentialML commented on May 26, 2024

I'll close this as I feel the VRAM optimizations are more than sufficient, especially with LoRA training. If this is still an issue, feel free to ping me for a re-open, or start a discussion to talk through further optimizations.
