Comments (27)

ExponentialML avatar ExponentialML commented on May 26, 2024 3

@seel-channel Glad to see the great improvement! 8~16 frames is more than enough context for most cases, so training with <= 16 GB of VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.

Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.

I'll keep this issue open, as I think there's possibly a bit more room to play with here for 12 GB VRAM users.

XmYx avatar XmYx commented on May 26, 2024 2

Hey, I'm currently running some experiments with training this model. Using torch 2, a fresh xformers, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384 x 256 inputs (20x 2 sec), while having modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. But I'll update to the latest now and try some higher-res samples.
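For reference, a minimal sketch of that offload pattern, assuming diffusers-style text_encoder / vae objects (the helper names here are illustrative, not the repo's actual training code):

```python
import torch

device = "cuda"

def encode_prompt(text_encoder, input_ids):
    # Move the text encoder to the GPU only for this forward pass.
    text_encoder.to(device)
    with torch.no_grad():
        hidden_states = text_encoder(input_ids.to(device))[0]
    text_encoder.to("cpu")   # free VRAM once the embeddings are cached
    torch.cuda.empty_cache()
    return hidden_states

def encode_frames(vae, pixel_values):
    # Same idea for the VAE during frame preparation.
    vae.to(device)
    with torch.no_grad():
        latents = vae.encode(pixel_values.to(device)).latent_dist.sample() * 0.18215
    vae.to("cpu")
    torch.cuda.empty_cache()
    return latents
```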

ExponentialML avatar ExponentialML commented on May 26, 2024 2

@sergiobr & @seel-channel I've updated the repository with gradient checkpointing support.

Can you pull the latest update and see if there's an improvement after adding gradient_checkpointing: True to your configs? You should see a drop in training speed but an improvement in VRAM usage.
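For anyone curious, this is roughly how such a flag is wired up in diffusers-based trainers; a sketch under that assumption, not necessarily this repo's exact code:

```python
def maybe_enable_gradient_checkpointing(config: dict, unet, text_encoder) -> None:
    """Hypothetical helper: apply `gradient_checkpointing: True` from the config."""
    if config.get("gradient_checkpointing", False):
        # diffusers UNets expose this switch; activations are recomputed in the
        # backward pass, trading training speed for lower VRAM usage.
        unet.enable_gradient_checkpointing()
        if config.get("train_text_encoder", False):
            # transformers models use a slightly different method name.
            text_encoder.gradient_checkpointing_enable()
```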

seel-channel avatar seel-channel commented on May 26, 2024 2

It was tested on a 3090 Ti with 24 GB of VRAM.

gradient_checkpointing   24 frames            50 frames              75 frames             100 frames
True                     6 hours (13.6 GB)    10.27 hours (20.3 GB)  12.6 hours (23.4 GB)  out of VRAM
False                    5 hours (23.7 GB)    out of VRAM            out of VRAM           out of VRAM

sergiobr avatar sergiobr commented on May 26, 2024 1

@ExponentialML Thank you very much for the detailed explanation.
I'm very curious about the inner workings of neural networks, and I want to dig deeper and understand them better.

seel-channel avatar seel-channel commented on May 26, 2024 1

Every 256x256 frame costs about 0.5 GB of VRAM. As a base, the model needs ~14 GB of VRAM (a rough estimator sketch follows after the config below).

Frames (count)   VRAM (GB)
26               23.7
25               23.2
24               22.7
01               14.1
trainable_modules:
- "down_blocks.2.attentions"
- "down_blocks.2.temp"
- "up_blocks.2.attentions"
- "up_blocks.2.temp"

maximepeabody avatar maximepeabody commented on May 26, 2024 1

@seel-channel Glad to see the great improvement! 8~16 frames is more than enough context for most cases, so training with <= 16 GB of VRAM should be the norm now with all attention layers unlocked. I'm not certain the model can retain that much information without using a different attention mechanism altogether for the temporal layers.

Convolution layers are still tricky to finetune, as is getting around the increased VRAM usage. I'll update the default configs to better match different use cases.

I'll keep this issue open, as I think there's possibly a bit more room to play with here for 12 GB VRAM users.

I'm trying to finetune with 16 frames, using the Tumblr TGIF dataset; I'm hoping to get rid of that tacky "shutterstock" watermark!

What's the best config at the moment when using an A100? Thanks btw, this is great work! I'll make a PR for the TGIF GIF dataset once I've got it working properly.

ExponentialML avatar ExponentialML commented on May 26, 2024 1

@sergiobr Please comment on the PR so any conversions are easy to track, thanks!

I recommend just using the video dataset with captions for now, using "folder" only for quicker testing (you can test on small datasets of 5 videos or so).

Also, as @seel-channel said, you can just set n_sample_frames to "1". I may separate the parameters, as they behave differently for each dataset.

ExponentialML avatar ExponentialML commented on May 26, 2024

Hey! As it currently stands, finetuning with this many sample frames is unfeasible.

Many SOTA implementations (Video Diffusion, Make-A-Video, etc.) sample roughly ~16 frames. This is one of the reasons why the current models are trained at a resolution of 256x256, or use upscaling networks (like Imagen) to generate at 64x64 and then upscale as needed.

I would recommend using either xformers or PyTorch 2.0 to see if you can squeeze out a bit more performance while keeping the sample frames around 4-8. As the model was already trained on a fairly large domain, you should be fine using a low sample frame count, and the model will pick up on the temporal coherency.
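For reference, a sketch of how either option is usually switched on for a diffusers UNet (assumes a recent diffusers version; illustrative, not this repo's exact code):

```python
import torch

def enable_fast_attention(unet) -> None:
    """Hypothetical helper: prefer xformers, otherwise fall back to PyTorch 2.0 SDPA."""
    try:
        # Requires the xformers package to be installed.
        unet.enable_xformers_memory_efficient_attention()
        return
    except Exception as err:
        print(f"xformers not available ({err}), trying PyTorch 2.0 SDPA instead")

    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        # Recent diffusers versions ship an SDPA-based attention processor.
        from diffusers.models.attention_processor import AttnProcessor2_0
        unet.set_attn_processor(AttnProcessor2_0())
```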

As for the video example you've posted, I don't see why it wouldn't be able to. Maybe you could try and report back?

Hope that helps!

seel-channel avatar seel-channel commented on May 26, 2024

Does higher resolution need more VRAM?

seel-channel avatar seel-channel commented on May 26, 2024

After updating to Torch 2.0, I can't increase n_sample_frames higher than 24.

--config

pretrained_model_path: "\\weights\\text-to-video-ms-1.7b" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main
output_dir: "weights"
train_text_encoder: False

train_data:
    json_path: "train_data.json"
    preprocessed: True
    width: 256      
    height: 256
    sample_start_idx: 0
    sample_frame_rate: 1 # Process an image every `sample_frame_rate` frames
    n_sample_frames: 24
    use_random_start_idx: False
    shuffle_frames: False
    vid_data_key: "video_path"

    single_video_path: ""
    single_video_prompt: ""

validation_data:
    prompt: ""
    sample_preview: True
    num_frames: 100
    width: 256
    height: 256
    num_inference_steps: 20
    guidance_scale: 9

learning_rate: 5e-6 
adam_weight_decay: 1e-2
train_batch_size: 1
max_train_steps: 50000
checkpointing_steps: 5000
validation_steps: 5000
trainable_modules:
- "attn1"
- "attn2"

seed: 64
mixed_precision: "fp16"
use_8bit_adam: False # This seems to be incompatible at the moment. 
enable_xformers_memory_efficient_attention: True

seel-channel avatar seel-channel commented on May 26, 2024

Hey, I'm currently running some experiments with training this model. Using torch 2, a fresh xformers, and 24 GB of VRAM, I can do n_sample_frames: 48 with 384 x 256 inputs (20x 2 sec), while having modules set to "attn1" and "attn2" and text_encoder training on. After getting the encoder hidden states I move the text encoder back to the CPU, and at frame preparation I also move the VAE back and forth. But I'll update to the latest now and try some higher-res samples.

Can you share your --config file?

sergiobr avatar sergiobr commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB.
But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB.
That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly.
I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up.
I'll try today with more videos.
How can I use the GPU at maximum speed?

ExponentialML avatar ExponentialML commented on May 26, 2024

@sergiobr Training is going to be a bit slower due to the extra temporal dimension.

Look at it this way. Before, we had this:

(2D UNet Latents): b c h w
Where b == batch, c == channels, h == height, w == width.

Now we have:

(3D UNet Latents): b c f h w

Where it's the same as above, but now we have the frame information as f.

With these two in mind, we now also have a temporal transformer for processing the temporal information (relations across time), so an added attention layer and convolution layer (which consists of four 3D convolution passes, the last acting as an identity).

These two layers alone increase not only the amount of memory, but also the computation time. Remember that each transformer has two attention layers: one for self attention, and another for cross attention (relating image data to text).

If you increase the resolution and use two frames, that's the equivalent of running a batch size of 2 in terms of memory usage. If you increase the batch size and keep the frames the same, you are effectively processing b * f items, increasing VRAM usage. The more frames you add, the slower training will be, as there's more information to process.
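To make the b * f point concrete, here's a rough sketch (using einops; not this repo's actual code) of how spatial layers typically fold the frame axis into the batch axis, while temporal layers treat each spatial location as a sequence over frames:

```python
import torch
from einops import rearrange

b, c, f, h, w = 1, 4, 8, 32, 32          # illustrative latent shape: b c f h w
latents = torch.randn(b, c, f, h, w)

# Spatial attention / convolutions see every frame as an extra batch item,
# so memory scales with b * f just like a larger image batch would.
spatial_in = rearrange(latents, "b c f h w -> (b f) c h w")   # (8, 4, 32, 32)

# Temporal attention treats each pixel position as a length-f sequence.
temporal_in = rearrange(latents, "b c f h w -> (b h w) f c")  # (1024, 8, 4)

print(spatial_in.shape, temporal_in.shape)
```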

If you want to reduce memory usage, you could try finetuning only the second-to-last layers of the encoder and decoder blocks, like so:

trainable_modules:
  - "down_blocks.2.attentions"
  - "down_blocks.2.temp"
  - "up_blocks.2.attentions"
  - "up_blocks.2.temp"

I'm still working on memory optimizations (no guarantees, but I'm making progress), but speed gains would come through mini batching, preprocessing data (resizing videos to match the input size), or xformers / scaled dot product attention through Torch 2.0.

seel-channel avatar seel-channel commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB. But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB. That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly. I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try today with more videos. How can I use the GPU at maximum speed?

@sergiobr, which service did you use?

seel-channel avatar seel-channel commented on May 26, 2024

Is it a good idea to train the text encoder?

ExponentialML avatar ExponentialML commented on May 26, 2024

Is it a good idea to train the text encoder?

I've tried it, but wasn't able to get good results like with the image models.

seel-channel avatar seel-channel commented on May 26, 2024

I've tried it, but wasn't able to get good results like with the image models.

Does it improve the quality or not? Another question: what is offset noise?

sergiobr avatar sergiobr commented on May 26, 2024

I was experimenting yesterday with an A100 80 GB. But I think something is wrong: the it/s didn't go above 2, and the maximum VRAM used was 41 GB. That looks very low compared to training DreamBooth on images, which can reach 15-30 it/s if I remember correctly. I was training in HD with videos of 2 frames at 512x512, but maybe it's because I had just 10 of them? No matter what batch size I set, it doesn't speed up. I'll try today with more videos. How can I use the GPU at maximum speed?

@sergiobr, Which service did you use?

I do use GCP

ExponentialML avatar ExponentialML commented on May 26, 2024

@seel-channel

I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.

You can actually train the model on single frames / images if you only allow forward passes through the spatial layers (the ones found in traditional Stable Diffusion) and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.

Even if you freeze them, the data still passes through the temporal layers and the model won't train properly. Skipping them instead allows for finetuning on about 13 GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.
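Conceptually, skipping the temporal pass (rather than merely freezing its weights) looks something like this; a hypothetical block sketch, not the repository's actual module:

```python
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Hypothetical sketch: the spatial path always runs, the temporal path can be bypassed."""

    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial = spatial
        self.temporal = temporal

    def forward(self, x, skip_temporal: bool = False):
        x = self.spatial(x)
        if skip_temporal:
            # Single-frame / image training: bypass the temporal layers entirely.
            # This differs from freezing them, where the data would still be
            # transformed by the (frozen) temporal weights.
            return x
        return self.temporal(x)
```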

Once I test it thoroughly I'll update the repository.

sergiobr avatar sergiobr commented on May 26, 2024

@seel-channel

I revisited your idea and discovered why text encoder training wasn't working for me initially. The data isn't transferred properly to the text encoder because there isn't a temporal attention layer in the CLIP encoder.

You can actually train the model on single frames / images if you only allow forward passes through the spatial layers (the ones found in traditional Stable Diffusion) and skip the temporal ones entirely. This isn't to be confused with keeping the temporal layers frozen.

Even if you freeze them, the data still passes through the temporal layers and the model won't train properly. Skipping them instead allows for finetuning on about 13 GB of VRAM with plausible results (with the text encoder frozen). With it unfrozen, it seems to overfit very quickly.

Once I test it thoroughly I'll update the repository.

That's great!

ExponentialML avatar ExponentialML commented on May 26, 2024

If you guys want to test the next version release (which includes image finetuning, text_encoder finetuning, and VRAM optimizations), you can do so in #26.

sergiobr avatar sergiobr commented on May 26, 2024

Great. I'll test it.
Thank you.

sergiobr avatar sergiobr commented on May 26, 2024

@ExponentialML
About this

# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.
  sample_frame_rate: 30

Is there a reason why it can't be standardized?
If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?
Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

seel-channel avatar seel-channel commented on May 26, 2024

@ExponentialML

About this


# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.

 sample_frame_rate: 30

Is there a reason why it can't be standardized?

If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?

Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

If you want to use all the frames you should set sample_frame_rate: 1, not zero. Also, you need to describe every frame that will be used in the frames_data.JSON. It really is every frame: you need to describe all of your frames.
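Roughly, the frame-skip behaviour for the 'json' / 'single_video' modes amounts to something like the sketch below (illustrative, not the dataset class's real code), which is also why 1, not 0, means "keep every frame":

```python
def pick_frame_indices(total_frames: int, n_sample_frames: int,
                       sample_frame_rate: int, start_idx: int = 0) -> list[int]:
    """Take every `sample_frame_rate`-th frame starting at `start_idx`."""
    # A step of 0 would be invalid (range() raises), so 1 means "no skipping".
    indices = list(range(start_idx, total_frames, sample_frame_rate))
    return indices[:n_sample_frames]

print(pick_frame_indices(total_frames=120, n_sample_frames=8, sample_frame_rate=30))
# -> [0, 30, 60, 90]   (only 4 frames available at this skip)
print(pick_frame_indices(total_frames=120, n_sample_frames=8, sample_frame_rate=1))
# -> [0, 1, 2, 3, 4, 5, 6, 7]
```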

sergiobr avatar sergiobr commented on May 26, 2024

@ExponentialML
About this


# The rate at which your frames are sampled. 'folder' samples FPS like, 'json' and 'single_video' act as frame skip.

 sample_frame_rate: 30

Is there a reason why it can't be standardized?
If I understand correctly, when I don't want to skip frames in the JSON file mode, do I need to set it to 0?
Also, when using JSON, will the code only read the frames described with a prompt in the file, or will it use all frames for training?

If you want to use all the frames you should set sample_frame_rate: 1, not zero. Also, you need to describe every frame that will be used in the frames_data.JSON. It really is every frame: you need to describe all of your frames.

Thank you @seel-channel
I think I got it now.
Sometimes I have errors and can't start training, with messages about tensor sizes. I also used to cut the videos to the same number of frames; I don't know if that's needed as well.

ExponentialML avatar ExponentialML commented on May 26, 2024

I'll close this as I feel the VRAM optimizations are more than sufficient, especially with LoRA training. If this is still an issue, feel free to ping me for a re-open, or start a discussion to talk through further optimizations.
