
Comments (3)

ExponentialML commented on July 23, 2024

Hi. Try this config; I've marked the changed lines with "New". I haven't tested it on my rig, but it should stay under 16GB of VRAM. These settings effectively enable LoRA-only training. I've also added save_pretrained_model, which is normally always true, but you can disable it here by setting it to false if you like.

If you can't get memory savings using PyTorch 2.0, try enabling Xformers if you can, and you can also try enabling validation previews to see if they work.
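
(If you're running this with the repo's training script, the usual pattern is to save the config below to a YAML file and pass it in, e.g. `python train.py --config ./my_train_config.yaml`; the filename here is just an example.)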

Train Config
# Pretrained diffusers model path.
pretrained_model_path: cerspense/zeroscope_v2_576w

# The folder where your training outputs will be placed.
output_dir: "./outputs"

# You can train multiple datasets at once. They will be joined together for training.
# Simply remove the line you don't need, or keep them all for mixed training.

# 'image': A folder of images and captions (.txt)
# 'folder': A folder of videos and captions (.txt)
# 'json': The JSON file created with automatic BLIP2 captions using https://github.com/ExponentialML/Video-BLIP2-Preprocessor
# 'single_video': A single video file.mp4 and text prompt
dataset_types:
#   - 'image'
  - 'folder'
#   - 'json'
#   - 'single_video'

# Adds offset noise to training. See https://www.crosslabs.org/blog/diffusion-with-offset-noise
offset_noise_strength: 0.1
use_offset_noise: True

# When True, this extends all items in all enabled datasets to the highest length.
# For example, if you have 200 videos and 10 images, 10 images will be duplicated to the length of 200.
extend_dataset: False

# Caches the latents (Frames-Image -> VAE -> Latent) to an HDD or SSD.
# The latents will be saved under your training folder, and loaded automatically for training.
# This saves memory, speeds up training, and takes very little disk space.
cache_latents: True # New

# If you have cached latents set to `True` and have a directory of cached latents,
# you can skip the caching process and load previously saved ones.
cached_latent_dir: "./outputs" # New
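# For example (an illustration, not in the original config):
#   First run:   cached_latent_dir: null   -> latents get cached fresh under your training folder
#   Later runs:  cached_latent_dir: "path/to/cached/latents"  -> reuse them
# Only point this at a folder that already contains cached latents.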

# New / Assuming you want to use this with the webui
# https://github.com/cloneofsimo/lora
# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
# This works slightly differently from vanilla LoRA and DOES NOT save a separate file.
# It is simply used as a mechanism for saving memory by keeping layers frozen and training the residual.
# The LoRA rank for each is hard coded at `16`. Go higher and we're back where we started :-).

lora_version: "stable_lora"

# This saves a LoRA that is compatible with the text2video webui extension.
# It only works when the lora version is 'stable_lora'.
# This is also a DIFFERENT implementation than Kohya's, so it will not work the same as that implementation.
save_lora_for_webui: True

# The LoRA file will be converted to a different format to be compatible with the webui extension.
# The difference between this and 'save_lora_for_webui' is that you can continue training a Diffusers pipeline model
# when this version is set to False.
only_lora_for_webui: False

# Train the text encoder. Leave at false to use LoRA only (Recommended).
train_text_encoder: False

# https://github.com/cloneofsimo/lora
# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
# This works slightly differently from vanilla LoRA and DOES NOT save a separate file.
# It is simply used as a mechanism for saving memory by keeping layers frozen and training the residual.

# Use LoRA for the UNET model.
use_unet_lora: True

# Use LoRA for the Text Encoder.
use_text_lora: True

# The modules to use for LoRA. Different from 'trainable_modules'.
unet_lora_modules: # New
  - "ResnetBlock2D"
  - "TransformerTemporalModel" # New / Remove this line if you get OOM
  - "Transformer2DModel"
  - "CrossAttention"
  - "Attention"
  - "GEGLU"

# The modules to use for LoRA. Different from `trainable_text_modules`.
text_encoder_lora_modules:
  - "CLIPAttention" # New

# The rank for LoRA training. With ModelScope, the maximum should be 1024.
# VRAM usage increases with higher rank and decreases with lower rank.
lora_rank: 4 # New

# Training data parameters
train_data:

  # The width and height in which you want your training data to be resized to.
  width: 256
  height: 256

  # This will find the closest aspect ratio to your input width and height.
  # For example, 512x512 width and height with a video of resolution 1280x720 will be resized to 512x256
  use_bucketing: False

  # The start frame index where your videos should start (Leave this at one for json and folder based training).
  sample_start_idx: 1

  # Used for 'folder'. The rate at which your frames are sampled. Does nothing for 'json' and 'single_video' datasets.
  fps: 1

  # For 'single_video' and 'json'. The number of frames to "step" (1,2,3,4) (frame_step=2) -> (1,3,5,7, ...).
  frame_step: 1 # New

  # The number of frames to sample. The higher this number, the higher the VRAM (acts similarly to batch size).
  n_sample_frames: 8 # New

  # 'single_video'
  single_video_path: "path/to/single/video.mp4"

  # The prompt when using a single video file
  single_video_prompt: ""

  # Fallback prompt if caption cannot be read. Enabled for 'image' and 'folder'.
  fallback_prompt: ""

  # 'folder'
  path: "data/videos"

  # 'json'
  json_path: 'path/to/train/json/'

  # 'image'
  image_dir: 'path/to/image/directory'

  # The prompt for all image files. Leave blank to use caption files (.txt)
  single_img_prompt: ""

# Validation data parameters.
validation_data:

  # A custom prompt that is different from your training dataset.
  prompt: "A beautiful girl smiling"

  # Whether or not to sample a preview during training (Requires more VRAM).
  sample_preview: False

  # The number of frames to sample during validation.
  num_frames: 8 # New

  # Height and width of validation sample.
  width: 256
  height: 256

  # Number of inference steps when generating the video.
  num_inference_steps: 20

  # CFG scale
  guidance_scale: 13

# Learning rate for AdamW
learning_rate: 5e-6

# Weight decay. Higher = more regularization. Lower = closer to dataset.
adam_weight_decay: 1e-3

# Optimizer parameters for the UNET. Overrides base learning rate parameters.
extra_unet_params: null
  #learning_rate: 1e-5
  #adam_weight_decay: 1e-4

# Optimizer parameters for the Text Encoder. Overrides base learning rate parameters.
extra_text_encoder_params: null
  #learning_rate: 5e-6
  #adam_weight_decay: 0.2

# How many batches to train. Not to be confused with video frames.
train_batch_size: 1
gradient_accumulation_steps: 1

# Maximum number of train steps. Model is saved after training.
max_train_steps: 5000

# Saves a model every nth step.
checkpointing_steps: 1000

# Save full pretrained model.
save_pretrained_model: True # New

# How many steps to do for validation if sample_preview is enabled.
validation_steps: 200 # New

# Which modules we want to unfreeze for the UNET. Advanced usage.
trainable_modules: null # New
  #- "all"
  # If you want to ignore temporal attention entirely, remove "attn1-2" and replace with ".attentions"
  # This is for self attention. Activates for spatial and temporal dimensions if n_sample_frames > 1
#   - "attn1"

  # This is for cross attention (image & text data). Activates for spatial and temporal dimensions if n_sample_frames > 1
#   - "attn2"

  #  Convolution networks that hold temporal information. Activates for spatial and temporal dimensions if n_sample_frames > 1
#   - 'temp_conv'


# Which modules we want to unfreeze for the Text Encoder. Advanced usage.
trainable_text_modules: null # New

# Seed for validation.
seed: 64

# Whether or not we want to use mixed precision with accelerate
mixed_precision: "fp16"

# This seems to be incompatible at the moment.
use_8bit_adam: False

# Trades speed for VRAM savings. You lose roughly 20% of training speed, but save a lot of VRAM.
# If you need to save more VRAM, it can also be enabled for the text encoder, but this roughly halves speed again.
gradient_checkpointing: True
text_encoder_gradient_checkpointing: True

# Xformers must be installed for best memory savings and performance (< Pytorch 2.0)
enable_xformers_memory_efficient_attention: False

# Use scaled dot product attention (Only available with >= Torch 2.0)
enable_torch_2_attn: True
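
As an aside, if you're wondering what `cache_latents` actually buys you: each clip is pushed through the frozen VAE once and the result is saved to disk, so later epochs skip the VAE forward pass. Here's a minimal sketch of that step; it's illustrative only, not the repo's actual code (the model ID matches `pretrained_model_path` above, but the function name and paths are made up):

```python
# Minimal sketch of latent caching (illustrative, not the repo's exact code).
import torch
from diffusers import AutoencoderKL

# Frozen VAE from the same checkpoint as `pretrained_model_path`.
vae = AutoencoderKL.from_pretrained(
    "cerspense/zeroscope_v2_576w", subfolder="vae", torch_dtype=torch.float16
).to("cuda").requires_grad_(False)

@torch.no_grad()
def cache_clip_latents(frames: torch.Tensor, out_path: str) -> None:
    """frames: (num_frames, 3, H, W), values scaled to [-1, 1]."""
    latents = vae.encode(frames.to("cuda", dtype=torch.float16)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor  # 0.18215 for SD-family VAEs
    # Saved once; later epochs load this tensor and skip the VAE entirely.
    torch.save(latents.cpu(), out_path)
```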


mmakiuchi commented on July 23, 2024

Thank you for the reply!
When fine-tuning with the configuration above, I get a `ZeroDivisionError: division by zero` on the line `num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)`.
If I change the configuration line `cached_latent_dir: "./outputs"` to `cached_latent_dir: null`, it works.
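
For reference, here's roughly why that division by zero happens (a sketch of the usual diffusers-style step bookkeeping; the repo's exact code may differ). Pointing `cached_latent_dir` at a folder with no cached latents yields an empty dataset, so the steps-per-epoch count comes out as zero:

```python
# Sketch of the failing bookkeeping (illustrative, not the repo's exact code).
import math

len_dataloader = 0               # "./outputs" held no cached latents yet
gradient_accumulation_steps = 1  # from the config above
max_train_steps = 5000           # from the config above

num_update_steps_per_epoch = math.ceil(len_dataloader / gradient_accumulation_steps)  # -> 0
num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)
# ZeroDivisionError: division by zero
```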


ExponentialML commented on July 23, 2024

> Thank you for the reply! When fine-tuning with the configuration above, I get a `ZeroDivisionError: division by zero` on the line `num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)`. If I change the configuration line `cached_latent_dir: "./outputs"` to `cached_latent_dir: null`, it works.

Glad it's working for you! Yes, this was my bad. For cached latents to work, it should be `cache_latents: True`.
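
In other words: keep `cache_latents: True` and leave `cached_latent_dir: null` on the first run so the latents actually get written under the output folder; only point `cached_latent_dir` at a directory once it already contains cached latents, otherwise the dataset comes up empty and you get the division by zero above.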

