
Comments (40)

ImBadAtNames2019 commented on May 27, 2024

AttributeError: 'DDPMScheduler' object has no attribute 'prediction_type'
Steps: 0% 0/10000 [00:00<?, ?it/s]

That's it, I give up. This new version is making me want to jump off the balcony. I'll just wait for the VideoCrafter implementation.

ImBadAtNames2019 commented on May 27, 2024

I'm using the exact same dataset that I used in the previous version of this repo, and it worked with no problem before.

ImBadAtNames2019 commented on May 27, 2024

I added import glob at the top of the dataset.py file and now I get this error:

33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Traceback (most recent call last):
File "/content/Text-To-Video-Finetuning/train.py", line 915, in
main(**OmegaConf.load(args.config))
File "/content/Text-To-Video-Finetuning/train.py", line 582, in main
train_datasets = get_train_dataset(dataset_types, train_data, tokenizer)
File "/content/Text-To-Video-Finetuning/train.py", line 86, in get_train_dataset
train_datasets.append(DataSet(**train_data, tokenizer=tokenizer))
File "/content/Text-To-Video-Finetuning/utils/dataset.py", line 488, in init
self.video_files = glob(f"{path}/*.mp4")
TypeError: 'module' object is not callable
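
(For context: import glob binds the module object, not the function, so calling glob(...) directly raises exactly this TypeError. A minimal sketch of the failure and the two usual fixes, with a hypothetical folder path:)

import glob

path = "/content/videos"  # hypothetical dataset folder
# video_files = glob(f"{path}/*.mp4")   # TypeError: 'module' object is not callable

# Fix 1: call the function inside the module
video_files = glob.glob(f"{path}/*.mp4")

# Fix 2: import the function itself, which is what the call site in dataset.py expects
from glob import glob
video_files = glob(f"{path}/*.mp4")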

ImBadAtNames2019 commented on May 27, 2024

I'm running it on Google Colab, if that matters.

ImBadAtNames2019 commented on May 27, 2024

I tried adding from glob import glob instead of import glob at the top of the dataset.py and train.py scripts, and now I get this error instead:

Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:00<?, ?it/s]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:915 in <module> │
│ │
│ 912 │ parser.add_argument("--config", type=str, default="./configs/my_co │
│ 913 │ args = parser.parse_args() │
│ 914 │ │
│ ❱ 915 │ main(**OmegaConf.load(args.config)) │
│ 916 │
│ │
│ /content/Text-To-Video-Finetuning/train.py:604 in main │
│ │
│ 601 │ ) │
│ 602 │ │
│ 603 │ # Latents caching │
│ ❱ 604 │ cached_data_loader = handle_cache_latents( │
│ 605 │ │ cache_latents, │
│ 606 │ │ output_dir, │
│ 607 │ │ train_dataloader, │
│ │
│ /content/Text-To-Video-Finetuning/train.py:333 in handle_cache_latents │
│ │
│ 330 │ │ cache_save_dir = f"{output_dir}/cached_latents" │
│ 331 │ │ os.makedirs(cache_save_dir, exist_ok=True) │
│ 332 │ │ │
│ ❱ 333 │ │ for i, batch in enumerate(tqdm(train_dataloader, desc="Caching │
│ 334 │ │ │ │
│ 335 │ │ │ save_name = f"cached_{i}" │
│ 336 │ │ │ full_out_path = f"{cache_save_dir}/{save_name}.pt" │
│ │
│ /usr/local/lib/python3.9/dist-packages/tqdm/std.py:1178 in __iter__ │
│ │
│ 1175 │ │ time = self._time │
│ 1176 │ │ │
│ 1177 │ │ try: │
│ ❱ 1178 │ │ │ for obj in iterable: │
│ 1179 │ │ │ │ yield obj │
│ 1180 │ │ │ │ # Update and possibly print the progressbar. │
│ 1181 │ │ │ │ # Note: does not call self.update(1) for speed optimi │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:634 in │
│ __next__ │
│ │
│ 631 │ │ │ if self._sampler_iter is None: │
│ 632 │ │ │ │ # TODO(pytorch/pytorch#7675
│ 633 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 634 │ │ │ data = self._next_data() │
│ 635 │ │ │ self._num_yielded += 1 │
│ 636 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 637 │ │ │ │ │ self._IterableDataset_len_called is not None and │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:678 in │
│ _next_data │
│ │
│ 675 │ │
│ 676 │ def _next_data(self): │
│ 677 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 678 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIt │
│ 679 │ │ if self._pin_memory: │
│ 680 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memor │
│ 681 │ │ return data │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in fetch │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in <listcomp> │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:549 in __getitem__ │
│ │
│ 546 │ │
│ 547 │ def __getitem__(self, index): │
│ 548 │ │ │
│ ❱ 549 │ │ video, _ = self.process_video_wrapper(self.video_files[index]) │
│ 550 │ │ │
│ 551 │ │ if os.path.exists(self.video_files[index].replace(".mp4", ".tx │
│ 552 │ │ │ with open(self.video_files[index].replace(".mp4", ".txt"), │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:522 in │
│ process_video_wrapper │
│ │
│ 519 │ │ return video, vr │
│ 520 │ │
│ 521 │ def process_video_wrapper(self, vid_path): │
│ ❱ 522 │ │ video, vr = process_video( │
│ 523 │ │ │ │ vid_path, │
│ 524 │ │ │ │ self.use_bucketing, │
│ 525 │ │ │ │ self.width, │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:78 in process_video │
│ │
│ 75 def process_video(vid_path, use_bucketing, w, h, get_frame_buckets, ge │
│ 76 │ if use_bucketing: │
│ 77 │ │ vr = decord.VideoReader(vid_path) │
│ ❱ 78 │ │ resize = get_frame_buckets(vr) │
│ 79 │ │ video = get_frame_batch(vr, resize=resize) │
│ 80 │ │
│ 81 │ else: │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:497 in get_frame_buckets │
│ │
│ 494 │ │
│ 495 │ def get_frame_buckets(self, vr): │
│ 496 │ │ _, h, w = vr[0].shape │
│ ❱ 497 │ │ width, height = sensible_buckets(self.width, self.height, h, w │
│ 498 │ │ resize = T.transforms.Resize((height, width), antialias=True) │
│ 499 │ │ │
│ 500 │ │ return resize │
╰──────────────────────────────────────────────────────────────────────────────╯
NameError: name 'sensible_buckets' is not defined
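
(sensible_buckets presumably lives elsewhere in the repo's utils and picks a resize target that matches the clip's aspect ratio; the NameError just means dataset.py no longer imports it. A rough, hypothetical sketch of what such a bucketing helper does, not the repo's actual implementation:)

# Hypothetical aspect-ratio bucketing helper, for illustration only.
def sensible_buckets(target_w, target_h, frame_h, frame_w):
    # Keep the requested size budget but follow the clip's aspect ratio,
    # snapping to multiples of 64 as diffusion UNets typically require.
    aspect = frame_w / frame_h
    if aspect >= 1.0:                                  # landscape or square clip
        width, height = target_w, int(round(target_w / aspect))
    else:                                              # portrait clip
        width, height = int(round(target_h * aspect)), target_h
    snap = lambda x: max(64, (x // 64) * 64)
    return snap(width), snap(height)

# e.g. a 384x384 target with a 1280x720 clip -> (384, 192)
print(sensible_buckets(384, 384, 720, 1280))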

ImBadAtNames2019 commented on May 27, 2024

Maybe it's because I'm not using conda? I can't make conda work on Google Colab.

ImBadAtNames2019 commented on May 27, 2024

Nope, I have no idea what to do; I can't use this at all now.

ImBadAtNames2019 commented on May 27, 2024

Now I get this error:

2023-04-09 17:03:29.790548: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 17:03:31 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:01<?, ?it/s]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:914 in <module> │
│ │
│ 911 │ parser.add_argument("--config", type=str, default="./configs/my_co │
│ 912 │ args = parser.parse_args() │
│ 913 │ │
│ ❱ 914 │ main(**OmegaConf.load(args.config)) │
│ 915 │
│ │
│ /content/Text-To-Video-Finetuning/train.py:603 in main │
│ │
│ 600 │ ) │
│ 601 │ │
│ 602 │ # Latents caching │
│ ❱ 603 │ cached_data_loader = handle_cache_latents( │
│ 604 │ │ cache_latents, │
│ 605 │ │ output_dir, │
│ 606 │ │ train_dataloader, │
│ │
│ /content/Text-To-Video-Finetuning/train.py:332 in handle_cache_latents │
│ │
│ 329 │ │ cache_save_dir = f"{output_dir}/cached_latents" │
│ 330 │ │ os.makedirs(cache_save_dir, exist_ok=True) │
│ 331 │ │ │
│ ❱ 332 │ │ for i, batch in enumerate(tqdm(train_dataloader, desc="Caching │
│ 333 │ │ │ │
│ 334 │ │ │ save_name = f"cached_{i}" │
│ 335 │ │ │ full_out_path = f"{cache_save_dir}/{save_name}.pt" │
│ │
│ /usr/local/lib/python3.9/dist-packages/tqdm/std.py:1178 in __iter__ │
│ │
│ 1175 │ │ time = self._time │
│ 1176 │ │ │
│ 1177 │ │ try: │
│ ❱ 1178 │ │ │ for obj in iterable: │
│ 1179 │ │ │ │ yield obj │
│ 1180 │ │ │ │ # Update and possibly print the progressbar. │
│ 1181 │ │ │ │ # Note: does not call self.update(1) for speed optimi │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:634 in │
│ __next__ │
│ │
│ 631 │ │ │ if self._sampler_iter is None: │
│ 632 │ │ │ │ # TODO(pytorch/pytorch#7675
│ 633 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 634 │ │ │ data = self._next_data() │
│ 635 │ │ │ self._num_yielded += 1 │
│ 636 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 637 │ │ │ │ │ self._IterableDataset_len_called is not None and │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:678 in │
│ _next_data │
│ │
│ 675 │ │
│ 676 │ def _next_data(self): │
│ 677 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 678 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIt │
│ 679 │ │ if self._pin_memory: │
│ 680 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memor │
│ 681 │ │ return data │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in fetch │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in <listcomp> │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:560 in __getitem__ │
│ │
│ 557 │ │ │
│ 558 │ │ prompt_ids = self.get_prompt_ids(prompt) │
│ 559 │ │ │
│ ❱ 560 │ │ return {"pixel_values": (video / 127.5 - 1.0), "prompt_ids": p │
│ 561 │
│ 562 class CachedDataset(Dataset): │
│ 563 │ def __init__(self, cache_dir: str = ''): │
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: unsupported operand type(s) for /: 'tuple' and 'float'
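
(The TypeError itself is straightforward to reproduce: process_video_wrapper returns a (video, reader) pair, and if that tuple, rather than the unpacked frame tensor, reaches the normalization step, dividing it by 127.5 fails. A minimal sketch with a dummy tensor standing in for the decoded frames:)

import torch

video = torch.randint(0, 255, (8, 3, 64, 64)).float()  # dummy decoded frames
reader = object()                                       # stands in for the decord VideoReader

pair = (video, reader)
# pair / 127.5 - 1.0       # TypeError: unsupported operand type(s) for /: 'tuple' and 'float'

frames, _ = pair            # unpack first, then normalize to [-1, 1]
pixel_values = frames / 127.5 - 1.0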

ExponentialML commented on May 27, 2024

I just pushed a quick fix. Can you check to see if it works?

ImBadAtNames2019 commented on May 27, 2024

I just pushed a quick fix. Can you check to see if it works?

I'm testing now, give me one second.

ImBadAtNames2019 commented on May 27, 2024

I just pushed a quick fix. Can you check to see if it works?

Nope, now I get this error. I'm using the default config file; I only changed the location of the model and the location of the folder containing the videos.

2023-04-09 19:42:03.849781: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 19:42:05 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'mid_block_scale_factor', 'downsample_padding'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/29 [00:00<?, ?it/s]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:914 in <module> │
│ │
│ 911 │ parser.add_argument("--config", type=str, default="./configs/my_co │
│ 912 │ args = parser.parse_args() │
│ 913 │ │
│ ❱ 914 │ main(**OmegaConf.load(args.config)) │
│ 915 │
│ │
│ /content/Text-To-Video-Finetuning/train.py:603 in main │
│ │
│ 600 │ ) │
│ 601 │ │
│ 602 │ # Latents caching │
│ ❱ 603 │ cached_data_loader = handle_cache_latents( │
│ 604 │ │ cache_latents, │
│ 605 │ │ output_dir, │
│ 606 │ │ train_dataloader, │
│ │
│ /content/Text-To-Video-Finetuning/train.py:332 in handle_cache_latents │
│ │
│ 329 │ │ cache_save_dir = f"{output_dir}/cached_latents" │
│ 330 │ │ os.makedirs(cache_save_dir, exist_ok=True) │
│ 331 │ │ │
│ ❱ 332 │ │ for i, batch in enumerate(tqdm(train_dataloader, desc="Caching │
│ 333 │ │ │ │
│ 334 │ │ │ save_name = f"cached_{i}" │
│ 335 │ │ │ full_out_path = f"{cache_save_dir}/{save_name}.pt" │
│ │
│ /usr/local/lib/python3.9/dist-packages/tqdm/std.py:1178 in __iter__ │
│ │
│ 1175 │ │ time = self._time │
│ 1176 │ │ │
│ 1177 │ │ try: │
│ ❱ 1178 │ │ │ for obj in iterable: │
│ 1179 │ │ │ │ yield obj │
│ 1180 │ │ │ │ # Update and possibly print the progressbar. │
│ 1181 │ │ │ │ # Note: does not call self.update(1) for speed optimi │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:634 in │
│ __next__ │
│ │
│ 631 │ │ │ if self._sampler_iter is None: │
│ 632 │ │ │ │ # TODO(pytorch/pytorch#7675
│ 633 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 634 │ │ │ data = self._next_data() │
│ 635 │ │ │ self._num_yielded += 1 │
│ 636 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 637 │ │ │ │ │ self._IterableDataset_len_called is not None and │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:678 in │
│ _next_data │
│ │
│ 675 │ │
│ 676 │ def _next_data(self): │
│ 677 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 678 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIt │
│ 679 │ │ if self._pin_memory: │
│ 680 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memor │
│ 681 │ │ return data │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in fetch │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in <listcomp> │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:550 in __getitem__ │
│ │
│ 547 │ │
│ 548 │ def __getitem__(self, index): │
│ 549 │ │ │
│ ❱ 550 │ │ video, _ = self.process_video_wrapper(self.video_files[index]) │
│ 551 │ │ │
│ 552 │ │ if os.path.exists(self.video_files[index].replace(".mp4", ".tx │
│ 553 │ │ │ with open(self.video_files[index].replace(".mp4", ".txt"), │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:523 in │
│ process_video_wrapper │
│ │
│ 520 │ │ return video, vr │
│ 521 │ │
│ 522 │ def process_video_wrapper(self, vid_path): │
│ ❱ 523 │ │ video, vr = process_video( │
│ 524 │ │ │ │ vid_path, │
│ 525 │ │ │ │ self.use_bucketing, │
│ 526 │ │ │ │ self.width, │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:80 in process_video │
│ │
│ 77 │ if use_bucketing: │
│ 78 │ │ vr = decord.VideoReader(vid_path) │
│ 79 │ │ resize = get_frame_buckets(vr) │
│ ❱ 80 │ │ video = get_frame_batch(vr, resize=resize) │
│ 81 │ │
│ 82 │ else: │
│ 83 │ │ vr = decord.VideoReader(vid_path, width=w, height=h) │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:507 in get_frame_batch │
│ │
│ 504 │ │ native_fps = vr.get_avg_fps() │
│ 505 │ │ every_nth_frame = round(native_fps / self.fps) │
│ 506 │ │ │
│ ❱ 507 │ │ effective_length = len(vr) // every_nth_frame │
│ 508 │ │ │
│ 509 │ │ if effective_length < self.n_sample_frames: │
│ 510 │ │ │ return self.getitem(random.randint(0, len(self.video_f │
╰──────────────────────────────────────────────────────────────────────────────╯
ZeroDivisionError: integer division or modulo by zero
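
(The ZeroDivisionError drops out of the frame-skipping arithmetic shown at dataset.py:505-507: round(native_fps / self.fps) becomes 0 whenever the configured fps is more than roughly double the clip's native frame rate, and the integer division by it then fails. A worked example with the numbers from this thread:)

native_fps = 10                                   # 10 fps clips converted from GIFs
config_fps = 24                                   # the config's default fps

every_nth_frame = round(native_fps / config_fps)  # round(0.4166...) == 0
# effective_length = total_frames // every_nth_frame   # -> ZeroDivisionError, as in the traceback

# a defensive version would clamp the step to at least one frame:
every_nth_frame = max(1, round(native_fps / config_fps))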

ExponentialML commented on May 27, 2024

@ImBadAtNames2019 I pushed another fix. Try again please.

I apologize for the inconvenience, as I'm not able to test at the moment, but the following fix should work.

ImBadAtNames2019 commented on May 27, 2024

@ImBadAtNames2019 I pushed another fix. Try again please.

I apologize for the inconvenience, as I'm not able to test at the moment, but the following fix should work.

No worries.

Testing right now.

ImBadAtNames2019 commented on May 27, 2024

@ImBadAtNames2019 I pushed another fix. Try again please.

I apologize for the inconvenience, as I'm not able to test at the moment, but the following fix should work.

Nope, I even tried changing videos and I still get these errors. I get 2 different errors: sometimes the one I showed you above, and sometimes this one:

2023-04-09 20:49:11.225804: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
04/09/2023 20:49:13 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/usr/local/lib/python3.9/dist-packages/torch/utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
/usr/local/lib/python3.9/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/usr/local/lib/python3.9/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Caching Latents.: 0% 0/2 [00:00<?, ?it/s]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/Text-To-Video-Finetuning/train.py:914 in <module> │
│ │
│ 911 │ parser.add_argument("--config", type=str, default="./configs/my_co │
│ 912 │ args = parser.parse_args() │
│ 913 │ │
│ ❱ 914 │ main(**OmegaConf.load(args.config)) │
│ 915 │
│ │
│ /content/Text-To-Video-Finetuning/train.py:603 in main │
│ │
│ 600 │ ) │
│ 601 │ │
│ 602 │ # Latents caching │
│ ❱ 603 │ cached_data_loader = handle_cache_latents( │
│ 604 │ │ cache_latents, │
│ 605 │ │ output_dir, │
│ 606 │ │ train_dataloader, │
│ │
│ /content/Text-To-Video-Finetuning/train.py:332 in handle_cache_latents │
│ │
│ 329 │ │ cache_save_dir = f"{output_dir}/cached_latents" │
│ 330 │ │ os.makedirs(cache_save_dir, exist_ok=True) │
│ 331 │ │ │
│ ❱ 332 │ │ for i, batch in enumerate(tqdm(train_dataloader, desc="Caching │
│ 333 │ │ │ │
│ 334 │ │ │ save_name = f"cached_{i}" │
│ 335 │ │ │ full_out_path = f"{cache_save_dir}/{save_name}.pt" │
│ │
│ /usr/local/lib/python3.9/dist-packages/tqdm/std.py:1178 in __iter__ │
│ │
│ 1175 │ │ time = self._time │
│ 1176 │ │ │
│ 1177 │ │ try: │
│ ❱ 1178 │ │ │ for obj in iterable: │
│ 1179 │ │ │ │ yield obj │
│ 1180 │ │ │ │ # Update and possibly print the progressbar. │
│ 1181 │ │ │ │ # Note: does not call self.update(1) for speed optimi │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:634 in │
│ __next__ │
│ │
│ 631 │ │ │ if self._sampler_iter is None: │
│ 632 │ │ │ │ # TODO(pytorch/pytorch#7675
│ 633 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 634 │ │ │ data = self._next_data() │
│ 635 │ │ │ self._num_yielded += 1 │
│ 636 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 637 │ │ │ │ │ self._IterableDataset_len_called is not None and │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py:678 in │
│ _next_data │
│ │
│ 675 │ │
│ 676 │ def _next_data(self): │
│ 677 │ │ index = self._next_index() # may raise StopIteration │
│ ❱ 678 │ │ data = self._dataset_fetcher.fetch(index) # may raise StopIt │
│ 679 │ │ if self._pin_memory: │
│ 680 │ │ │ data = _utils.pin_memory.pin_memory(data, self._pin_memor │
│ 681 │ │ return data │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in fetch │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py:51 │
│ in <listcomp> │
│ │
│ 48 │ │ │ if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__: │
│ 49 │ │ │ │ data = self.dataset.__getitems__(possibly_batched_index) │
│ 50 │ │ │ else: │
│ ❱ 51 │ │ │ │ data = [self.dataset[idx] for idx in possibly_batched_i │
│ 52 │ │ else: │
│ 53 │ │ │ data = self.dataset[possibly_batched_index] │
│ 54 │ │ return self.collate_fn(data) │
│ │
│ /content/Text-To-Video-Finetuning/utils/dataset.py:560 in __getitem__ │
│ │
│ 557 │ │ │
│ 558 │ │ prompt_ids = self.get_prompt_ids(prompt) │
│ 559 │ │ │
│ ❱ 560 │ │ return {"pixel_values": (video / 127.5 - 1.0), "prompt_ids": p │
│ 561 │
│ 562 class CachedDataset(Dataset): │
│ 563 │ def __init__(self, cache_dir: str = ''): │
╰──────────────────────────────────────────────────────────────────────────────╯
TypeError: unsupported operand type(s) for /: 'tuple' and 'float'

ImBadAtNames2019 commented on May 27, 2024

Maybe because I'm running it on Google Colab?

bruefire commented on May 27, 2024

I'm not using Colab, but I encountered the following same(?) error:
ZeroDivisionError: integer division or modulo by zero

In my case, I removed 'clip_path' items from the JSON file generated after preprocessing, and this allowed me to start the training successfully. I haven't finished the training yet, but it has progressed up to 1500 steps.
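
(For anyone who wants to script that workaround rather than editing by hand, a rough sketch; the top-level 'data' key is an assumption about the Video-BLIP2-Preprocessor output, so adjust it to your own JSON layout:)

import json

with open("./json/anime-v2.json") as f:       # path taken from the log later in the thread
    meta = json.load(f)

for item in meta.get("data", []):             # hypothetical structure: a list of per-video entries
    item.pop("clip_path", None)               # drop the 'clip_path' items that were removed by hand above

with open("./json/anime-v2-no-clips.json", "w") as f:
    json.dump(meta, f, indent=2)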

ExponentialML commented on May 27, 2024

@ImBadAtNames2019 Should be fixed now.

@bruefire Could you please post the error log if possible?

bruefire commented on May 27, 2024

@ExponentialML
No problem. Here is the log (sorry for the ugly path):
(venv) (base) PS E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning> python train.py --config .\configs\v2\my_train_config.yaml
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: [WinError 127] The specified procedure could not be found.
warn(f"Failed to load image Python extension: {e}")
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 1.13.1+cu117 with CUDA 1107 (you have 2.1.0.dev20230409+cu117)
Python 3.9.13 (you have 3.9.13)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\accelerate\accelerator.py:359: UserWarning: log_with=tensorboard was passed but no supported trackers are currently installed.
warnings.warn(f"log_with={log_with} was passed but no supported trackers are currently installed.")
04/10/2023 07:02:22 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
{'downsample_padding', 'mid_block_scale_factor'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into CLIPTextModel.
Loading JSON from ./json/anime-v2.json
Caching Latents.: 1%|▌ | 40/4064 [00:07<11:48, 5.68it/s]
Traceback (most recent call last):
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\train.py", line 914, in
main(**OmegaConf.load(args.config))
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\train.py", line 603, in main
cached_data_loader = handle_cache_latents(
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\train.py", line 338, in handle_cache_latents
batch['pixel_values'] = tensor_to_vae_latent(pixel_values, vae)
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\train.py", line 385, in tensor_to_vae_latent
latents = rearrange(latents, "(b f) c h w -> b c f h w", f=video_length)
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\einops\einops.py", line 483, in rearrange
return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\einops\einops.py", line 412, in reduce
return _apply_recipe(recipe, tensor, reduction_type=reduction)
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\einops\einops.py", line 235, in _apply_recipe
_reconstruct_from_shape(recipe, backend.shape(tensor))
File "E:\userdata\Documents\program\project\github\Text-To-Video-Finetuning\venv\lib\site-packages\einops\einops.py", line 199, in _reconstruct_from_shape_uncached
if isinstance(length, int) and isinstance(known_product, int) and length % known_product != 0:
ZeroDivisionError: integer division or modulo by zero

ExponentialML commented on May 27, 2024

@bruefire Interesting. Could you check to see if that specific video file is corrupt or plays at all? It seems everything goes well up until the 40th clip. If it is, I can implement some checks to ensure we can get past corrupt videos.

bruefire commented on May 27, 2024

@ExponentialML OK, but I have work to do, so I'll check it once I get back.

ImBadAtNames2019 commented on May 27, 2024

@bruefire Interesting. Could you check to see if that specific video file is corrupt or plays at all? It seems everything goes well up until the 40th clip. If it is, I can implement some checks to ensure we can get past corrupt videos.

Sorry, I went off to sleep. I tested it again and got the same error (ZeroDivisionError: integer division or modulo by zero), but then I tried changing the video folder dataset a second time, and this time it worked. So the problem now seems to be my dataset, but it's the exact same dataset that I used in the previous version, and it worked. I'm checking which specific videos are causing the problem.

ImBadAtNames2019 commented on May 27, 2024

I have no idea why the videos in my dataset are causing this problem; all of them are. I even tried processing the videos that do work through HandBrake and DaVinci (just like I did with the videos of my dataset that are causing this problem) and everything works just fine. I don't know; I will rebuild my dataset from zero and see what happens.

ImBadAtNames2019 commented on May 27, 2024

OK, I kind of figured it out. My dataset is made of short GIFs in mp4 format, 10 fps, and some of them don't even last a second. Decreasing the fps value to 10 and setting n_sample_frames to 2 in the config file fixed the issue for me. But why do I have to set it so low? If I set it higher than 2 I get the same error. How is it sampling frames? Is it sampling 1 frame every 10, or is it sampling 2 frames one after another?

ImBadAtNames2019 commented on May 27, 2024

I think the problem was caused by setting it to sample more frames than there actually are, but I didn't get this error in the previous version.

ImBadAtNames2019 commented on May 27, 2024

I'm trying to sample more than 2 frames but it just won't let me. If I set the fps lower than 10 or n_sample_frames higher than 2, I get this error: RecursionError: maximum recursion depth exceeded in comparison

I'm losing my mind.
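
(The RecursionError is consistent with the retry logic visible in the earlier traceback at dataset.py:509-510: when a clip yields fewer usable frames than n_sample_frames, __getitem__ calls itself again on a random index, so if every clip in the folder is too short the retries never terminate. A stripped-down toy version of that behaviour:)

import random

class ToyFolderDataset:
    """Toy stand-in for the folder dataset's retry-on-short-clip logic."""
    def __init__(self, clip_frame_counts, n_sample_frames):
        self.clip_frame_counts = clip_frame_counts
        self.n_sample_frames = n_sample_frames

    def __getitem__(self, index):
        if self.clip_frame_counts[index] < self.n_sample_frames:
            # retry another clip; if *all* clips are too short, this recurses forever
            return self[random.randint(0, len(self.clip_frame_counts) - 1)]
        return self.clip_frame_counts[index]

ds = ToyFolderDataset(clip_frame_counts=[9, 7, 8], n_sample_frames=12)
# ds[0]  # -> RecursionError: maximum recursion depth exceeded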

ImBadAtNames2019 commented on May 27, 2024

God, it's finally working. I just had to loop each video in the dataset until it reached 2 seconds in length, then I set the fps to 10 and n_sample_frames to 8.

ImBadAtNames2019 commented on May 27, 2024

Nope, there is something wrong with the way it's sampling the frames; that is what's causing the problem. The movement of the output is completely wrong.

bruefire commented on May 27, 2024

@bruefire Interesting. Could you check to see if that specific video file is corrupt or plays at all? It seems everything goes well up until the 40th clip. If it is, I can implement some checks to ensure we can get past corrupt videos.

I checked, and it seems that there are no damaged files, including the clipped videos.
But I noticed that an error occurs when there is a 'data.frame_index=n-1' item with 'num_frames=n' in the JSON.

JCBrouwer commented on May 27, 2024

@bruefire @ImBadAtNames2019 sorry, I think this is happening due to some assumptions in the VideoFolder dataset.

I've made it throw a clearer error if the videos are too short and also guarded against dividing by zero for low frame-rate videos.

I'm not quite sure it's solved both of your issues (as it seems like decord's VideoReader might be incorrectly reading the frame-rate of short GIFs?), but could you give this pull request a try and share your results?
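
(Roughly, the two guards described above amount to something like the following; this is a sketch of the idea, not the exact patch in the pull request:)

def check_clip(num_frames, native_fps, target_fps, n_sample_frames):
    # Clamp the frame step so low-frame-rate clips can never divide by zero.
    every_nth_frame = max(1, round(native_fps / target_fps))
    effective_length = num_frames // every_nth_frame
    if effective_length < n_sample_frames:
        # Fail with a readable message instead of recursing or crashing later.
        raise ValueError(
            f"Clip too short: {num_frames} frames at ~{native_fps} fps cannot supply "
            f"{n_sample_frames} samples at fps={target_fps}."
        )
    return every_nth_frame, effective_length

# check_clip(9, 10, 10, 12)  # -> ValueError instead of RecursionError / ZeroDivisionError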

ImBadAtNames2019 commented on May 27, 2024

@bruefire @ImBadAtNames2019 sorry, I think this is happening due to some assumptions in the VideoFolder dataset.

I've made it throw a clearer error if the videos are too short and also guarded against dividing by zero for low frame-rate videos.

I'm not quite sure it's solved both of your issues (as it seems like decord's VideoReader might be incorrectly reading the frame-rate of short GIFs?), but could you give this pull request a try and share your results?

I have no idea how to use a pull request. Can I just replace the lines of code modified in the "files changed" tab?

JCBrouwer commented on May 27, 2024

Yeah you can just replace those lines or use git:

git fetch origin pull/49/head:videofolder-fix
git checkout videofolder-fix

bruefire commented on May 27, 2024

@JCBrouwer
Thank you, but it didn't work well for me.
I encountered the same ZeroDivisionError.

ImBadAtNames2019 commented on May 27, 2024

Yeah you can just replace those lines or use git:

git fetch origin pull/49/head:videofolder-fix
git checkout videofolder-fix

Yes, now it's not giving me any error, but I'm still not sure if it's sampling the frames correctly. For example, if the video is 10 fps, and I set the fps value in the config file to 10 and n_sample_frames to 4, is it going to sample 4 frames in order, one after another (without skipping any frame), from a random part of the video? And if I do the same thing again but this time set the fps value to 5, is it still going to sample 4 frames, but this time skipping one frame in between each time, like one frame yes and one frame no? Did I get this right?

JCBrouwer commented on May 27, 2024

@bruefire Ahh, you're using the JSON dataset; the fix won't affect that. It seems to me that you're somehow loading in a video that has a length of 0.

@ImBadAtNames2019, yes your description is what the video loader should be doing. The fact that you were getting this error, though, makes me a little suspicious:

 │ 504 │ │ native_fps = vr.get_avg_fps() │
│ 505 │ │ every_nth_frame = round(native_fps / self.fps) │
│ 506 │ │ │
│ ❱ 507 │ │ effective_length = len(vr) // every_nth_frame │
│ 508 │ │ │
│ 509 │ │ if effective_length < self.n_sample_frames: │
│ 510 │ │ │ return self.__getitem__(random.randint(0, len(self.video_f │
╰──────────────────────────────────────────────────────────────────────────────╯
ZeroDivisionError: integer division or modulo by zero

Were you trying with an fps config that was higher than your 10 fps videos? Otherwise I think maybe vr.get_avg_fps() might have been returning a wrong value.
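
(Put concretely, using the arithmetic from the quoted snippet, for a 30-frame clip recorded at 10 fps:)

native_fps = 10
frames_in_clip = 30

step_at_10 = round(native_fps / 10)        # 1 -> consecutive frames, no skipping
step_at_5 = round(native_fps / 5)          # 2 -> every other frame ("one frame yes, one frame no")

effective_at_10 = frames_in_clip // step_at_10   # 30 usable positions
effective_at_5 = frames_in_clip // step_at_5     # 15 usable positions, still >= n_sample_frames = 4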

ImBadAtNames2019 commented on May 27, 2024

In the beginning, yes, I had the fps set to the default value of 24, and the videos in the dataset were at 10 fps. But then I changed the fps value to 10 and I was still getting problems; it wasn't letting me sample more than 2 frames. So then I increased the length of the videos to 2 seconds by looping them (they were GIFs originally); after that I was able to sample 12 frames (more than that would give me errors), but the movement of the video output (after fine-tuning) was completely wrong. I wrote above everything that happened. Now I'm at 72% progress with the updated script, using 10 fps and 8 n_sample_frames; let's see if I get better results.

ImBadAtNames2019 commented on May 27, 2024

I tested the fine-tuned model and there is no difference compared to the stock one, as if it didn't fine-tune at all. Here is my config file below; I don't know what I'm doing wrong, and I didn't have these problems with the previous version. My dataset is a folder containing 29 mp4 videos (720x720 resolution, 10 fps, 1-3 seconds long, not less than 1 second); each video has its own .txt file (named like the video, in the same folder) containing the prompt. I'm using a 40 GB NVIDIA A100 rented on Google Colab. @JCBrouwer

# Pretrained diffusers model path.
pretrained_model_path: "/content/drive/MyDrive/models/model_scope_diffusers" #https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/tree/main

# The folder where your training outputs will be placed.
output_dir: "./outputs"

# You can train multiple datasets at once. They will be joined together for training.
# Simply remove the line you don't need, or keep them all for mixed training.

# 'image': A folder of images and captions (.txt)
# 'folder': A folder of videos and captions (.txt)
# 'json': The JSON file created with automatic BLIP2 captions using https://github.com/ExponentialML/Video-BLIP2-Preprocessor
# 'single_video': A single video file.mp4 and text prompt
dataset_types: 
  - 'folder'

# Adds offset noise to training. See https://www.crosslabs.org/blog/diffusion-with-offset-noise
offset_noise_strength: 0.1
use_offset_noise: False

# When True, this extends all items in all enabled datasets to the highest length. 
# For example, if you have 200 videos and 10 images, 10 images will be duplicated to the length of 200. 
extend_dataset: False

# Caches the latents (Frames-Image -> VAE -> Latent) to a HDD or SDD. 
# The latents will be saved under your training folder, and loaded automatically for training.
# This both saves memory and speeds up training and takes very little disk space.
cache_latents: True

# If you have cached latents set to `True` and have a directory of cached latents,
# you can skip the caching process and load previously saved ones. 
cached_latent_dir: null #/path/to/cached_latents

# Train the text encoder. Leave at false to use LoRA only (Recommended).
train_text_encoder: False

# https://github.com/cloneofsimo/lora
# Use LoRA to train extra layers whilst saving memory. It trains both a LoRA & the model itself.
# This works slightly different than vanilla LoRA and DOES NOT save a separate file.
# It is simply used as a mechanism for saving memory by keeping layers frozen and training the residual.

# Use LoRA for the UNET model.
use_unet_lora: True

# Use LoRA for the Text Encoder.
use_text_lora: True

# The modules to use for LoRA. Different from 'trainable_modules'.
unet_lora_modules:
  - "ResnetBlock2D"

# The modules to use for LoRA. Different from `trainable_text_modules`.
text_encoder_lora_modules:
  - "CLIPEncoderLayer"

# The rank for LoRA training. With ModelScope, the maximum should be 1024. 
# VRAM increases with higher rank, lower when decreased.
lora_rank: 16

# Training data parameters
train_data:

  # The width and height in which you want your training data to be resized to.
  width: 384      
  height: 384

  # This will find the closest aspect ratio to your input width and height. 
  # For example, 512x512 width and height with a video of resolution 1280x720 will be resized to 512x256
  use_bucketing: True

  # The start frame index where your videos should start (Leave this at one for json and folder based training).
  sample_start_idx: 1

  # Used for 'folder'. The rate at which your frames are sampled. Does nothing for 'json' and 'single_video' dataset.
  fps: 10

  # For 'single_video' and 'json'. The number of frames to "step" (1,2,3,4) (frame_step=2) -> (1,3,5,7, ...).  
  frame_step: 5

  # The number of frames to sample. The higher this number, the higher the VRAM (acts similar to batch size).
  n_sample_frames: 8
  
  # 'single_video'
  single_video_path: "path/to/single/video.mp4"

  # The prompt when using a single video file
  single_video_prompt: ""

  # Fallback prompt if caption cannot be read. Enabled for 'image' and 'folder'.
  fallback_prompt: ''
  
  # 'folder'
  path: "/content/drive/MyDrive/Datasets/dataset_1"

  # 'json'
  json_path: 'path/to/train/json/'

  # 'image'
  image_dir: 'path/to/image/directory'

  # The prompt for all image files. Leave blank to use caption files (.txt) 
  single_img_prompt: ""

# Validation data parameters.
validation_data:

  # A custom prompt that is different from your training dataset. 
  prompt: "anime girl dancing"

  # Whether or not to sample preview during training (Requires more VRAM).
  sample_preview: True

  # The number of frames to sample during validation.
  num_frames: 16

  # Height and width of validation sample.
  width: 384
  height: 384

  # Number of inference steps when generating the video.
  num_inference_steps: 25

  # CFG scale
  guidance_scale: 9

# Learning rate for AdamW
learning_rate: 5e-6

# Weight decay. Higher = more regularization. Lower = closer to dataset.
adam_weight_decay: 1e-2

# Optimizer parameters for the UNET. Overrides base learning rate parameters.
extra_unet_params: null
  #learning_rate: 1e-5
  #adam_weight_decay: 1e-4

# Optimizer parameters for the Text Encoder. Overrides base learning rate parameters.
extra_text_encoder_params: null
  #learning_rate: 5e-6
  #adam_weight_decay: 0.2

# How many batches to train. Not to be confused with video frames.
train_batch_size: 1

# Maximum number of train steps. Model is saved after training.
max_train_steps: 2500

# Saves a model every nth step.
checkpointing_steps: 25000

# How many steps to do for validation if sample_preview is enabled.
validation_steps: 100

# Which modules we want to unfreeze for the UNET. Advanced usage.
trainable_modules:

  # If you want to ignore temporal attention entirely, remove "attn1-2" and replace with ".attentions"
  # This is for self attention. Activates for spatial and temporal dimensions if n_sample_frames > 1
  - "attn1"
  
  # This is for cross attention (image & text data). Activates for spatial and temporal dimensions if n_sample_frames > 1
  - "attn2"
  
  #  Convolution networks that hold temporal information. Activates for spatial and temporal dimensions if n_sample_frames > 1
  - 'temp_conv'


# Which modules we want to unfreeze for the Text Encoder. Advanced usage.
trainable_text_modules:
  - "all"

# Seed for validation.
seed: 64

# Whether or not we want to use mixed precision with accelerate
mixed_precision: "fp16"

# This seems to be incompatible at the moment.
use_8bit_adam: False 

# Trades VRAM usage for speed. You lose roughly 20% of training speed, but save a lot of VRAM.
# If you need to save more VRAM, it can also be enabled for the text encoder, but reduces speed x2.
gradient_checkpointing: False
text_encoder_gradient_checkpointing: False

# Xformers must be installed for best memory savings and performance (< Pytorch 2.0)
enable_xformers_memory_efficient_attention: False

# Use scaled dot product attention (Only available with >= Torch 2.0)
enable_torch_2_attn: True

JCBrouwer commented on May 27, 2024

OK @ImBadAtNames2019, I think the issues you were running into earlier should now fail more clearly with an error about the video files being too short. Judging by when you run into errors, I'd hazard a guess that your shortest video is about 1.2 seconds long.

Regarding fine-tuning not being very effective, I'd suggest trying to raise your learning rate and training for longer than 2500 steps. For me a learning rate of 1e-5 and weight_decay of 0 starts to give clearly tuned results after ~5000 steps.

JCBrouwer commented on May 27, 2024

Regarding the error you're running into @bruefire, it's probably going wrong in this function when the BLIP2 frame is too close to the end of the video to sample a full n_sample_frames at the frame_step.

If so, any idea what a good fix would be @ExponentialML ?
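
(One possible shape for such a fix, assuming the sampler knows the clip's num_frames, the BLIP2-captioned frame_index, n_sample_frames and frame_step; the names here are illustrative, not the repo's actual ones:)

def clamp_start(frame_index, num_frames, n_sample_frames, frame_step):
    # Pull the starting frame back so a full sampling window always fits in the clip.
    span = (n_sample_frames - 1) * frame_step
    last_valid_start = max(0, num_frames - 1 - span)
    return min(frame_index, last_valid_start)

# A caption anchored on the clip's last frame (index 99 of 100) with 8 samples at step 5
# would otherwise run off the end; clamping moves the start back to frame 64.
print(clamp_start(99, 100, 8, 5))   # 64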

ImBadAtNames2019 commented on May 27, 2024

OK @ImBadAtNames2019, I think the issues you were running into earlier should now fail more clearly with an error about the video files being too short. Judging by when you run into errors, I'd hazard a guess that your shortest video is about 1.2 seconds long.

Regarding fine-tuning not being very effective, I'd suggest trying to raise your learning rate and training for longer than 2500 steps. For me a learning rate of 1e-5 and weight_decay of 0 starts to give clearly tuned results after ~5000 steps.

I will try fine-tuning it with 5k steps, but I doubt it will make any difference. The output of the model fine-tuned with 2500 steps is identical to that of the stock model. Are you sure my config file above is OK? Maybe I didn't configure it properly. In the previous version the output was completely different even with fewer than 2500 steps.

ExponentialML commented on May 27, 2024

Regarding the error you're running into @bruefire, it's probably going wrong in this function when the BLIP2 frame is too close to the end of the video to sample a full n_sample_frames at the frame_step.

If so, any idea what a good fix would be @ExponentialML ?

It's tricky, but my recommendation would be to just return 1 frame. That way it will be trained on both the text encoder and attention layers, and the full-frame videos will go to the temporal dimension. If all else fails, skip to the next batch or grab a fallback frame when the dataset is instantiated.
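
(A rough illustration of that fallback, purely as a sketch and not the repo's code: when a full window does not fit, return a single frame so the spatial layers and text encoder still get a training signal:)

def sample_indices(start, num_frames, n_sample_frames, frame_step):
    # Preferred path: a full window of n_sample_frames spaced by frame_step.
    end = start + (n_sample_frames - 1) * frame_step
    if end < num_frames:
        return list(range(start, end + 1, frame_step))
    # Fallback suggested above: train on one frame rather than dropping the clip.
    return [min(start, num_frames - 1)]

print(sample_indices(2, 100, 8, 5))    # [2, 7, 12, 17, 22, 27, 32, 37]
print(sample_indices(95, 100, 8, 5))   # [95] -- the window would overrun, so fall back to one frame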
