
dynamicrafter's Introduction

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

     

Jinbo Xing, Menghan Xia*, Yong Zhang, Haoxin Chen, Wangbo Yu,
Hanyuan Liu, Xintao Wang, Tien-Tsin Wong*, Ying Shan


From CUHK and Tencent AI Lab.

Accepted at the European Conference on Computer Vision (ECCV) 2024 (Oral)

🔆 Introduction

🔥🔥 Training / Fine-tuning code is available NOW!!!

🔥 Our 576x1024 version ranks 1st on the I2V benchmark list from VBench!
🔥 Generative frame interpolation / looping video generation model weights (320x512) have been released!
🔥 New Update Rolls Out for DynamiCrafter! Better Dynamics, Higher Resolution, and Stronger Coherence!
🤗 DynamiCrafter can animate open-domain still images based on text prompts by leveraging pre-trained video diffusion priors. Please check our project page and paper for more information.

👀 Seeking comparisons with Stable Video Diffusion and PikaLabs? Click the image below.

1.1. Showcases (576x1024)

1.2. Showcases (320x512)

1.3. Showcases (256x256)

"bear playing guitar happily, snowing" "boy walking on the street"

2. Applications

2.1 Storytelling video generation (see project page for more details)

2.2 Generative frame interpolation

Input starting frame | Input ending frame | Generated video

2.3 Looping video generation

📝 Changelog

  • [2024.06.14]: 🔥🔥 Release training code for interpolation.
  • [2024.05.24]: Release WebVid10M-motion annotations.
  • [2024.05.05]: Release training code.
  • [2024.03.14]: Release generative frame interpolation and looping video models (320x512).
  • [2024.02.05]: Release high-resolution models (320x512 & 576x1024).
  • [2023.12.02]: Launch the local Gradio demo.
  • [2023.11.29]: Release the main model at a resolution of 256x256.
  • [2023.11.27]: Launch the project page and update the arXiv preprint.

🧰 Models

| Model | Resolution | GPU Mem. & Inference Time (A100, DDIM 50 steps) | Checkpoint |
|---|---|---|---|
| DynamiCrafter1024 | 576x1024 | 18.3GB & 75s (perframe_ae=True) | Hugging Face |
| DynamiCrafter512 | 320x512 | 12.8GB & 20s (perframe_ae=True) | Hugging Face |
| DynamiCrafter256 | 256x256 | 11.9GB & 10s (perframe_ae=False) | Hugging Face |
| DynamiCrafter512_interp | 320x512 | 12.8GB & 20s (perframe_ae=True) | Hugging Face |

Currently, DynamiCrafter supports generating videos of up to 16 frames at a resolution of up to 576x1024. Inference time can be reduced by using fewer DDIM steps (see the sketch below).

GPU memory consumption on an RTX 4090, as reported by @noguchis on Twitter: 18.3GB (576x1024), 12.8GB (320x512), 11.9GB (256x256).
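For example, a minimal sketch of lowering the step count, assuming your copy of scripts/run.sh passes a --ddim_steps 50 argument to the inference script (check first with grep; halving the steps roughly halves sampling time, at some cost in quality):

  # Check that the flag is present, then lower DDIM steps from 50 to 25:
  grep -n "ddim_steps" scripts/run.sh
  sed -i 's/--ddim_steps 50/--ddim_steps 25/' scripts/run.sh
  sh scripts/run.sh 1024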

⚙️ Setup

Install Environment via Anaconda (Recommended)

conda create -n dynamicrafter python=3.8.5
conda activate dynamicrafter
pip install -r requirements.txt
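
As an optional sanity check (not part of the official setup), you can confirm that PyTorch sees your GPU and that the pinned requirements resolved cleanly before moving on to inference:

# Optional: verify the environment before running inference.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
pip check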

💫 Inference

1. Command line

Image-to-Video Generation

  1. Download the pretrained models from Hugging Face, and put model.ckpt of the required resolution in checkpoints/dynamicrafter_[1024|512|256]_v1/model.ckpt (see the layout sketch after the commands below).
  2. Run the following commands in a terminal, based on your device and needs.
  # Run on a single GPU:
  # Select the model based on the required resolution, i.e., 1024|512|256:
  sh scripts/run.sh 1024
  # Run on multiple GPUs for parallel inference:
  sh scripts/run_mp.sh 1024
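
For reference, a possible checkpoint layout for the 1024 model (the download path below is illustrative; adjust it to wherever you saved model.ckpt):

  # Example checkpoint placement (paths are illustrative):
  mkdir -p checkpoints/dynamicrafter_1024_v1
  mv /path/to/downloaded/model.ckpt checkpoints/dynamicrafter_1024_v1/model.ckpt
  # Use dynamicrafter_512_v1 or dynamicrafter_256_v1 for the other resolutions.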

Generative Frame Interpolation / Looping Video Generation

Download pretrained model DynamiCrafter512_interp and put the model.ckpt in checkpoints/dynamicrafter_512_interp_v1/model.ckpt.

  sh scripts/run_application.sh interp # Generate frame interpolation
  sh scripts/run_application.sh loop   # Looping video generation

2. Local Gradio demo

Image-to-Video Generation

  1. Download the pretrained models and put them in the corresponding directories as described above.
  2. Run the following command in a terminal (choose a model based on the required resolution: 1024, 512, or 256).
  python gradio_app.py --res 1024

Generative Frame Interpolation / Looping Video Generation

Download the pretrained model and put it in the corresponding directory according to the previous guidelines.

  python gradio_app_interp_and_loop.py 

💥 Training / Fine-tuning

Image-to-Video Generation

  1. Download the WebVid dataset; the important fields in the .csv are page_dir, videoid, and name.
  2. Download the pretrained models and put them in the corresponding directories as described above.
  3. Change the <YOUR_SAVE_ROOT_DIR> path in configs/training_[1024|512]_v1.0/run.sh.
  4. Carefully check all paths in configs/training_[1024|512]_v1.0/config.yaml, including model:pretrained_checkpoint, data:data_dir, and data:meta_path.
  5. Run the following command in a terminal (choose a model based on the required resolution: 1024 or 512).

We adopt DDPShardedStrategy by default for training; please make sure it is available in your pytorch_lightning version.

  sh configs/training_1024_v1.0/run.sh ## fine-tune DynamiCrafter1024
  6. All checkpoints, TensorBoard records, and log info will be saved to <YOUR_SAVE_ROOT_DIR>.
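
Before launching, it can help to double-check that the config actually points at your data and checkpoint, and to monitor the run with TensorBoard. A quick optional sketch (key names as listed in step 4; this assumes the TensorBoard logs land under <YOUR_SAVE_ROOT_DIR>):

  # Optional: confirm the paths from step 4 are set in your copy of the config.
  grep -n "pretrained_checkpoint\|data_dir\|meta_path" configs/training_1024_v1.0/config.yaml
  # Monitor the run (assumes TensorBoard logs are written under the save root):
  tensorboard --logdir <YOUR_SAVE_ROOT_DIR>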

Generative Frame Interpolation

Download the pretrained model DynamiCrafter512_interp and put model.ckpt in checkpoints/dynamicrafter_512_interp_v1/model.ckpt. Follow the same fine-tuning procedure as in "Image-to-Video Generation" and run the script below:

sh configs/training_512_v1.0/run_interp.sh

🎁 WebVid-10M-motion annotations (~2.6M)

The annotations of our WebVid-10M-motion dataset are available on Hugging Face Datasets. In addition to the original annotations, we add three motion-related annotations: dynamic_confidence, dynamic_wording, and dynamic_source_category. Please refer to our supplementary document (Section D) for more details.
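
As a rough sketch, the annotation files can be fetched locally with the Hugging Face CLI; the dataset repo id below is a placeholder, so substitute the id shown on the linked dataset page:

  # Sketch: download the annotations (repo id is a placeholder, not the real id).
  pip install -U "huggingface_hub[cli]"
  huggingface-cli download --repo-type dataset <WEBVID10M_MOTION_REPO_ID> --local-dir ./webvid10m_motion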

🤝 Community Support

  1. ComfyUI and pruned models (bf16): ComfyUI-DynamiCrafterWrapper (Thanks to kijai)

| Model | Resolution | GPU Mem. | Checkpoint |
|---|---|---|---|
| DynamiCrafter1024 | 576x1024 | 10GB | Hugging Face |
| DynamiCrafter512_interp | 320x512 | 8GB | Hugging Face |

  2. ComfyUI: ComfyUI-DynamiCrafter (Thanks to chaojie)

  3. ComfyUI: ComfyUI_Native_DynamiCrafter (Thanks to ExponentialML)

  4. Docker: DynamiCrafter_docker (Thanks to maximofn)

👨‍👩‍👧‍👦 Crafter Family

VideoCrafter1: Framework for high-quality video generation.

ScaleCrafter: Tuning-free method for high-resolution image/video generation.

TaleCrafter: An interactive story visualization tool that supports multiple characters.

LongerCrafter: Tuning-free method for longer high-quality video generation.

MakeYourVideo, might be a Crafter:): Video generation/editing with textual and structural guidance.

StyleCrafter: Stylized-image-guided text-to-image and text-to-video generation.

😉 Citation

Please consider citing our paper if our code and dataset annotations are useful:

@article{xing2023dynamicrafter,
  title={DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors},
  author={Xing, Jinbo and Xia, Menghan and Zhang, Yong and Chen, Haoxin and Yu, Wangbo and Liu, Hanyuan and Wang, Xintao and Wong, Tien-Tsin and Shan, Ying},
  journal={arXiv preprint arXiv:2310.12190},
  year={2023}
}

🙏 Acknowledgements

We would like to thank AK (@_akhaliq) for helping set up the Hugging Face online demo, camenduru for providing the Replicate and Colab online demos, and Xinliang for his support of and contributions to the open-source project.

📢 Disclaimer

This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users.


dynamicrafter's People

Contributors

dailingx, doubiiu


dynamicrafter's Issues

Constructed dataset for motion control

Dear authors,

Thanks for your great work!

In your paper, you mentioned that you constructed a dataset for better motion control. Do you have plans to release the dataset and the model trained on it?

Best

Error of running inference on multiple GPUs

Hi @Doubiiu, thank you for the impressive work! I encountered an error when running sh scripts/run_mp.sh 1024:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)

Do you have any suggestions?

Gray stripes on both sides of the frame when using an image to create a video

Hello, thanks for your great work and for open-sourcing it.
I just created a 2s video from a 1152*832 image on a 2080Ti with 22GB of VRAM; it took 560s.

The motion is perfect, but sometimes imperfections appear, like the frames in this video:
two gray stripes, one on each side of the frame.
Original image:
2wandering

a_brown_mother_bear_looking_ahead._two_b.mp4

One more question: can I create longer videos, such as 3s or 4s?
Finally, I changed the format from H264 to H265 just by modifying func.py, because the H264 videos looked like still images (no motion effect).
So maybe you could make the video format a parameter that can be set in the config file.

Prompt-Image correspondence

To add a new prompt to test_prompts.txt, do I need to first check the alphabetical order of the image filenames and insert it at the corresponding line? Is there a better way? Thanks.

Unexpected key(s) in state_dict: "model.diffusion_model.framestride_embed.0.weight",

(S:\DynamiCrafter) S:\DynamiCrafter> python gradio_app.py --res 1024
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.
Traceback (most recent call last):
File "S:\DynamiCrafter\scripts\evaluation\funcs.py", line 106, in load_checkpoint
model.load_state_dict(state_dict, strict=full_strict)
File "S:\DynamiCrafter\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentVisualDiffusion:
Missing key(s) in state_dict: "scale_arr", "model.diffusion_model.fps_embedding.0.weight", "model.diffusion_model.fps_embedding.0.bias", "model.diffusion_model.fps_embedding.2.weight", "model.diffusion_model.fps_embedding.2.bias".
Unexpected key(s) in state_dict: "model.diffusion_model.framestride_embed.0.weight", "model.diffusion_model.framestride_embed.0.bias", "model.diffusion_model.framestride_embed.2.weight", "model.diffusion_model.framestride_embed.2.bias", "model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.4.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.5.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.6.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.7.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.10.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2.alpha".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "gradio_app.py", line 170, in
dynamicrafter_iface = dynamicrafter_demo(result_dir, args.res)
File "gradio_app.py", line 47, in dynamicrafter_demo
image2video = Image2Video(result_dir, resolution=resolution)
File "S:\DynamiCrafter\scripts\gradio\i2v_test.py", line 31, in init
model = load_model_checkpoint(model, ckpt_path)
File "S:\DynamiCrafter\scripts\evaluation\funcs.py", line 127, in load_model_checkpoint
load_checkpoint(model, ckpt, full_strict=True)
File "S:\DynamiCrafter\scripts\evaluation\funcs.py", line 118, in load_checkpoint
model.load_state_dict(new_pl_sd, strict=full_strict)
File "S:\DynamiCrafter\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LatentVisualDiffusion:
Missing key(s) in state_dict: "scale_arr".
Unexpected key(s) in state_dict: "model.diffusion_model.input_blocks.1.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.2.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.4.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.5.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.7.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.input_blocks.8.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.middle_block.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.3.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.4.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.5.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.6.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.7.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.8.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.9.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.10.1.transformer_blocks.0.attn2.alpha", "model.diffusion_model.output_blocks.11.1.transformer_blocks.0.attn2.alpha".

(S:\DynamiCrafter) S:\DynamiCrafter>

1024 model flashes gray

Please tell me why the video I generated using the 1024 model flashes gray. The resolution is 576x1024.

Is it possible to generate video with fixed pose?

Hi, thanks for your great work!
I wonder if I can use DynamiCrafter to generate a video with a fixed pose, for an application like the looping video generation shown above.
I would like the camera to be fixed, with no camera motion, so that we can focus only on the scene's motion.
How should we design the prompt (i.e., choose the image and the text prompt)?

Why isn't it more popular?

I've created several videos that seem to match the quality of those generated with SVD. Even though this project has a more permissive license, it's far less popular than SVD. I wonder why this could be? Is it the computational efficiency?

Question about UCF-101/MSRVTT evaluation in paper

Hi,

Thank you for sharing this great work. I'm interested in how you performed the evaluation on UCF-101 and MSR-VTT in the paper, in particular, how did you select the first frame to let the model condition on when generating the video and how did you select the real videos to compute FVD? My understanding is that we could randomly select a 16-frame video clip from the test videos and use this as the real video. dynamicrafter then generates a video based on the first frame of the selected video clip. This generated video is then compared against the real video. Is this the correct understanding? Thanks in advance.

Garbage results with frame count <16

Is it possible to generate interpolation at <16 frames? I see "up to 16 frames", but any time I try anything lower than 16 frames of interpolation, I get garbage:

image

This speeds up generation far more than lowering step count and I'd love to be able to make this work!

Related on the wrapper github: kijai/ComfyUI-DynamiCrafterWrapper#9

Evaluation on MSR-VTT test set

Hi, following up on the above discussion, can you tell us how you selected the 2048 samples for both datasets? When calculating FVD on the entire MSR-VTT test set, i.e., 2990 videos, I got a score of 328, which is higher than the reported value, so I was curious whether I am doing something wrong here.

Thanks.

Originally posted by @hiteshK03 in #6 (comment)

Questions about training

Hi, first all this is really great work. Thanks for releasing the code.

I have a few questions about the training.

  1. In the first stage of training, the query transformer P is trained on SD2.1. So I understand all weights of SD2.1 are fixed and only P is trained. At that stage, do you also train lambda = tanh(alpha)?
  2. The number of queries in P depends on the number of frames. When you train with one query using a T2I model and then move to the next stage to train on a T2V model, do you initialize the F (=16) queries by repeating the one query that you trained on the T2I model?
  3. SD2.1 is trained on 768x768 resolution using v-prediction as loss. So, do you train in the first stage P using 768x768 resolution and v-prediction? If yes, I suppose you used the LAION dataset to train it, right (as WebVid10M does not offer that resolution)?

You mention later that, to avoid learning shortcuts, you randomly select a video frame as the image condition.
Two questions about that:

  1. For the initial training stage on the T2I model, are you using the exact frame, and only for the last training stage, you randomly select a frame?
  2. Is the frame you use for conditioning always contained in the input video sequence x_t ? So you trained with 16 frames. The frame you used for conditioning, was it always contained in these 16 frames?

Training code

Congratulations on the wonderful work! The results are amazing!
I wonder if you have any plans to release the training code. (If so, could you tell me the approximate timeline?)

Thanks for your great work!

api

Is there an API?

CUDA out of memory on rtx 4090

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.31 GiB (GPU 0; 23.99 GiB total capacity; 11.33 GiB already allocated; 9.91 GiB free; 12.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

generate a longer video

Thanks for your wonderful work, but I have a question. I modified the inference_1024_v1.0.yaml file, changing the video_length parameter from 16 to 32, and when I ran gradio_app.py I got the error: "size mismatch for image_proj_model.latents: copying a param with shape torch.Size([1, 256, 1024]) from checkpoint, the shape in current model is torch.Size([1, 512, 1024])." Can you show me how to generate longer videos?

Question about the img and txt signal guidance

Thank you for the great work!

In the 4.1 Implementation Details section of your paper, you state that there are two guidance scales for text-conditioned image animation. I notice that in your released run.sh, the --multiple_cond_cfg flag is commented out. Is there any difference with or without --multiple_cond_cfg, and will the performance be better without it?

Thank you so much for the help!

Location of the input image in the generated video

Hi, is there a way to control the location of the input image in the generated video? e.g. if it is possible to always make it the first frame? I tried to look at the code and it seems that the input image is replicated 16 times and concatenated along the channel dimension of the noise, and it seems that there is no way to control it. I am wondering if this is the correct understanding. Thank you!

How did you train Query Transformer weights?

Hello, I have a question while reading your paper.
In the paper, you mention the use of a Query Transformer and Learnable Latent Vectors.
Upon closer examination, it appears that Learnable Latent Vectors consist of weights repeated 4 times for Perceiver Attn and FF.

As I presume, either:

  1. Separate training was done to enhance the image's details after passing through FrozenOpenCLIPImageEmbedderV2 and then through image_proj_stage_config, or
  2. Fine-tuning was done on the Spatial Attn (with the Temp Attn frozen) alongside the mentioned weights, without separate training.

If the first assumption is correct, I would like to know how you calculated the loss with the input image;
if the second is correct, I'm interested in understanding how you conducted the training.
Could you please provide detailed explanations about that Query Transformer's trained weights?

Arbitrary resolution

Hi, thank you so much for your great work!

May I ask how to adapt the model to images with arbitrary resolutions, for instance 512x768?

Thank you so much for your help!

Gradio takes a long time to run

When I run the 576x1024 model, it takes 86 seconds with the run.sh command, but 250 seconds with Gradio. Why is this happening?

Inquiry about num_frames and GPU memory consumption

Hi, thank you for open-sourcing such a nice work!

For the generation of a video at 256x256, it is stated that 20GB of GPU memory is required.
(1) I wonder how many frames are generated in that case.
(2) If I reduce the number of frames to generate, will it be possible to generate a video at a higher resolution than 256x256? Or will the results not be as good, since the model was trained on a 256x256 dataset?

Thank you in advance :)

Classifier free Guidance Training

Hi, once again, your work is really great, thanks for sharing all this and providing support.

In Section 4.1 of your paper, you describe using multi-condition classifier-free guidance.

I could not find any information in the paper about specific training for that purpose. So my question is: during training, did you randomly replace the input image to the CLIP image encoder in the "Dual-stream image injection" block with a zero image? If so, can you provide the probabilities used for that?

Allocation on device 0 would exceed allowed memory. (out of memory)

It was working fine before, but recently it suddenly started running out of GPU memory. I don't know what's causing it.

Error occurred when executing DynamiCrafterI2V:

Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 16.13 GiB
Requested : 25.31 GiB
Device limit : 24.00 GiB
Free (according to CUDA): 0 bytes
PyTorch limit (set by user-supplied memory fraction)
: 17179869184.00 GiB

File "E:\ComfyUI_windows_portable\ComfyUI\execution.py", line 151, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\execution.py", line 81, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\execution.py", line 74, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\nodes.py", line 256, in process
samples, _ = ddim_sampler.sample(S=steps,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\models\samplers\ddim.py", line 113, in sample
samples, intermediates = self.ddim_sampling(conditioning, size,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\models\samplers\ddim.py", line 186, in ddim_sampling
outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\models\samplers\ddim.py", line 222, in p_sample_ddim
e_t_cond = self.model.apply_model(x, t, c, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\models\ddpm3d.py", line 551, in apply_model
x_recon = self.model(x_noisy, t, **cond, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\models\ddpm3d.py", line 714, in forward
out = self.diffusion_model(xc, t, context=cc, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\networks\openaimodel3d.py", line 583, in forward
h = module(h, emb, context=context, batch_size=b, frame_window_size=frame_window_size, frame_window_stride=frame_window_stride)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\networks\openaimodel3d.py", line 41, in forward
x = layer(x, context)
^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\attention.py", line 311, in forward
x = block(x, context=context, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\attention.py", line 246, in forward
return checkpoint(self._forward, input_tuple, self.parameters(), self.checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\common.py", line 94, in checkpoint
return func(*inputs)
^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\attention.py", line 250, in _forward
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None, mask=mask) + x
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "E:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-DynamiCrafterWrapper\lvdm\modules\attention.py", line 125, in forward
sim = sim.softmax(dim=-1)

Is the 1024 model unable to run on a GPU with 24GB of VRAM?

I tried the 1024 model, and no matter how much I reduce video_length, I always get OOM.
I have seen other users asking about this on Reddit and other platforms, but no one has given a direct answer.
I hope the authors can state how much VRAM the 1024 model requires, or whether there is a way to reduce its VRAM usage.
Thanks.

output size question

Thank you for your wonderful work. I have a question: is the output size fixed? Can only 256x256, 320x512, and 576x1024 be used?

"perframe_ae" was not specified when calling with resolution 512 from gradio_app.py

Great work!

When I checked the code in run.sh and run_mp.sh, the "perframe_ae" option was specified when the resolution was other than 256, i.e. 512 and 1024. However, "perframe_ae" is not set at resolution 512 because only the values from the config file are applied when invoking from gradio_app.py.

I think adding "perframe_ae: True" to inference_512_v1.0.yaml is a good idea. What do you think?

FileExistsError: [WinError 183] Cannot create a file because it already exists: './dynamicrafter_1024_v1/'

Please help! If I download the file manually, I get the error below. If I don't, and let the code download it, it takes 5 hours!
How do I modify the code so it won't re-download the model?
Traceback (most recent call last):
File "gradio_app.py", line 170, in
dynamicrafter_iface = dynamicrafter_demo(result_dir, args.res)
File "gradio_app.py", line 47, in dynamicrafter_demo
image2video = Image2Video(result_dir, resolution=resolution)
File "S:\DynamiCrafter\scripts\gradio\i2v_test.py", line 16, in init
self.download_model()
File "S:\DynamiCrafter\scripts\gradio\i2v_test.py", line 97, in download_model
os.makedirs('./dynamicrafter_'+str(self.resolution[1])+'_v1/')
File "S:\DynamiCrafter\lib\os.py", line 223, in makedirs
mkdir(name, mode)
FileExistsError: [WinError 183] Cannot create a file because it already exists: './dynamicrafter_1024_v1/'

How to train DynamiCrafter model?

Hi, first of all, thank you for the work you've done. I have used DynamiCrafter extensively, and its performance is quite impressive in most scenarios. However, I would like to train DynamiCrafter again in certain areas. May I ask if you have any plans to further open source the training code?

Help: There is no ddim discretization method called "'uniform_trailing'"

Hello, thanks for your great work and for open-sourcing it.
I tried to run it on Windows 11 and ended up with the error below. (I converted run.sh to run.bat so that it can run on Windows.)
torch version: 2.0.0+cu118

Any solution or hint? Thanks.

(dynacraft) D:\Python310\dynacraft\DynamiCrafter-main>scripts\run.bat 1024
@dynamicrafter cond-Inference: 2024-02-11-16-04-50
Global seed set to 123
AE working on z of shape (1, 4, 32, 32) = 4096 dimensions.

model checkpoint loaded.
Inference with 16 frames
Prompts testing [rank:0] 8/8 samples loaded.
Sample Batch: 0it [02:05, ?it/s]
Traceback (most recent call last):
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation\inference.py", line 347, in
run_inference(args, gpu_num, rank)
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation\inference.py", line 295, in run_inference
batch_samples = image_guided_synthesis(model, prompts, videos, noise_shape, args.n_samples, args.ddim_steps, args.ddim_eta,
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation\inference.py", line 217, in image_guided_synthesis
samples, _ = ddim_sampler.sample(S=ddim_steps,
File "D:\Python310\dynacraft\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation....\lvdm\models\samplers\ddim.py", line 103, in sample
self.make_schedule(ddim_num_steps=S, ddim_discretize=timestep_spacing, ddim_eta=eta, verbose=schedule_verbose)
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation....\lvdm\models\samplers\ddim.py", line 25, in make_schedule
self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
File "D:\Python310\dynacraft\DynamiCrafter-main\scripts\evaluation....\lvdm\models\utils_diffusion.py", line 69, in make_ddim_timesteps
raise NotImplementedError(f'There is no ddim discretization method called "{ddim_discr_method}"')
NotImplementedError: There is no ddim discretization method called "'uniform_trailing'"
(dynacraft) D:\Python310\dynacraft\DynamiCrafter-main>
run.zip

AttributeError: 'VisionTransformer' object has no attribute 'input_patchnorm'

I get the following error message when I try to execute the method. Do you have any idea where this could be coming from?

"dynamicrafter/lvdm/modules/encoders/condition.py", line 339, in forward
    z = self.encode_with_vision_transformer(image)
"/dynamicrafter/lvdm/modules/encoders/condition.py", line 346, in encode_with_vision_transformer
    if self.model.visual.input_patchnorm:
AttributeError: 'VisionTransformer' object has no attribute 'input_patchnorm'

License

Hi,
Thanks for open-sourcing! The license is Apache; however, the readme says:

We develop this repository for RESEARCH purposes, so it can only be used for personal/research/non-commercial purposes.

Might it be possible to remove this, since the project is Apache licensed?
Thanks again for releasing this amazing work!

Frame sampling with different FPS

Hi,

Thanks for your wonderful work! I don't quite understand the video frame sampling process. You mentioned in the paper that the FPS is sampled between 5 and 30, so my understanding is that you first sample an FPS and then sample frames from the video at that FPS. Say we have a 10s video and an FPS of 16; how do you sample frames from the video?

Best

About AutoencoderKL

Did you train the AutoencoderKL yourselves for the latest version, or are you still using the official SD AutoencoderKL weights? Have you ever tried training an AutoencoderKL from scratch?

ModuleNotFoundError: No module named 'chardet'

(dynamicrafter) D:\AIGC\Github\dynamicrafter>sh scripts/run.sh 512

(dynamicrafter) D:\AIGC\Github\dynamicrafter>python gradio_app.py --res 512
Traceback (most recent call last):
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\requests\compat.py", line 11, in
import chardet
ModuleNotFoundError: No module named 'chardet'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "gradio_app.py", line 3, in
import gradio as gr
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio_init_.py", line 3, in
import gradio.simple_templates
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio_simple_templates_init
.py", line 1, in
from .simpledropdown import SimpleDropdown
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio_simple_templates\simpledropdown.py", line 6, in
from gradio.components.base import FormComponent
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio\components_init_.py", line 1, in
from gradio.components.annotated_image import AnnotatedImage
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio\components\annotated_image.py", line 8, in
from gradio_client.documentation import document, set_documentation_group
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio_client_init_.py", line 1, in
from gradio_client.client import Client
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\gradio_client\client.py", line 25, in
from huggingface_hub import CommitOperationAdd, SpaceHardware, SpaceStage
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\huggingface_hub_init_.py", line 370, in getattr
submod = importlib.import_module(submod_path)
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\importlib_init_.py", line 127, in import_module
return bootstrap.gcd_import(name[level:], package, level)
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\huggingface_hub\hf_api.py", line 46, in
import requests
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\requests_init
.py", line 45, in
from .exceptions import RequestsDependencyWarning
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\requests\exceptions.py", line 9, in
from .compat import JSONDecodeError as CompatJSONDecodeError
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\requests\compat.py", line 13, in
import charset_normalizer as chardet
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\charset_normalizer_init
.py", line 23, in
from charset_normalizer.api import from_fp, from_path, from_bytes, normalize
File "C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\charset_normalizer\api.py", line 10, in
from charset_normalizer.md import mess_ratio
File "charset_normalizer\md.py", line 5, in
from charset_normalizer.utils import is_punctuation, is_symbol, unicode_range, is_accentuated, is_latin,
ImportError: cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant' (C:\Users\zheng\anaconda3\envs\dynamicrafter\lib\site-packages\charset_normalizer\constant.py)

Does the input image need to have the same resolution as the generated video?

Great work, thanks for sharing. I noticed that three video resolutions can now be generated: 576x1024, 320x512, and 256x256. Will the results be better if the input reference image has the same resolution as the output video, or is there preprocessing in the code that automatically resizes the image to the video resolution? Thanks.
