
Comments (16)

PaulOrasan commented on August 18, 2024

I've now realised that my mistake was that I didn't do forward steps on the first frame latents.

The code is working fine now after doing 1000 forward steps on the first frame latents :). I'll try to post some results later.

Do you have any suggestions on how to tune hyperparameters such as the motion field strength, t0, and t1?

rob-hen commented on August 18, 2024

Hi @PaulOrasan,

The motion field strength controls the global motion (camera translation). The parameters t0 and t1 control the object motion you perceive (e.g., making a walking step): the larger the gap between t0 and t1, the more variance you will observe in the object motion.
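
For reference, here is a minimal sketch of how these knobs could be passed to the repo's high-level API. The parameter names (motion_field_strength_x/y, t0, t1) are assumptions taken from the sliders in the Gradio demo, and the values are illustrative defaults rather than tuning recommendations:

import torch
from model import Model  # text2video-zero's model.py

model = Model(device="cuda", dtype=torch.float16)

# Assumed signature, mirroring the options exposed in the Gradio demo (app.py).
# Larger motion_field_strength_* -> stronger global (camera) motion;
# a larger t1 - t0 gap -> more variance in the object motion.
model.process_text2video(
    prompt="a panda is walking down the street",
    motion_field_strength_x=12,
    motion_field_strength_y=12,
    t0=44,
    t1=47,
)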

Jaxkr commented on August 18, 2024

I've now realised that my mistake was that I didn't do forward steps on the first frame latents.

The code is working fine now after doing 1000 forward steps on the first frame latents :). I'll try to post some results later.

Hey @PaulOrasan, how did you do the 1000 forward steps on the first frame latents? I have a very similar setup to you but am using ControlNet to generate a walk animation.

I added a method to turn an image into latents:

import numpy as np
import PIL.Image
import torch
from diffusers import StableDiffusionControlNetPipeline


class StableDiffusionControlNetPipelineWithInitialImageSupport(
    StableDiffusionControlNetPipeline
):
    def encode_img_to_latents(self, image: PIL.Image.Image):
        img = np.array(image.convert("RGB"))
        img = img[None].transpose(0, 3, 1, 2)  # HWC -> NCHW with a batch dimension
        # Dividing by 127.5 (half of 255) maps pixel values to [0, 2];
        # subtracting one shifts them to [-1, 1], the range the VAE expects.
        img = torch.from_numpy(img).to(dtype=torch.float32) / 127.5 - 1.0
        masked_image_latents = self.vae.encode(
            img.to(device=self.device, dtype=torch.float16)
        ).latent_dist.sample()

        # Magic number source: https://github.com/huggingface/diffusers/issues/437#issuecomment-1241827515
        return masked_image_latents * 0.18215
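
As an aside, that magic number is the VAE's latent scaling factor; recent diffusers releases expose it on the VAE config, so (assuming such a diffusers version) the method's return line could instead read:

        # 0.18215 for the SD 1.x VAE; reading it from the config avoids the hard-coded constant
        return masked_image_latents * self.vae.config.scaling_factor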

Then, I consume it in model.py:

def process_controlnet_pose(self):
    # ... (prompt handling, pose-video preprocessing, etc.)
    image = PIL.Image.open('first.png')
    first_frame_latents = self.pipe.encode_img_to_latents(image)

    print(f"first frame latents size: {first_frame_latents.size()}")
    # f = number of frames, h/w = frame height/width (set earlier, elided above)
    latents = torch.randn((1, 4, h//8, w//8), dtype=self.dtype,
                          device=self.device, generator=self.generator)
    latents = latents.repeat(f, 1, 1, 1)
    print(f"latents size: {latents.size()}")
    print(f"latents[0]: {latents[0].size()}")
    # first_frame_latents has shape (1, 4, 64, 64); assignment broadcasts it into latents[0]
    latents[0] = first_frame_latents

with shapes:

video shape: torch.Size([19, 3, 512, 512])
first frame latents size: torch.Size([1, 4, 64, 64])
latents size: torch.Size([19, 4, 64, 64])
latents[0]: torch.Size([4, 64, 64])

However, the first frame is very blurry (like the examples you posted) and the second frame drifts significantly from the first:

(video attachment: asdfasdf.mp4)

@rob-hen: Is there any recommendation for conditioning the generation on the first frame? I checked the paper and all the code and couldn't find anything.

rob-hen commented on August 18, 2024

The forward step should be done using deterministic forward steps (not included in the current code) in order to recover the input image. For the text conditioning, some further adaptation must be done (like null-text inversion).

Jaxkr commented on August 18, 2024

The forward step should be done using deterministic forward steps (not included in the current code) in order to recover the input image. For the text conditioning, some further adaptation must be done (like null-text inversion).

Thank you for the reply! Could you point me towards an implementation of deterministic forward steps? I have tried Googling extensively and also asking ChatGPT (but this stuff is too new for it).

Additionally, for null-text inversion, are you referring to https://arxiv.org/pdf/2211.09794.pdf?

rob-hen commented on August 18, 2024

@Jaxkr The paper you found is correct. It also shows the deterministic DDIM inversion (which I was referring to).

Jaxkr commented on August 18, 2024

https://huggingface.co/docs/diffusers/api/schedulers/ddim_inverse
Working with this. Will let you know how it goes.
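
For anyone following along, here is a minimal, hedged sketch of deterministic DDIM inversion with diffusers' DDIMInverseScheduler: it runs the UNet over the inverse timestep schedule to turn clean image latents (e.g., the output of encode_img_to_latents above) into noised latents that the regular DDIM backward pass can approximately reconstruct. The model ID and step count are illustrative and not taken from this thread.

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)


@torch.no_grad()
def ddim_invert(latents, prompt, num_steps=50):
    # Encode the prompt with the pipeline's text encoder. Guidance scale 1
    # (no classifier-free guidance) keeps the inversion close to exact.
    tokens = pipe.tokenizer(
        prompt, padding="max_length", max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    latents = latents.to(device=device, dtype=pipe.unet.dtype)
    inverse_scheduler.set_timesteps(num_steps, device=device)
    for t in inverse_scheduler.timesteps:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
        latents = inverse_scheduler.step(noise_pred, t, latents).prev_sample
    return latents  # noised latents to start the DDIM backward pass from

Null-text inversion (the paper linked above) then optimizes the unconditional embedding per timestep so that classifier-free guidance can be used at sampling time without drifting away from this inverted trajectory.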

PaulOrasan commented on August 18, 2024

Hi @Jaxkr, sorry for replying late.

I did DDPM forward because it gives good-quality results in terms of resolution and such (no blurriness). It does work for conditioning the video generation on the first frame, but as expected it's not a strong guidance: the video does not preserve the exact original image.
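
In diffusers terms, this DDPM forward step amounts to adding scheduler noise to the clean first-frame latents at a chosen timestep via the closed-form forward process. A small sketch; the timestep and variable names are illustrative, and the stand-in latents take the place of the encoded first frame from the snippet above:

import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

# Stand-in for the encoded first-frame latents from encode_img_to_latents.
first_frame_latents = torch.randn(1, 4, 64, 64)

noise = torch.randn_like(first_frame_latents)
t_start = torch.tensor([999])  # near the end of the 1000-step forward process
# add_noise applies the closed-form q(x_t | x_0), i.e. all forward steps at once.
noisy_first_frame = scheduler.add_noise(first_frame_latents, noise, t_start)
# Use noisy_first_frame as latents[0]. Because this forward process is stochastic,
# the backward pass will not reproduce the original image exactly (see below).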

Depending on how closely related the provided image is to SD training data, it could produce pretty reasonable representations of the original image or totally divergent outputs.

I was planning to incorporate the null-text inversion also mentioned by @rob-hen and see how it works out, but things are moving slowly on my end due to time constraints. Let me know if it works out for you! :D

zhouliang-yu commented on August 18, 2024

@PaulOrasan Hey, how's the performance after applying null-text inversion?

rob-hen commented on August 18, 2024

@PaulOrasan DDPM forward is stochastic and normally cannot be inverted, so you will not be able to reproduce the first frame. You should use DDIM forward and DDIM backward.

There should not be issues with quality or blurriness when using DDIM.

PaulOrasan commented on August 18, 2024

@zhouliang-yu hi there, sorry for replying so late, I've been away from this GitHub account. It's still a WIP and full of hard-coded stuff; I haven't had much time to attend to it. It can reconstruct the original image with the null-text optimization, but the generated output loses some of its consistency.

@rob-hen thanks for the tips. I only did DDPM forward initially to get it to generate good-quality samples, which works. I'm now using DDIM forward and the null-text optimization for the DDIM backward.

What I noticed is that the output loses quite a lot of consistency (foreground objects tend to become a bit deformed); do you have any intuition as to what might cause this? I have a feeling that I somehow messed up the cross-frame attention.
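
For context when debugging this: cross-frame attention in Text2Video-Zero makes every frame's self-attention attend to the keys and values of the first (anchor) frame, which is what keeps the appearance consistent across frames. Below is a conceptual PyTorch sketch, not the repo's actual attention-processor code, that assumes PyTorch 2.0 for scaled_dot_product_attention and can serve as a reference when sanity-checking your own implementation:

import torch
import torch.nn.functional as F


def cross_frame_attention(q, k, v):
    # q, k, v: (frames, heads, tokens, head_dim) self-attention projections for
    # all frames of one video. Every frame attends to the keys/values of frame 0.
    f = q.shape[0]
    k0 = k[:1].expand(f, -1, -1, -1)  # broadcast frame 0's keys to all frames
    v0 = v[:1].expand(f, -1, -1, -1)  # broadcast frame 0's values to all frames
    return F.scaled_dot_product_attention(q, k0, v0)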

adhityaswami commented on August 18, 2024

Sorry to re-open this. I'm trying to condition a video on an initial image, but I'm struggling to figure out how to do either the DDPM or the DDIM forward passes. I've spent a ton of time trying to understand this, but seem to have hit a dead end. Can I have some help figuring this out? I'm currently passing my own latents and getting a blurry, weird video similar to the ones posted above in this thread.

rob-hen commented on August 18, 2024

Have you tried doing the forward DDIM and backward DDIM with vanilla Stable Diffusion (on images)? Once you get the correct results there, you should be able to use it for T2V-Zero.
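
A hedged sketch of that sanity check, reusing pipe and the ddim_invert helper from the earlier snippet: invert clean image latents, then run the regular DDIM backward pass from the inverted latents and compare the decoded result with the input image.

import torch


@torch.no_grad()
def ddim_roundtrip(image_latents, prompt, num_steps=50):
    # Forward: clean latents -> noised latents via deterministic DDIM inversion.
    inverted = ddim_invert(image_latents, prompt, num_steps=num_steps)

    # Backward: denoise from the inverted latents with the regular DDIM sampler,
    # again with guidance_scale=1.0 so forward and backward stay (nearly) inverse.
    result = pipe(
        prompt=prompt,
        latents=inverted,
        num_inference_steps=num_steps,
        guidance_scale=1.0,
    )
    return result.images[0]  # should closely match the original input image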

adhityaswami commented on August 18, 2024

Hey, thanks for the response! I managed to figure out how to make the DDPM forward approach work like @PaulOrasan mentioned, and it's decent-ish. My next step is to implement a DDIM forward function to use instead of the DDPM forward pass that currently works. I'll try this out, but if there are any code snippets/samples or links that could help me implement the DDIM forward pass in the meantime, that would be super helpful.

By the way, this repo works wonderfully. Thanks a lot, @rob-hen!

adhityaswami commented on August 18, 2024

@rob-hen I was able to make this work. The reconstruction is pretty close, but not 100%, even when using SD generated images. Thanks a lot for your help!

rob-hen commented on August 18, 2024

@adhityaswami here you see what you can expect from DDIM inversion.
