
Comments (7)

YuxinWenRick commented on July 23, 2024

Thanks for pointing it out! Interestingly, it's an amazing paper from my colleagues in my lab. I will take a look and optimize the training process.

from hard-prompts-made-easy.

josejhlee commented on July 23, 2024

@ozanciga

Yes, either the noise (the paper i linked) or the prompt embedding. In "Null-text Inversion for Editing Real Images using Guided Diffusion Models", they do exactly what you wished for: optimize the null token at every step. Definitely a feasible direction. Thanks for sharing your take.


YuxinWenRick commented on July 23, 2024

Hi, sorry for the late response.

I believe it's feasible based on my previous attempts. However, the main issue is that the training doesn't converge, and it's hard to select the best prompt. This is because the diffusion time step is different at each optimization step, and selecting the best prompt based on loss isn't reliable.

That said, I think it's possible to overcome this by generating an image with the current prompt at each optimization step and then choosing the best prompt based on its distance to the target image. However, this approach could be computationally expensive.
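The selection step described above can be sketched in a few lines. This is a minimal numpy toy, not the repo's actual code: `generate` is a hypothetical stand-in for running the full diffusion sampler, and the candidate embeddings and target are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))          # stand-in for the target image

def generate(prompt_emb, noise):
    """Hypothetical stand-in for a full diffusion sampling run."""
    return noise + prompt_emb.mean()          # real code would call the pipeline

noise = rng.standard_normal((8, 8))           # fixed initial noise
candidates = [rng.standard_normal(16) for _ in range(5)]  # candidate prompt embeddings

# score each candidate by the distance of its generation to the target image,
# then keep the best one -- this is the expensive part, since each score
# requires a full generation
scores = [float(np.mean((generate(c, noise) - target) ** 2)) for c in candidates]
best = candidates[int(np.argmin(scores))]
```

The cost scales with the number of candidates times the cost of one full sampling run, which is why doing this at every optimization step is expensive.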

Thank you for bringing this to my attention. I will conduct further experiments and keep you updated.


josejhlee commented on July 23, 2024

It just shows how versatile the algorithm is. In this paper (https://arxiv.org/abs/2302.07121), the authors use the predicted clean image $\hat{x}_0$ (rather than $x_t$) to calculate the loss. Maybe this could be a way around the expensive computation. Thanks for sharing your opinion and for your work.
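For reference, the predicted clean image can be recovered in closed form from the noisy latent and the predicted noise via the standard DDPM forward relation, so no extra sampling is needed. A small sketch (toy tensors, not the paper's code):

```python
import numpy as np

def predict_x0(x_t, eps_hat, abar_t):
    """Recover the predicted clean image from the noisy latent and the
    predicted noise, using the DDPM forward relation
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

# sanity check: with the true noise, the true clean image is recovered exactly
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
abar_t = 0.3
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = predict_x0(x_t, eps, abar_t)
```

Computing the loss on `x0_hat` gives a signal in clean-image space at every timestep, without running the sampler to completion.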


ozanciga commented on July 23, 2024

hey i wanted to add my 2¢ to this, mostly because i'd love to see it work under a more general setting :)

from what i can surmise, @josejhlee is suggesting to find a gradient update which pushes the prompt to generate the ground-truth image:

model($f_{clip}$ + $\Delta$) $\sim$ image

which is found by minimizing the loss w.r.t. the difference between the generated image and the ground truth, mse($gen - gt$).

however, note that the same prompt embedding can generate a different image with a different initial noise, which i deliberately omitted above, i.e., model($f_{clip}$ + $\Delta$, $noise_{init}$).

therefore, when you are optimizing the updated objective above w/ the noise, you may need to optimize the noise as well, which of course is a more complicated task.
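The objective above can be sketched as plain gradient descent on $\Delta$ with the initial noise held fixed. This is a toy numpy illustration of the idea only: `W`, `f_clip`, `gt`, and `noise_init` are random stand-ins, and `model` is a hypothetical linear generator rather than a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) * 0.1   # frozen toy "generator" weights
f_clip = rng.standard_normal(4)         # fixed CLIP embedding of the prompt
gt = rng.standard_normal(4)             # ground-truth image (flattened, toy)
noise_init = rng.standard_normal(4)     # fixed initial noise

def model(emb, noise):
    # toy linear stand-in for model(f_clip + delta, noise_init)
    return W @ emb + 0.1 * noise

delta = np.zeros(4)
loss_before = float(np.mean((model(f_clip + delta, noise_init) - gt) ** 2))
for _ in range(500):
    resid = model(f_clip + delta, noise_init) - gt
    grad = 2.0 * W.T @ resid / resid.size    # d mse(gen - gt) / d delta
    delta -= 0.5 * grad
loss_after = float(np.mean((model(f_clip + delta, noise_init) - gt) ** 2))
```

With the noise fixed, this is a well-posed optimization; the complication raised above is that in the real setting the noise is another free variable.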

i know of a way to sorta recover the noise used from the generated image by inverting the process, e.g., here: https://www.reddit.com/r/StableDiffusion/comments/xboy90/a_better_way_of_doing_img2img_by_finding_the/

having taken a look at the paper jose referenced, it seems like there are some similarities w/ the above:

"The key idea of backward guidance is to optimize for a clean image that best matches the prompt based on zห†0, and linearly translate the guided change back to the noisy image space at step t."

i guess the clever bit is to come up with an objective that optimizes for the best prompt given a noisy image and the next noise level, where the error model $\epsilon$ is held constant, so we find a single prompt embedding that is optimized across all noise levels, such as

$x_t=x_{t-1}+\epsilon$ for all t.


YuxinWenRick commented on July 23, 2024

Hi @josejhlee, @ozanciga, apologies for my delayed response. I had a busy week.

I have updated the code and added an example of how to optimize the hard prompt through the diffusion model here. While it is still a work in progress, I believe it is effective. The current implementation randomly samples a time step at each optimization step and calculates the MSE loss between the predicted noise and the ground-truth noise added to $x_0$. Additionally, every 50 steps, the current prompt is fed to Stable Diffusion, and the CLIP score between the generated images and the target image is calculated to choose the best prompt.
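The loop described above has roughly this shape. This is a hedged numpy sketch of the structure only, not the repo's implementation: `eps_pred` is a toy linear stand-in for the U-Net's conditioned noise prediction, the cosine schedule is a placeholder, and the periodic evaluation scores with the current MSE instead of a real CLIP score.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
x0 = rng.standard_normal(16)                 # latent of the target image (toy)
W = rng.standard_normal((16, 8)) * 0.1       # frozen toy "denoiser" weights
emb = np.zeros(8)                            # soft prompt embedding to optimize
best_emb, best_score = emb.copy(), float("inf")

def eps_pred(x_t, emb):
    """Toy stand-in for the U-Net's noise prediction, conditioned on the prompt."""
    return 0.1 * x_t + W @ emb

for step in range(1, 201):
    t = int(rng.integers(1, T))              # sample a random diffusion timestep
    eps = rng.standard_normal(16)            # ground-truth noise added to x_0
    a = np.cos(0.5 * np.pi * t / T)          # toy cosine noise schedule
    x_t = a * x0 + np.sqrt(1.0 - a * a) * eps
    resid = eps_pred(x_t, emb) - eps         # predicted noise vs. true noise
    emb -= 0.1 * (2.0 * W.T @ resid / resid.size)  # MSE gradient step
    if step % 50 == 0:                       # periodic eval; real code uses CLIP score
        score = float(np.mean(resid ** 2))
        if score < best_score:
            best_score, best_emb = score, emb.copy()
```

Because the timestep (and thus the loss scale) changes every step, the raw loss is noisy, which is why the best prompt is tracked via the periodic evaluation rather than the per-step loss.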

I will keep optimizing the code, and any suggestions and pull requests are welcome!

Note: the current code requires ~20GB of GPU memory and ~10 mins for 1000 steps.


josejhlee commented on July 23, 2024

Thanks for your hard work, the insight from this experiment is well appreciated!

