
Comments (7)

YuxinWenRick commented on July 23, 2024

Thanks for pointing it out! Interestingly, it's an amazing paper from my colleagues in my lab. I will take a look and optimize the training process.

from hard-prompts-made-easy.

josejhlee commented on July 23, 2024

@ozanciga

Yes, either the noise (the paper i linked) or the prompt embedding. In "Null-text Inversion for Editing Real Images using Guided Diffusion Models", they do exactly what you wished for: optimize the null token at every step. Definitely a feasible direction. Thanks for sharing your take.


YuxinWenRick commented on July 23, 2024

Hi, sorry for the late response.

I believe it's feasible based on my previous attempts. However, the main issue is that the training doesn't converge, and it's hard to select the best prompt. This is because the diffusion time step is different at each optimization step, and selecting the best prompt based on loss isn't reliable.

That said, I think it's possible to overcome this by generating an image with the current prompt at each optimization step and then choosing the best prompt based on its distance to the target image. However, this approach could be computationally expensive.
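The selection step described above can be sketched in a few lines. This is a minimal numpy toy, not the repo's actual code: `generate` is a hypothetical stand-in for running the full diffusion sampler, and the candidate embeddings and target are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))          # stand-in for the target image

def generate(prompt_emb, noise):
    """Hypothetical stand-in for a full diffusion sampling run."""
    return noise + prompt_emb.mean()          # real code would call the pipeline

noise = rng.standard_normal((8, 8))           # fixed initial noise
candidates = [rng.standard_normal(16) for _ in range(5)]  # candidate prompt embeddings

# score each candidate by the distance of its generation to the target image,
# then keep the best one -- this is the expensive part, since each score
# requires a full generation
scores = [float(np.mean((generate(c, noise) - target) ** 2)) for c in candidates]
best = candidates[int(np.argmin(scores))]
```

The cost scales with the number of candidates times the cost of one full sampling run, which is why doing this at every optimization step is expensive.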

Thank you for bringing this to my attention. I will conduct further experiments and keep you updated.


josejhlee commented on July 23, 2024

It just shows how versatile the algorithm is. In this paper (https://arxiv.org/abs/2302.07121), the authors use the predicted clean image $\hat{x}_0$ (rather than $x_t$) to calculate the loss. Maybe this could be a way around the expensive computation. Thanks for sharing your opinion and for your work.
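For reference, the predicted clean image can be recovered in closed form from the noisy latent and the predicted noise via the standard DDPM forward relation, so no extra sampling is needed. A small sketch (toy tensors, not the paper's code):

```python
import numpy as np

def predict_x0(x_t, eps_hat, abar_t):
    """Recover the predicted clean image from the noisy latent and the
    predicted noise, using the DDPM forward relation
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

# sanity check: with the true noise, the true clean image is recovered exactly
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
eps = rng.standard_normal(16)
abar_t = 0.3
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x0_hat = predict_x0(x_t, eps, abar_t)
```

Computing the loss on `x0_hat` gives a signal in clean-image space at every timestep, without running the sampler to completion.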


ozanciga commented on July 23, 2024

hey i wanted to add my 2¢ to this, mostly because i'd love to see it work under a more general setting :)

from what i can surmise, @josejhlee is suggesting to find a gradient update which pushes the prompt to generate the ground-truth image:

model($f_{clip}$ + $\Delta$) $\sim$ image

which is found by minimizing the loss w.r.t. the difference between the generated image and the ground truth, mse($gen - gt$).

however, note that the same prompt embedding can generate a different image with a different initial noise, which i deliberately omitted above, i.e., model($f_{clip}$ + $\Delta$, $noise_{init}$).

therefore, when you are optimizing the updated objective above w/ the noise, you may need to optimize the noise as well, which of course is a more complicated task.
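The objective above can be sketched as plain gradient descent on $\Delta$ with the initial noise held fixed. This is a toy numpy illustration of the idea only: `W`, `f_clip`, `gt`, and `noise_init` are random stand-ins, and `model` is a hypothetical linear generator rather than a real diffusion model.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4)) * 0.1   # frozen toy "generator" weights
f_clip = rng.standard_normal(4)         # fixed CLIP embedding of the prompt
gt = rng.standard_normal(4)             # ground-truth image (flattened, toy)
noise_init = rng.standard_normal(4)     # fixed initial noise

def model(emb, noise):
    # toy linear stand-in for model(f_clip + delta, noise_init)
    return W @ emb + 0.1 * noise

delta = np.zeros(4)
loss_before = float(np.mean((model(f_clip + delta, noise_init) - gt) ** 2))
for _ in range(500):
    resid = model(f_clip + delta, noise_init) - gt
    grad = 2.0 * W.T @ resid / resid.size    # d mse(gen - gt) / d delta
    delta -= 0.5 * grad
loss_after = float(np.mean((model(f_clip + delta, noise_init) - gt) ** 2))
```

With the noise fixed, this is a well-posed optimization; the complication raised above is that in the real setting the noise is another free variable.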

i know of a way to sorta recover the noise used from the generated image by inverting the process, e.g., here: https://www.reddit.com/r/StableDiffusion/comments/xboy90/a_better_way_of_doing_img2img_by_finding_the/

having taken a look at the paper jose referenced, it seems like there are some similarities w/ the above:

"The key idea of backward guidance is to optimize for a clean image that best matches the prompt based on zห†0, and linearly translate the guided change back to the noisy image space at step t."

i guess the clever bit is to come up with an objective that optimizes for the best prompt given a noisy image and the next noise level, where the error model $\epsilon$ is held constant, so we find a single prompt embedding that is optimized across all noise levels, such as

$x_t=x_{t-1}+\epsilon$ for all t.


YuxinWenRick commented on July 23, 2024

Hi @josejhlee, @ozanciga, apologies for my delayed response. I had a busy week.

I have updated the code and added an example of how to optimize the hard prompt through the diffusion model here. While it is still a work in progress, I believe it is effective. The current implementation randomly samples a time step at each optimization step and calculates the MSE loss between the predicted noise and the ground-truth noise added to $x_0$. Additionally, every 50 steps, the current prompt is fed to Stable Diffusion, and the CLIP score between the generated images and the target image is calculated to choose the best prompt.
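The loop described above has roughly this shape. This is a hedged numpy sketch of the structure only, not the repo's implementation: `eps_pred` is a toy linear stand-in for the U-Net's conditioned noise prediction, the cosine schedule is a placeholder, and the periodic evaluation scores with the current MSE instead of a real CLIP score.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
x0 = rng.standard_normal(16)                 # latent of the target image (toy)
W = rng.standard_normal((16, 8)) * 0.1       # frozen toy "denoiser" weights
emb = np.zeros(8)                            # soft prompt embedding to optimize
best_emb, best_score = emb.copy(), float("inf")

def eps_pred(x_t, emb):
    """Toy stand-in for the U-Net's noise prediction, conditioned on the prompt."""
    return 0.1 * x_t + W @ emb

for step in range(1, 201):
    t = int(rng.integers(1, T))              # sample a random diffusion timestep
    eps = rng.standard_normal(16)            # ground-truth noise added to x_0
    a = np.cos(0.5 * np.pi * t / T)          # toy cosine noise schedule
    x_t = a * x0 + np.sqrt(1.0 - a * a) * eps
    resid = eps_pred(x_t, emb) - eps         # predicted noise vs. true noise
    emb -= 0.1 * (2.0 * W.T @ resid / resid.size)  # MSE gradient step
    if step % 50 == 0:                       # periodic eval; real code uses CLIP score
        score = float(np.mean(resid ** 2))
        if score < best_score:
            best_score, best_emb = score, emb.copy()
```

Because the timestep (and thus the loss scale) changes every step, the raw loss is noisy, which is why the best prompt is tracked via the periodic evaluation rather than the per-step loss.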

I will keep optimizing the code, and any suggestions and pull requests are welcome!

Note: the current code requires ~20GB of GPU memory and ~10 mins for 1000 steps.


josejhlee commented on July 23, 2024

Thanks for your hard work, the insight from this experiment is well appreciated!

