I am trying to encode and decode RGB images using the trained DiVAE checkpoint:


How to use RGB DiVAE tokenizer? about ml-4m HOT 3 CLOSED

apple commented on September 14, 2024 1

How to use RGB DiVAE tokenizer?

from ml-4m.

Comments (3)

alexanderswerdlow commented on September 14, 2024 1

Same question here!


garjania commented on September 14, 2024

Hi @shaibagon @alexanderswerdlow

Regarding your script: the tokenizer was trained on inputs normalized with IMAGENET_INCEPTION_MEAN and IMAGENET_INCEPTION_STD. For correct tokenization and reconstruction, use these two values instead of the standard ImageNet ones when normalizing and denormalizing.
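To illustrate why the choice of statistics matters, here is a minimal NumPy sketch. It assumes the Inception constants are 0.5 per channel for both mean and std, as timm defines them; the values exported by fourm.utils should be checked rather than taken from here. The `normalize` and `denormalize` helpers below are local stand-ins, not the fourm or torchvision functions.

```python
import numpy as np

# Assumed values: timm defines both Inception constants as 0.5 per channel.
# Verify against fourm.utils before relying on them.
INCEPTION_MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
INCEPTION_STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)
# Standard ImageNet statistics, shown only for comparison.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize(img_b3hw, mean, std):
    """Channel-wise (x - mean) / std for a B x 3 x H x W array."""
    return (img_b3hw - mean[None, :, None, None]) / std[None, :, None, None]


def denormalize(img_b3hw, mean, std):
    """Inverse of normalize: x * std + mean."""
    return img_b3hw * std[None, :, None, None] + mean[None, :, None, None]


rng = np.random.default_rng(0)
rgb = rng.random((2, 3, 224, 224), dtype=np.float32)  # values in [0, 1)

x_inception = normalize(rgb, INCEPTION_MEAN, INCEPTION_STD)
x_imagenet = normalize(rgb, IMAGENET_MEAN, IMAGENET_STD)

# With the assumed constants, [0, 1] maps to [-1, 1]; ImageNet stats give a wider range.
print("inception range:", x_inception.min(), x_inception.max())
print("imagenet range: ", x_imagenet.min(), x_imagenet.max())

# Denormalizing with mismatched statistics does not recover the input.
wrong = denormalize(x_inception, IMAGENET_MEAN, IMAGENET_STD)
print("max error with mismatched stats:", np.abs(wrong - rgb).max())
```

The same mismatch applies in both directions: the statistics used before `encode` and after `decode_tokens` have to be the ones the tokenizer was trained with.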

Note that the tokenizer only supports resolutions between 224 and 448, and it might not work for resolutions outside this range. You also need to pass the image size to the decoder: since the RGB tokenizer uses a diffusion decoder, it needs the image size to sample the initial noise at the correct resolution. So overall the script should look like this:

from fourm.vq.vqvae import DiVAE
from fourm.utils import denormalize, IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD
from torchvision.transforms import Normalize

tok = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448').cuda()
normalize = Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)

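# encode: rgb_b3hw is the input batch of shape (B, 3, H, W)
_, _, tokens = tok.encode(normalize(rgb_b3hw).cuda())

# decode: the diffusion decoder needs the image size to sample the initial noise
image_size = rgb_b3hw.shape[-1]
rgb_b3hw = tok.decode_tokens(tokens, image_size=image_size)
# map the reconstruction back from the normalized range
rgb_b3hw = denormalize(rgb_b3hw, mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)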
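Since the supported range is 224 to 448, it may help to validate the input shape before encoding. The helper below is a hypothetical sketch and not part of fourm. It also assumes square inputs, only because the script above derives image_size from the last dimension alone.

```python
def check_divae_input(shape, min_size=224, max_size=448):
    """Validate a (B, 3, H, W) shape and return the image size for the decoder.

    Hypothetical helper. The square-image requirement is an assumption made
    here, not something the tokenizer is documented to enforce.
    """
    if len(shape) != 4 or shape[1] != 3:
        raise ValueError(f"expected (B, 3, H, W), got {tuple(shape)}")
    height, width = shape[2], shape[3]
    if height != width:
        raise ValueError(f"expected a square image, got {height}x{width}")
    if not min_size <= height <= max_size:
        raise ValueError(
            f"resolution {height} is outside the supported range "
            f"[{min_size}, {max_size}]"
        )
    return height


# Example with plain tuples; with a tensor, pass rgb_b3hw.shape instead.
print(check_divae_input((4, 3, 256, 256)))
```

In the script above, `image_size = check_divae_input(rgb_b3hw.shape)` could replace the direct `shape[-1]` lookup.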

Also note that by default the diffusion decoder uses 1000 timesteps to decode the tokens, which is unnecessary during inference. You can decode in 50 steps, which is faster, by passing the timesteps argument:

tok.decode_tokens(tokens, image_size=image_size, timesteps=50)
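To measure how much time the shorter schedule saves on a given machine, one option is a small timing loop. This is a sketch only: `time_decode` and its arguments are hypothetical names, and the stand-in function below merely simulates work that grows with the step count.

```python
import time


def time_decode(decode_fn, timesteps_list, sync_fn=None):
    """Return {timesteps: seconds} for one call of decode_fn per setting.

    decode_fn takes the number of timesteps as its only argument.
    sync_fn, if given, is called before and after each run; on a GPU this
    would be torch.cuda.synchronize so that queued kernels are counted.
    """
    results = {}
    for timesteps in timesteps_list:
        if sync_fn is not None:
            sync_fn()
        start = time.perf_counter()
        decode_fn(timesteps)
        if sync_fn is not None:
            sync_fn()
        results[timesteps] = time.perf_counter() - start
    return results


# Stand-in workload so the sketch runs without a model or GPU.
def stand_in_decode(timesteps):
    return sum(range(timesteps * 100))


timings = time_decode(stand_in_decode, [50, 1000])
for steps, seconds in timings.items():
    print(f"{steps} steps: {seconds:.6f} s")
```

With the real tokenizer, `decode_fn` would be something like `lambda t: tok.decode_tokens(tokens, image_size=image_size, timesteps=t)`, with `sync_fn=torch.cuda.synchronize`.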

Hope this helps.


shaibagon commented on September 14, 2024

@garjania - works like a charm!
Using 50 diffusion steps:

Using full 1000 steps:

As you said, running the diffusion for the full 1000 steps does not make much of a difference.
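One way to put a number on this comparison is PSNR against the original image for each reconstruction. The sketch below uses synthetic arrays in place of real reconstructions and assumes denormalized values in [0, 1]. Because the decoder starts from sampled noise, two runs may also differ slightly for reasons unrelated to the step count.

```python
import numpy as np


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Synthetic stand-ins: in practice these would be the original image and the
# denormalized outputs of decode_tokens with timesteps=50 and timesteps=1000.
rng = np.random.default_rng(0)
original = rng.random((1, 3, 224, 224))
recon_50 = np.clip(original + rng.normal(0.0, 0.02, original.shape), 0.0, 1.0)
recon_1000 = np.clip(original + rng.normal(0.0, 0.02, original.shape), 0.0, 1.0)

print("PSNR, 50 steps:  ", psnr(original, recon_50))
print("PSNR, 1000 steps:", psnr(original, recon_1000))
```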

