
ml-4m's Issues

VRAM Requirements and Multi-GPU Inference Support

Hello, thank you for your impressive research.
Could you please provide information on the amount of GPU VRAM required for each model? Additionally, if a single GPU does not have sufficient VRAM, is it possible to distribute the inference across multiple GPUs?
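As a generic note, one way to find out how much VRAM a particular checkpoint needs is to run a single forward pass and read PyTorch's peak-memory counters; the sketch below uses only standard torch calls, with a placeholder model standing in for a 4M checkpoint.

import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()      # placeholder; load the actual 4M checkpoint here
x = torch.randn(1, 1024, device='cuda')   # placeholder input shaped for the model above

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")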

Fine-tune using LoRA

Hi,

I would like to know if it is possible to fine-tune the model for a specific downstream task using LoRA.

I noticed that there is a file related to LoRA, fourm/models/lora_utils.py, but I could not find how it is used. It would be highly appreciated if you could provide a tutorial on how to use LoRA for fine-tuning. Thank you!
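As a generic illustration (not the API of fourm/models/lora_utils.py, which may differ), the sketch below shows a LoRA-wrapped linear layer implementing the standard low-rank update W x + (alpha / r) * B A x.

import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update (generic LoRA sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction; only lora_a / lora_b are trained.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrap a projection layer and fine-tune only the LoRA parameters.
layer = LoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))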

How to use semantic segmentation tokenizer for precomputing tokens for this modality?

I am working on precomputing tokens for each modality in my 4M training pipeline. I’m using grayscale semantic segmentation masks as input, but I’m encountering an issue where the regenerated output does not match the original mask.

[Screenshot: the regenerated output compared to the original mask]

This is the code I am using for precomputing tokens:

import torch
from PIL import Image
from torchvision import transforms

from fourm.vq.vqvae import VQVAE

transform = transforms.ToTensor()
resize = transforms.Resize((224, 224))

tok = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_semseg_4k_224-448').cuda()

tensors_b3hw = []
for image_path in selected_images:  # paths to the grayscale segmentation masks
    image = Image.open(image_path)
    rgb_b3hw = transform(resize(image)).unsqueeze(0)
    tensors_b3hw.append(rgb_b3hw)

stacked_tensors_b3hw = torch.cat(tensors_b3hw, dim=0).int()
squeezed_tensor = torch.squeeze(stacked_tensors_b3hw, dim=1)  # (B, H, W) masks

# Encode to discrete tokens, then decode back.
_, _, tokens = tok.encode(squeezed_tensor.cuda())
image_size = rgb_b3hw.shape[-1]
output_rgb_b3hw = tok.decode_tokens(tokens, image_size=image_size)

The output_rgb_b3hw tensor, which is the regenerated output, consists of 134 channels. However, this does not match the original mask that I passed to the tokenizer. I expected the output to have the same number of channels as the input mask.
Am I missing something in the preprocessing or tokenization process? Is there a step I need to adjust to ensure that the regenerated output matches the original mask in terms of channel dimensions?

Any guidance or suggestions would be appreciated. Thank you!

@garjania Please help me out. Am I doing something wrong here?
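As a generic note rather than an official answer: if the 134 channels are per-class logits from the decoder, one way to recover a single-channel mask for comparison is to take the argmax over the channel dimension. A one-line sketch, assuming that interpretation of the output:

# Assumption: output_rgb_b3hw has shape (B, 134, H, W) and holds per-class scores.
# The argmax over the channel dimension yields a (B, H, W) class-index map that can
# be compared against the original grayscale mask.
pred_mask_bhw = output_rgb_b3hw.argmax(dim=1)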

How to convert the trained FM .pth model file to safetensors format?

Thanks for providing the training source for the FM model. I notice there is a script, fourm/vq/__init__.py, for parsing the pre-trained tokenizers. However, there is no script that converts the trained FM .pth model file (a PyTorch checkpoint) into the safetensors format.
How should we deal with this situation?
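As a generic sketch using the safetensors package: the checkpoint layout below is an assumption, since training checkpoints often nest the weights under a key such as 'model'; adjust the key to whatever your .pth file actually contains.

import torch
from safetensors.torch import save_file

ckpt = torch.load('checkpoint.pth', map_location='cpu')
# Fall back to the checkpoint itself if it is already a flat state dict.
state_dict = ckpt.get('model', ckpt) if isinstance(ckpt, dict) else ckpt
# safetensors requires tensors, and contiguous ones at that.
state_dict = {k: v.contiguous() for k, v in state_dict.items() if isinstance(v, torch.Tensor)}
save_file(state_dict, 'model.safetensors')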

[Errno 2] No such file or directory: './fourm/utils/hmr2_utils/model_cfg.pkl'

Human pose dependencies are not installed, hence poses will not be visualized. To visualize them (optional), you can do the following:

  1. Install via pip install timm yacs smplx pyrender pyopengl==3.1.4
    You may need to follow the pyrender install instructions: https://pyrender.readthedocs.io/en/latest/install/index.html
  2. Download SMPL data from https://smpl.is.tue.mpg.de/. See https://github.com/shubham-goel/4D-Humans/ for an example.
  3. Copy the required SMPL files (smpl_mean_params.npz, SMPL_to_J19.pkl, smpl/SMPL_NEUTRAL.pkl) to fourm/utils/hmr2_utils/data.

I followed all the steps above but still got the error in the title.
Where is model_cfg.pkl?

Example of generating image pixels from ImageBind modality

Thanks for your excellent work!

I would like to ask if you could provide some examples or documentation on how to use 4M to generate images from ImageBind features or tokens. Your guidance on this matter would be greatly appreciated.

Thank you for your time and assistance.

Object Detection with Caption

First, thank you all for open sourcing this fantastic work.

I want to ask whether object detection with captions is feasible with this model, and if so, how can I use it?

Thank you in advance!

Typo for tokenizer_path arg

It seems that run_training_4m.py uses the arg text_tokenizer_path to define the path of the text tokenizer; however, the config files call this same variable tokenizer_path. I believe they were supposed to be the same.

Training details of RGB tokenizer

Thanks for your great work!

As mentioned in the 4M paper, you trained the RGB tokenizer with DiVAE, first for 100 epochs on ImageNet-1K and then for an additional 15 epochs on CC12M. I followed these training settings and used the checkpoint from 100 epochs of ImageNet-1K training as full_ckpt, but I encountered NaN losses when continuing training on CC12M. Could you please provide some suggestions to resolve the problem?

Thanks in advance!
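As a generic note rather than the authors' fix, two mitigations that often help when a resumed run produces NaN losses are gradient clipping and skipping non-finite batches; below is a self-contained sketch with toy stand-ins for the real model, optimizer, and CC12M loader.

import torch
from torch import nn

model = nn.Linear(8, 1)                                    # stand-in for DiVAE
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]  # stand-in for CC12M

for x, y in dataloader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        continue  # skip non-finite batches instead of corrupting the weights
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip exploding gradients
    optimizer.step()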

What are the minimum requirements to run an inference?

Hi,

I am attempting to run the model on my machine; however, the code keeps dying due to lack of memory, even though my machine has enough memory to load all of the files. Is there any way to know the minimum requirements needed to use the model? Thanks!

Input masks for generation - Potential small bug.

Looks like there may be a small bug in the generation:

eos_idx = torch.where(mod_dict[domain]['tensor'] == eos_id)[1][0].item()

The input mask for text is determined by the position of the eos token in the first batch element only, but it is then applied to all batch elements. Is this intentional? Generation seems to be commonly used with a batch size of 1 (as in the examples), so this may have fallen through the cracks. If not, I'd be curious about the intention here; otherwise I'm happy to make a PR.
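For reference, a per-row variant could look like the sketch below; this is an illustration rather than a patch against fm.py, and it assumes every row contains at least one eos token.

import torch

def first_eos_per_row(tensor_bl: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Return the index of the first eos token in each row of a (B, L) tensor."""
    # argmax returns the first position of the maximum, i.e. the first match
    # in each row once the boolean comparison is cast to int.
    return (tensor_bl == eos_id).int().argmax(dim=1)

# Example: two sequences whose eos token (id 2) sits at different positions.
seqs = torch.tensor([[5, 7, 2, 0, 0],
                     [5, 7, 7, 7, 2]])
print(first_eos_per_row(seqs, eos_id=2))  # tensor([2, 4])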

Great stuff btw, thanks for open sourcing this!

CLIPScore moved in latest torchmetrics v1.4.0.post0

Firstly, thank you for releasing an amazing repo.

I see in pyproject.toml that you have torchmetrics>=1.3.1. Since this is not pinned to an exact version, a freshly created env picks up the latest v1.4.0.post0, which breaks run_generation.py.

Changing the import from "from torchmetrics.multimodal import CLIPScore" to "from torchmetrics.multimodal.clip_score import CLIPScore" solves this issue.

I'm happy to raise a PR if that's helpful.

How to use RGB DiVAE tokenizer?

I am trying to encode and decode RGB images using the trained DiVAE checkpoint:

from fourm.vq.vqvae import DiVAE
from fourm.utils import denormalize, IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
from torchvision.transforms import Normalize

tok = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448')
normalize = Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD)

# rgb_b3hw: a batch of RGB images with shape (B, 3, H, W) and values in [0, 1]
# encode
_, _, tokens = tok.encode(normalize(rgb_b3hw))
# decode
rgb_b3hw = denormalize(tok.decode_tokens(tokens))

For these input images:
[attached: input images]

I get these decoded images:
[attached: decoded images]

I tried with and without RGB normalization; neither made a significant difference to the quality of the reconstruction.

What am I doing wrong? How should one use the tokenizer?

Thank you,

Question on Token Masking in 4M Implementation

Thank you for open sourcing your amazing work.

I have a question regarding the token masking implementation: https://github.com/apple/ml-4m/blob/main/fourm/models/fm.py#L429

While I understand setting the tokens to 0, I'm curious about masking the positional embeddings. If we mask both tokens and positional embeddings to 0, how does the model distinguish between different tokens? Wouldn't this cause the model to treat these tokens identically? Would it make sense to add position embeddings after masking?

We can use causal attention to remedy this, but I'm wondering if I've misinterpreted the token masking process. Could you clarify this approach? Thank you!
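To make the two orderings in question concrete, here is a minimal sketch (an illustration, not the code in fm.py) of masking after the positional embeddings have been added versus masking only the token embeddings and adding the positional embeddings afterwards.

import torch

B, L, D = 2, 8, 16
token_emb = torch.randn(B, L, D)
pos_emb = torch.randn(1, L, D)
mask = torch.randint(0, 2, (B, L, 1)).float()  # 1 = keep, 0 = mask out

# If both token and positional embeddings are zeroed, all masked positions
# become indistinguishable to the model.
x_masked_both = (token_emb + pos_emb) * mask

# If positional embeddings are added after masking, masked positions remain
# distinguishable by where they sit in the sequence.
x_pos_after_mask = token_emb * mask + pos_emb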

Depth tokenizer

Hi everyone, thanks for the nice work. I am considering using your pretrained depth tokenizer to extract precomputed (features) tokens for further training. I have some questions.

  1. I cloned ml-4m and installed the diffusers library. However, I get the error: AttributeError: module 'diffusers.models' has no attribute 'unet_2d_blocks'. Could you please specify the prerequisites for using your repo and which diffusers version you used?

  2. Also, how many tokens do we get from your pretrained checkpoint model?

  3. Is your uploaded pretrained depth tokenizer an encoder-decoder model, or an encoder-only model that would just give me the required tokens?

  4. What normalization did you use for the depth data?

Thanks a lot!

Is it possible to prompt 4M?

Many thanks for making this work publicly available.

My question is whether it is possible to prompt the available models, and if so, where I might find some examples of how to do this?

If not, do you plan to make this possible at some point?

Thanks in advance.
