

ViT-Lens

Project Homepage | arXiv (ViT-Lens) | arXiv (ViT-Lens-2)

TL;DR: We present ViT-Lens, an approach for advancing omni-modal representation learning by leveraging a pretrained ViT with a modality Lens to comprehend diverse modalities.

[Figure: ViT-Lens omni-modal overview]

[Figure: ViT-Lens capabilities]

πŸ“’ News

  • [2023.12.13] We release training code and models of ViT-Lens.
  • [2023.11.28] We upgrade ViT-Lens with added modalities and applications. Stay tuned for the release of code and models [arXiv paper].
  • [2023.08.22] We release the arXiv paper, inference code, and checkpoints for 3D [arXiv paper].

πŸ“ Todo

  • Models for more modalities.
  • Code for ViT-Lens integration with InstructBLIP and SEED.
  • Online demo for ViT-Lens integration with InstructBLIP and SEED.

πŸ”¨ Installation

conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens

# Install PyTorch >= 1.9.0 (the command below installs 1.11.0 with CUDA 11.3)
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt
Environment setup for training/inference on OpenShape Triplets (3D point clouds):
conda create -n vit-lens python=3.8.8 -y
conda activate vit-lens
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch -y
conda install -c dglteam/label/cu113 dgl -y

# Install ViT-Lens
git clone https://github.com/TencentARC/ViT-Lens.git
cd ViT-Lens/
pip install -e vitlens/
pip install -r vitlens/requirements-training.txt
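
To verify either environment after installation, a quick import check can be run. This is a minimal sketch; it assumes the editable install exposes the open_clip and mm_vit_lens packages used in the Usage section below.

# Minimal post-install sanity check (a sketch; assumes `pip install -e vitlens/`
# provides the open_clip and mm_vit_lens packages used in the Usage example).
import torch
from open_clip import ModalityType
from mm_vit_lens import ViTLens  # import only; no checkpoint is loaded here

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("modalities:", ModalityType.IMAGE, ModalityType.TEXT, ModalityType.AUDIO, ModalityType.PC)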

πŸ” ViT-Lens Model

| Model | MN40 | SUN.D | NYU.D | Audioset | VGGSound | ESC50 | Clotho | AudioCaps | TAG.M | IN.EEG | Download |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageBind (Huge) | - | 35.1 | 54.0 | 17.6 | 27.8 | 66.9 | 6.0/28.4 | 9.3/42.3 | - | - | - |
| ViT-Lens-L | 80.6 | 52.2 | 68.5 | 26.7 | 31.7 | 75.9 | 8.1/31.2 | 14.4/54.9 | 65.8 | 42.7 | vitlensL |

We release a one-stop ViT-Lens-L model (based on Large ViT) and show its performance on ModelNet40 (MN40, top1 accuracy), SUN RGBD Depth-only (SUN.D, top1 accuracy), NYUv2 Depth-only (NYU.D, top1 accuracy), Audioset (Audioset, mAP), VGGSound (VGGSound, top1 accuracy), ESC50 (ESC50, top1 accuracy), Clotho (Clotho, R@1/R@10), AudioCaps (AudioCaps, R@1/R@10), TAG.M (Touch-and-Go Material, top1 accuracy) and IN.EEG (ImageNet EEG, top1 accuracy). ViT-Lens consistently outperforms ImageBind.

For more model checkpoints (trained on different data or with better performance), please refer to MODEL_ZOO.md.

πŸ“š Usage

  • You may set the paths for your own project in constants.py.
  • We provide an API (source file) and an example (here) for reference. You can use ViT-Lens to extract and compare features across modalities:
    import os
    import torch
    
    from open_clip import ModalityType
    from mm_vit_lens import ViTLens
    
    here = os.path.abspath(os.path.dirname(__file__))
    
    model = ViTLens(modality_loaded=[ModalityType.IMAGE, ModalityType.AUDIO, ModalityType.TEXT, ModalityType.PC])
    
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    
    # Example 1
    images = [
        os.path.join(here, "assets/example/image_bird.jpg"),
        os.path.join(here, "assets/example/image_fire.jpg"),
        os.path.join(here, "assets/example/image_dog.jpg"),
        os.path.join(here, "assets/example/image_beach.jpg"),
    ]
    audios = [
        os.path.join(here, "assets/example/audio_chirping_birds.flac"),
        os.path.join(here, "assets/example/audio_crackling_fire.flac"),
        os.path.join(here, "assets/example/audio_dog.flac"),
        os.path.join(here, "assets/example/audio_sea_wave.flac"),
    ]
    texts = [
        "a bird",
        "crackling fire",
        "a dog",
        "sea wave",
    ]
    inputs_1 = {
        ModalityType.IMAGE: images,
        ModalityType.AUDIO: audios,
        ModalityType.TEXT: texts,
    }
    
    with torch.no_grad(), torch.cuda.amp.autocast():
        outputs_1 = model.encode(inputs_1, normalize=True)
    
    sim_at = torch.softmax(100 * outputs_1[ModalityType.AUDIO] @ outputs_1[ModalityType.TEXT].T, dim=-1)
    print(
        "Audio x Text:\n",
        sim_at
    )
    # Expected output
    # Audio x Text:
    #  tensor([[9.9998e-01, 9.3977e-07, 2.1545e-05, 9.3642e-08],
    #         [3.8017e-09, 1.0000e+00, 3.1551e-09, 6.9498e-10],
    #         [9.4895e-03, 1.3270e-06, 9.9051e-01, 2.5545e-07],
    #         [9.7020e-06, 6.4767e-07, 2.8860e-06, 9.9999e-01]], device='cuda:0')
    
    sim_ai = torch.softmax(100 * outputs_1[ModalityType.AUDIO] @ outputs_1[ModalityType.IMAGE].T, dim=-1)
    print(
        "Audio x Image:\n",
        sim_ai
    )
    # Expected output
    # Audio x Image:
    #  tensor([[1.0000e+00, 1.5798e-06, 2.0614e-06, 1.6502e-07],
    #         [2.3712e-09, 1.0000e+00, 1.4446e-10, 1.2260e-10],
    #         [4.9333e-03, 1.2942e-02, 9.8212e-01, 1.8582e-06],
    #         [6.8347e-04, 1.0547e-02, 1.3476e-05, 9.8876e-01]], device='cuda:0')
    
    
    # Example 2
    pcs = [
        os.path.join(here, "assets/example/pc_car_0260.npy"),
        os.path.join(here, "assets/example/pc_guitar_0243.npy"),
        os.path.join(here, "assets/example/pc_monitor_0503.npy"),
        os.path.join(here, "assets/example/pc_person_0102.npy"),
        os.path.join(here, "assets/example/pc_piano_0286.npy"),
    ]
    text_pcs = ["a car", "a guitar", "a monitor", "a person", "a piano"]
    inputs_2 = {
        ModalityType.PC: pcs,
        ModalityType.TEXT: text_pcs,
    }
    with torch.no_grad(), torch.cuda.amp.autocast():
        outputs_2 = model.encode(inputs_2, normalize=True)
    sim_pc_t = torch.softmax(100 * outputs_2[ModalityType.PC] @ outputs_2[ModalityType.TEXT].T, dim=-1)
    print(
        "PointCould x Text:\n",
        sim_pc_t
    )
    # Expected output:
    # PointCloud x Text:
    #  tensor([[9.9945e-01, 1.0483e-05, 1.4904e-04, 2.3988e-05, 3.7041e-04],
    #         [1.2574e-09, 1.0000e+00, 6.8450e-09, 2.6463e-08, 3.3659e-07],
    #         [6.2730e-09, 1.9918e-06, 9.9999e-01, 6.7161e-06, 4.9279e-06],
    #         [1.8846e-06, 7.4831e-06, 4.4594e-06, 9.9998e-01, 7.9092e-06],
    #         [1.2218e-08, 1.5571e-06, 1.8991e-07, 1.7521e-08, 1.0000e+00]],
    #        device='cuda:0')
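
As a small follow-up, the softmax similarity matrices above can be turned into top-1 retrieval predictions. This sketch reuses only names already defined in the example (sim_at, sim_pc_t, audios, texts, pcs, text_pcs):

# Turn the similarity matrices from the example into top-1 matches.
audio_to_text = sim_at.argmax(dim=-1)   # best-matching text index per audio clip
pc_to_text = sim_pc_t.argmax(dim=-1)    # best-matching text index per point cloud

for i, j in enumerate(audio_to_text.tolist()):
    print(f"{os.path.basename(audios[i])} -> {texts[j]}")
for i, j in enumerate(pc_to_text.tolist()):
    print(f"{os.path.basename(pcs[i])} -> {text_pcs[j]}")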

πŸ“¦ Datasets

Please refer to DATASETS.md for dataset preparation.

πŸš€ Training & Inference

Please refer to TRAIN_INFERENCE.md for details.

🧩 Model Zoo

Please refer to MODEL_ZOO.md for details.

πŸ‘€ Visualization of Demo

[ Plug ViT-Lens into SEED: Video Demo ]
[ Plug ViT-Lens into SEED: enabling compound Any-to-Image Generation ]
[ Plug ViT-Lens into InstructBLIP: Video Demo ]
[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following (example 1) ]
[ Plug ViT-Lens into InstructBLIP: enabling Any instruction following (example 2) ]
[ Example: Plug 3D lens to LLM (plant) ]
[ Example: Plug 3D lens to LLM (piano) ]

πŸŽ“ Citation

If you find our work helpful, please give us a star 🌟 and consider citing:

@article{lei2023vitlens,
  title={Vit-lens: Towards omni-modal representations},
  author={Lei, Weixian and Ge, Yixiao and Zhang, Jianfeng and Sun, Dylan and Yi, Kun and Shan, Ying and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2308.10185},
  year={2023}
}
@article{lei2023vitlens-2,
  title={ViT-Lens-2: Gateway to Omni-modal Intelligence},
  author={Lei, Weixian and Ge, Yixiao and Yi, Kun and Zhang, Jianfeng and Gao, Difei and Sun, Dylan and Ge, Yuying and Shan, Ying and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2311.16081},
  year={2023}
}

βœ‰οΈ Contact

Questions and discussions are welcome via [email protected] or by opening an issue.

πŸ™ Acknowledgement

This codebase is based on open_clip, ULIP, OpenShape and LAVIS. Big thanks to the authors for their awesome contributions!


vit-lens's Issues

Something about Training Methodologies and Experimental Approaches for Video Data

I'm thoroughly impressed with your project and I'm eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not provide the relevant usage information. Could you kindly publish the associated training methodologies or experimental approaches? Your assistance would be greatly appreciated. Thank you!

reproduce evaluation results

Hi,
Thank you for the great open-source work.

However, I am currently facing some difficulties in reproducing the evaluation results, particularly regarding the scene classification on NYU-D and SUN-D. I have attached the results I obtained after executing the provided script.
Could you please assist me in identifying any possible steps or details that I might have missed, leading to this inconsistency in accuracy?
[attached: screenshot of evaluation results]

Training Time and GPU usages

Hi,

Thanks for the exceptional work you have presented. It is truly remarkable and contributes significantly to the field.

After reviewing your paper, I noted the mention of experiments being conducted on 32GB V100 GPU clusters. However, can you please give more details of the resources utilized for this project? Could you kindly provide information on the total training time and the exact number of GPUs employed during this period?

Thanks a lot.

plug in problem

The tensor output by ViT-Lens is 1Γ—768 for each modality input, right? Where in InstructBLIP should I plug it in? Thanks!

Training code or training parameter configurations

Hi,

I'm very interested in your great work and am trying to train your model on my own data. As this repo currently contains only the inference code, I'm wondering if you could share the training code, or the training parameter configuration, especially the parameters of the perceiver. Thanks a lot!

Alternate depth normalization

The justification in the paper for using disparity is "scale normalization". I know that this comes from OmniVore and ImageBind.
However, this does not actually achieve scale normalization.

What could scale normalization mean? Disparity images are not scale invariant in the way RGB images are: if you bring a thing closer it will have larger disparities, as opposed to RGB images, where colors stay the same. Instead it must mean something like: two images with the same "disparity" should take up the same number of pixels.

To achieve this, you should use f/depth instead of bf/depth. This makes sense because b is an arbitrary value associated with the particular camera setup that you have, and it provides you no information about the geometry of the scene you are looking at. If you change b physically, the depth image does not change, but the disparity does.

One other suggested improvement: when you resize to 224, you're actually implicitly changing the focal length. So if h is the original height of the image, I would suggest computing "disparity" as

(224/h)f/depth

If normalization is having any positive effect, I bet this improved normalization will do better.
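
To make the proposal concrete, here is a minimal sketch of the three normalizations discussed above; the baseline-free and resize-aware variants are this issue's suggestion, not necessarily what the released code implements.

import numpy as np

def disparity_bf_over_depth(depth, f, b):
    # bf/depth: the convention described above, which depends on the camera baseline b.
    return b * f / np.clip(depth, 1e-6, None)

def disparity_f_over_depth(depth, f):
    # f/depth: the suggested baseline-free normalization.
    return f / np.clip(depth, 1e-6, None)

def disparity_resize_aware(depth, f, orig_height, target_size=224):
    # (224/h) * f/depth: additionally account for the implicit focal-length change
    # introduced by resizing the image to 224 pixels.
    return (target_size / orig_height) * f / np.clip(depth, 1e-6, None)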

SUN RGB-D is not in millimeters

I was trying to apply this model to my own data and not getting good results. I ran the NYUv2 dataset through my code, and the results seem to be in line with those reported in the ViT-Lens paper.

Digging into it, the issue is - at least partly - that the NYUv2 data is not in millimeters. Here is the matlab code for converting the png files to mm that is in the SUNRGBDtoolbox (https://rgbd.cs.princeton.edu/):

depthVis = imread(data.depthpath);
imsize = size(depthVis);
depthInpaint = bitor(bitshift(depthVis,-3), bitshift(depthVis,16-3));

In other words, the data in the png files is a circular shift left by 3 bits of the depth in mm (which for most data is just multiplying by 8).

I mention this because the code in #9 seems to indicate that it is assumed that the data is in mm. It might be important if other datasets get used that are in mm and not the SUN RGB-D format.
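
For readers working in Python, a rough equivalent of the MATLAB conversion above (a sketch assuming the depth PNGs are read as uint16 and that Pillow and numpy are available):

import numpy as np
from PIL import Image

def sunrgbd_png_to_mm(png_path):
    # The stored 16-bit value is the depth in mm rotated left by 3 bits, so undo it
    # with a 3-bit right rotation: (x >> 3) | (x << 13), masked back to 16 bits.
    raw = np.asarray(Image.open(png_path)).astype(np.uint32)
    depth_mm = ((raw >> 3) | (raw << 13)) & 0xFFFF
    return depth_mm.astype(np.uint16)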

What kind of textual prompts do you use during the training period?

Hi,

thanks for your great work! I'm trying to adapt ViT-Lens to a customized dataset, and I hope to align the textual prompts used at inference with those used during training. Could you please share the prompt template, which may help improve performance?

Point cloud and text outputs are incorrect

Running example.py gives the following results:
PointCloud x Text:
tensor([[8.5200e-04, 9.5644e-02, 5.8601e-01, 1.9369e-02, 2.9812e-01],
[8.8911e-04, 1.7004e-01, 3.2570e-01, 1.1302e-02, 4.9207e-01],
[2.9327e-04, 6.9276e-02, 4.6433e-01, 1.2254e-02, 4.5384e-01],
[1.9555e-03, 7.8262e-02, 3.8027e-01, 5.8164e-02, 4.8135e-01],
[3.0467e-04, 1.0489e-01, 4.9719e-01, 2.1044e-02, 3.7657e-01]],
device='cuda:0')

Can not load eeg ckpt

Hi, I just want to load the EEG checkpoint from your Hugging Face repo, but it seems that many keys do not match. Could you double-check on your side whether it can be loaded successfully?

InstructBLIP and SEED Implementation

Hi, I have checked the CLIP vision embedding (last hidden state) of BLIP-2/InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257Γ—1408. However, the multi-modal matching space of ViT-Lens uses a 1Γ—768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly; have they been fine-tuned?

Reproducing NYUv2 Results

This code documents the processing pipeline well, but it starts from disparity images, whereas NYUv2 provides depth images.
What baseline and focal length are you using for converting NYUv2.D to disparity? My best guess is

f = 518.857901
b = 75

However, that seems like it could be off by an order of magnitude. Help would be appreciated.
