
LanguageBind

If you like our project, please give us a star ✨ on GitHub for the latest updates.

📖 Paper | 🤗 Demo | 🤖 Model zoo | 📄 Instruction | 💥 Datasets

  • The first figure below shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to unlimited modalities.
  • The second figure shows our proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

📰 News

[2023.10.04] 📝 Code, checkpoints and demo are available now! Welcome to watch this repository for the latest updates.

🤗 Demo

  • Local demo. We highly recommend trying out our web demo, which incorporates all features currently supported by LanguageBind.
python gradio_app.py --languagebind_weight LanguageBind.pt
  • Online demo. We provide an online demo in Hugging Face Spaces. In this demo, you can calculate the similarity of modalities to language, such as audio-to-language, video-to-language, and depth-to-language.

😮 Highlights

💡 High performance, but NO intermediate modality required

LanguageBind is a language-centric multimodal pretraining approach that takes language as the bind across different modalities, because the language modality is well explored and contains rich semantics.
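
To make this concrete, below is a minimal sketch of a language-centric contrastive (InfoNCE-style) objective in PyTorch, where each non-language modality is aligned to the language embeddings of its paired texts. The function name and the temperature value are illustrative assumptions, not the repository's actual training code.

import torch
import torch.nn.functional as F

def language_bind_loss(modality_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE between one modality and language.

    modality_emb, language_emb: (batch, dim) embeddings of paired samples;
    the i-th modality sample is the positive for the i-th caption.
    """
    m = F.normalize(modality_emb, dim=-1)          # cosine similarity via
    l = F.normalize(language_emb, dim=-1)          # L2-normalized dot products
    logits = m @ l.T / temperature                 # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the modality-to-language and language-to-modality cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Each modality (video, infrared, depth, audio) is bound to language with this
# kind of loss independently, so no modality-to-modality pairs are required.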

โšก๏ธ A multimodal, fully aligned and voluminous dataset

We propose VIDAL-10M, a dataset of 10 million samples with video, infrared, depth, audio and their corresponding language, which greatly expands the data beyond visual modalities.

🔥 Multi-view enhanced description for training

We make multi-view enhancements to the language. We produce multi-view descriptions that combine meta-data, spatial, and temporal information to greatly enrich the semantics of the language. In addition, we further enhance the language with ChatGPT to create a good semantic space for the language aligned with each modality.
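
As a purely illustrative sketch (the field names and the ChatGPT step below are hypothetical placeholders, not the actual VIDAL-10M pipeline), a multi-view description can be thought of as the concatenation of the meta-data, spatial, and temporal views of a clip before language enhancement:

def build_multiview_description(meta, spatial_caption, temporal_caption):
    """Hypothetical sketch: merge several textual views of one clip."""
    views = [
        meta.get("title", ""),                # raw meta-data: title
        " ".join(meta.get("hashtags", [])),   # raw meta-data: hashtags
        spatial_caption,                      # spatial view, e.g. keyframe caption
        temporal_caption,                     # temporal view, e.g. clip-level caption
    ]
    description = ". ".join(v for v in views if v)
    # The merged text is then further refined (e.g. with ChatGPT) to obtain the
    # final enhanced description; that step is omitted in this sketch.
    return description

print(build_multiview_description(
    {"title": "Sunset over the bay", "hashtags": ["#ocean", "#drone"]},
    "an aerial shot of a calm bay at dusk",
    "the camera pans across the water as the sun sets",
))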

🤖 Model Zoo

  • We list the pretrained checkpoints of LanguageBind below. We provide an aggregated weight (LanguageBind) for the online demo and inference. Additionally, LanguageBind can be disassembled into different branches to handle different tasks (see the sketch after the table below).
  • We additionally trained a Video-Language model with the LanguageBind method, which is stronger than the one trained on the CLIP4Clip framework.
  • The cache comes from OpenCLIP, which we downloaded from Hugging Face. Note that the cache for pretrained weights is essentially the Image-Language weights, plus a few additional HF configuration files.
| Model | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LanguageBind | Link | Link | TODO |
| Video-Language (LanguageBind) | Link | Link | Link |
| Video-Language (CLIP4Clip) | Link | Link | Link |
| Audio-Language | Link | Link | Link |
| Depth-Language | Link | Link | Link |
| Thermal(Infrared)-Language | Link | Link | Link |
| Image-Language | Link | Link | Link |
| Cache for pretrained weight | Link | Link | Link |
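
As a sketch of the disassembly mentioned above, a single branch could be pulled out of the aggregated weight by filtering the state dict. The checkpoint layout and the key prefixes used here are assumptions; check the released files for the real structure.

import torch

# Assumption: LanguageBind.pt is a flat state dict whose parameter names are
# prefixed per branch (e.g. 'audio.', 'language.'); adjust to the real layout.
state_dict = torch.load("LanguageBind.pt", map_location="cpu")

def extract_branch(state_dict, prefix):
    """Keep only parameters whose names start with the given prefix."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

audio_language = {
    "audio": extract_branch(state_dict, "audio."),        # hypothetical prefix
    "language": extract_branch(state_dict, "language."),  # hypothetical prefix
}
torch.save(audio_language, "Audio-Language.pt")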

🚀 Main Results

✨ Video-Language

We focus on reporting the parameters of the vision encoder. Our experiments are based on 3 million video-text pairs of VIDAL-10M, and we train on the CLIP4Clip framework.

✨ Multiple Modalities

Infrared-Language, Depth-Language, and Audio-Language zero-shot classification. We report the top-1 classification accuracy for all datasets.
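
For reference, zero-shot classification with such embeddings works by encoding one text prompt per class (e.g. "a photo of a {class}.") and assigning each sample to its most similar class prompt. The sketch below assumes pre-computed embeddings and is not the repository's evaluation script.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top1(modality_emb, class_text_emb, labels):
    """Top-1 accuracy of zero-shot classification.

    modality_emb: (N, D) embeddings of the evaluated modality (infrared/depth/audio).
    class_text_emb: (C, D) language embeddings of the class prompts.
    labels: (N,) ground-truth class indices.
    """
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(class_text_emb, dim=-1)
    pred = (m @ t.T).argmax(dim=-1)   # most similar class prompt per sample
    return (pred == labels).float().mean().item()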

🛠️ Requirements and Installation

  • Python >= 3.8
  • PyTorch >= 1.13.0
  • CUDA Version >= 10.2 (11.6 is recommended)
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install -r requirements.txt

🪧 Usage

We open-source all the modality preprocessing code. Here is a simple script for multi-modal inference with LanguageBind.

# The helpers used here (get_tokenizer, the get_*_transform and
# load_and_transform_* functions, stack_dict) and the variables args, device
# and model are provided by this repository; see inference.py for the full script.
import torch

modality_transform = {
    'language': get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir),
    'video': get_video_transform(args),
    'audio': get_audio_transform(args),
    'depth': get_depth_transform(args),
    'thermal': get_thermal_transform(args),
    'image': get_image_transform(args),
}

image = ['image1.jpg', 'image2.jpg']
audio = ['audio1.wav', 'audio2.wav']
video = ['video1.mp4', 'video2.mp4']
depth = ['depth1.png', 'depth2.png']
thermal = ['thermal1.jpg', 'thermal2.jpg']
language = ['text1', 'text2']

inputs = {
    'image': stack_dict([load_and_transform_image(i, modality_transform['image']) for i in image], device),
    'video': stack_dict([load_and_transform_video(i, modality_transform['video']) for i in video], device),
    'audio': stack_dict([load_and_transform_audio(i, modality_transform['audio']) for i in audio], device),
    'thermal': stack_dict([load_and_transform_thermal(i, modality_transform['thermal']) for i in thermal], device),
    'depth': stack_dict([load_and_transform_depth(i, modality_transform['depth']) for i in depth], device),
    'language': stack_dict([load_and_transform_text(i, modality_transform['language']) for i in language], device),
}

with torch.no_grad():
    embeddings = model(inputs)

print("Video x Language:\n", torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Image x Language:\n", torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Depth x Language:\n", torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Audio x Language:\n", torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Thermal x Language:\n", torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

More details are in inference.py. Run the following command to start.

python inference.py --languagebind_weight LanguageBind.pt

💥 VIDAL-10M

The dataset instructions are in DATASETS.md.

🗝️ Training & Validating

The training & validating instructions are in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

  • OpenCLIP: an open-source pretraining framework.
  • CLIP4Clip: an open-source video-text retrieval framework.
  • sRGB-TIR: an open-source framework to generate infrared (thermal) images.
  • GLPN: an open-source framework to generate depth images.

🔒 License

  • The majority of this project is released under the MIT license as found in the LICENSE file.
  • The dataset of this project is released under the CC-BY-NC 4.0 license as found in the DATASET_LICENSE file.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
