
LanguageBind

If you like our project, please give us a star ✨ on GitHub for the latest updates.

📖 Paper | 🤗 Demo | 🤖 Model zoo | 📄 Instruction | 💥 Datasets

  • The first figure below shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to unlimited modalities.
  • The second figure shows our proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

📰 News

[2023.10.04] 📝 Code, checkpoints and demo are available now! Welcome to watch this repository for the latest updates.

🤗 Demo

  • Local demo. We highly recommend trying out our web demo, which incorporates all features currently supported by LanguageBind.
python gradio_app.py --languagebind_weight LanguageBind.pt
  • Online demo. We provide an online demo in Hugging Face Spaces. In this demo, you can calculate the similarity of modalities to language, such as audio-to-language, video-to-language, and depth-to-language.

😮 Highlights

💡 High performance, but NO intermediate modality required

LanguageBind is a language-centric multimodal pretraining approach that takes language as the bind across different modalities, because the language modality is well explored and contains rich semantics.
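
To make this concrete, below is a minimal sketch of a language-centric contrastive (InfoNCE-style) objective in PyTorch, where each non-language modality is aligned to the language embeddings of its paired texts. The function name and the temperature value are illustrative assumptions, not the repository's actual training code.

import torch
import torch.nn.functional as F

def language_bind_loss(modality_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE between one modality and language.

    modality_emb, language_emb: (batch, dim) embeddings of paired samples;
    the i-th modality sample is the positive for the i-th caption.
    """
    m = F.normalize(modality_emb, dim=-1)          # cosine similarity via
    l = F.normalize(language_emb, dim=-1)          # L2-normalized dot products
    logits = m @ l.T / temperature                 # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the modality-to-language and language-to-modality cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Each modality (video, infrared, depth, audio) is bound to language with this
# kind of loss independently, so no modality-to-modality pairs are required.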

โšก๏ธ A multimodal, fully aligned and voluminous dataset

We propose VIDAL-10M, a dataset of 10 million samples with video, infrared, depth, audio and their corresponding language, which greatly expands the data beyond visual modalities.

🔥 Multi-view enhanced description for training

We make multi-view enhancements to the language. We produce multi-view descriptions that combine meta-data, spatial, and temporal information to greatly enrich the semantics of the language. In addition, we further enhance the language with ChatGPT to create a good semantic space for the language aligned with each modality.
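
As a purely illustrative sketch (the field names and the ChatGPT step below are hypothetical placeholders, not the actual VIDAL-10M pipeline), a multi-view description can be thought of as the concatenation of the meta-data, spatial, and temporal views of a clip before language enhancement:

def build_multiview_description(meta, spatial_caption, temporal_caption):
    """Hypothetical sketch: merge several textual views of one clip."""
    views = [
        meta.get("title", ""),                # raw meta-data: title
        " ".join(meta.get("hashtags", [])),   # raw meta-data: hashtags
        spatial_caption,                      # spatial view, e.g. keyframe caption
        temporal_caption,                     # temporal view, e.g. clip-level caption
    ]
    description = ". ".join(v for v in views if v)
    # The merged text is then further refined (e.g. with ChatGPT) to obtain the
    # final enhanced description; that step is omitted in this sketch.
    return description

print(build_multiview_description(
    {"title": "Sunset over the bay", "hashtags": ["#ocean", "#drone"]},
    "an aerial shot of a calm bay at dusk",
    "the camera pans across the water as the sun sets",
))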

🤖 Model Zoo

  • We list the pretrained checkpoints of LanguageBind below. We provide an aggregated weight (LanguageBind) for the online demo and inference. Additionally, LanguageBind can be disassembled into different branches to handle different tasks (see the sketch after the table below).
  • We additionally trained a Video-Language model with the LanguageBind method, which is stronger than the one trained on the CLIP4Clip framework.
  • The cache comes from OpenCLIP, which we downloaded from Hugging Face. Note that the cache for pretrained weights is essentially the Image-Language weights, plus a few additional HF configuration files.
| Model | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LanguageBind | Link | Link | TODO |
| Video-Language (LanguageBind) | Link | Link | Link |
| Video-Language (CLIP4Clip) | Link | Link | Link |
| Audio-Language | Link | Link | Link |
| Depth-Language | Link | Link | Link |
| Thermal(Infrared)-Language | Link | Link | Link |
| Image-Language | Link | Link | Link |
| Cache for pretrained weight | Link | Link | Link |
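
As a sketch of the disassembly mentioned above, a single branch could be pulled out of the aggregated weight by filtering the state dict. The checkpoint layout and the key prefixes used here are assumptions; check the released files for the real structure.

import torch

# Assumption: LanguageBind.pt is a flat state dict whose parameter names are
# prefixed per branch (e.g. 'audio.', 'language.'); adjust to the real layout.
state_dict = torch.load("LanguageBind.pt", map_location="cpu")

def extract_branch(state_dict, prefix):
    """Keep only parameters whose names start with the given prefix."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}

audio_language = {
    "audio": extract_branch(state_dict, "audio."),        # hypothetical prefix
    "language": extract_branch(state_dict, "language."),  # hypothetical prefix
}
torch.save(audio_language, "Audio-Language.pt")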

🚀 Main Results

✨ Video-Language

We focus on reporting the parameters of the vision encoder. Our experiments are based on 3 million video-text pairs of VIDAL-10M, and we train on the CLIP4Clip framework.

✨ Multiple Modalities

Infrared-Language, Depth-Language, and Audio-Language zero-shot classification. We report the top-1 classification accuracy for all datasets.
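
For reference, zero-shot classification with such embeddings works by encoding one text prompt per class (e.g. "a photo of a {class}.") and assigning each sample to its most similar class prompt. The sketch below assumes pre-computed embeddings and is not the repository's evaluation script.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_top1(modality_emb, class_text_emb, labels):
    """Top-1 accuracy of zero-shot classification.

    modality_emb: (N, D) embeddings of the evaluated modality (infrared/depth/audio).
    class_text_emb: (C, D) language embeddings of the class prompts.
    labels: (N,) ground-truth class indices.
    """
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(class_text_emb, dim=-1)
    pred = (m @ t.T).argmax(dim=-1)   # most similar class prompt per sample
    return (pred == labels).float().mean().item()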

🛠️ Requirements and Installation

  • Python >= 3.8
  • PyTorch >= 1.13.0
  • CUDA Version >= 10.2 (11.6 is recommended)
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install -r requirements.txt

🪧 Usage

We open-source all the modality preprocessing code. Here is a simple script for multi-modal inference with LanguageBind.

# The helpers used here (get_tokenizer, the get_*_transform and
# load_and_transform_* functions, stack_dict) and the variables args, device
# and model are provided by this repository; see inference.py for the full script.
import torch

modality_transform = {
    'language': get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir),
    'video': get_video_transform(args),
    'audio': get_audio_transform(args),
    'depth': get_depth_transform(args),
    'thermal': get_thermal_transform(args),
    'image': get_image_transform(args),
}

image = ['image1.jpg', 'image2.jpg']
audio = ['audio1.wav', 'audio2.wav']
video = ['video1.mp4', 'video2.mp4']
depth = ['depth1.png', 'depth2.png']
thermal = ['thermal1.jpg', 'thermal2.jpg']
language = ['text1', 'text2']

inputs = {
    'image': stack_dict([load_and_transform_image(i, modality_transform['image']) for i in image], device),
    'video': stack_dict([load_and_transform_video(i, modality_transform['video']) for i in video], device),
    'audio': stack_dict([load_and_transform_audio(i, modality_transform['audio']) for i in audio], device),
    'thermal': stack_dict([load_and_transform_thermal(i, modality_transform['thermal']) for i in thermal], device),
    'depth': stack_dict([load_and_transform_depth(i, modality_transform['depth']) for i in depth], device),
    'language': stack_dict([load_and_transform_text(i, modality_transform['language']) for i in language], device),
}

with torch.no_grad():
    embeddings = model(inputs)

print("Video x Language:\n", torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Image x Language:\n", torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Depth x Language:\n", torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Audio x Language:\n", torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Thermal x Language:\n", torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

More details are in inference.py. Run the following command to start.

python inference.py --languagebind_weight LanguageBind.pt

💥 VIDAL-10M

The dataset instructions are in DATASETS.md.

🗝️ Training & Validating

The training & validating instructions are in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

  • OpenCLIP: an open-source pretraining framework.
  • CLIP4Clip: an open-source video-text retrieval framework.
  • sRGB-TIR: an open-source framework to generate infrared (thermal) images.
  • GLPN: an open-source framework to generate depth images.

🔒 License

  • The majority of this project is released under the MIT license as found in the LICENSE file.
  • The dataset of this project is released under the CC-BY-NC 4.0 license as found in the DATASET_LICENSE file.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
