
macaw-llm's Introduction


Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration


¹ ² Chenyang Lyu, ³ Minghao Wu, ¹ * Longyue Wang, ¹ Xinting Huang,

¹ Bingshuai Liu, ¹ Zefeng Du, ¹ Shuming Shi, ¹ Zhaopeng Tu

¹ Tencent AI Lab, ² Dublin City University, ³ Monash University

*Longyue Wang is the corresponding author: [email protected]

Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image🖼️, video📹, audio🎵, and text📝 data, built upon the foundations of CLIP, Whisper, and LLaMA.

📰 Paper 🏗️ Model (via dropbox) 🏗️ Model (via weiyun) 🗃️ Dataset 🧱 Code 🧐 Video 🧑‍💻 Demo


Introduction


In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, videos, audio, and text, remains a challenging task. Macaw-LLM is an early model of its kind, bringing together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.

Key Features 🔑

Macaw-LLM boasts the following unique features:

  1. Simple & Fast Alignment: Macaw-LLM enables seamless integration of multi-modal data through simple and fast alignment to LLM embeddings. This efficient process ensures quick adaptation of diverse data types.
  2. One-Stage Instruction Fine-Tuning: Our model streamlines the adaptation process through one-stage instruction fine-tuning, promoting a more efficient learning experience.
  3. New Multi-modal Instruction Dataset: We create a new multi-modal instruction dataset that covers diverse instructional tasks leveraging image and video modalities, which facilitates future work on multi-modal LLMs.

Architecture

Macaw-LLM is composed of three main components:

  1. CLIP: Responsible for encoding images and video frames.
  2. Whisper: Responsible for encoding audio data.
  3. LLM (LLaMA/Vicuna/Bloom): The language model that encodes instructions and generates responses.

The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
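
For concreteness, below is a minimal sketch of assembling these three components with Hugging Face transformers; the checkpoint identifiers follow the snippets quoted in the issues on this page and are assumptions rather than the exact checkpoints shipped with Macaw-LLM.

from transformers import CLIPModel, WhisperForConditionalGeneration, LlamaModel

# Visual encoder for images and video frames
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# Audio encoder
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
# Language model that encodes instructions and generates responses
llama_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")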

Alignment Strategy

Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves the following steps (a minimal sketch follows the list):

  1. Encoding multi-modal features with CLIP and Whisper.
  2. Feeding the encoded features into an attention function, wherein the multi-modal features serve as the query and the embedding matrix of LLaMA as the key and value.
  3. Injecting the outputs into the input sequence (before instruction tokens) of LLaMA, allowing for a streamlined alignment process with minimal additional parameters.
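
Below is a minimal, hedged sketch of this alignment step, not the repository's exact implementation: the module name, projection layer, and tensor shapes are illustrative assumptions. Multi-modal features from CLIP or Whisper are projected to the LLM hidden size, attended against the LLaMA embedding matrix, and the result is prepended to the instruction embeddings.

import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Illustrative bridge from one modality's encoder outputs to the LLM embedding space."""
    def __init__(self, feature_dim: int, llm_hidden_size: int, num_heads: int = 32):
        super().__init__()
        self.proj = nn.Linear(feature_dim, llm_hidden_size)   # map encoder features to the LLM hidden size
        self.attn = nn.MultiheadAttention(llm_hidden_size, num_heads, batch_first=True)

    def forward(self, mm_features: torch.Tensor, llm_embedding_weight: torch.Tensor) -> torch.Tensor:
        # mm_features: (batch, num_mm_tokens, feature_dim) from CLIP or Whisper
        # llm_embedding_weight: (vocab_size, hidden), e.g. llama.get_input_embeddings().weight
        query = self.proj(mm_features)                                        # (B, T, H)
        kv = llm_embedding_weight.unsqueeze(0).expand(query.size(0), -1, -1)  # (B, V, H)
        aligned, _ = self.attn(query, kv, kv)                                 # (B, T, H)
        return aligned

# Usage sketch: prepend the aligned tokens to the instruction embeddings
# text_embeds = llama.get_input_embeddings()(input_ids)      # (B, L, H)
# inputs_embeds = torch.cat([aligned, text_embeds], dim=1)   # fed to the LLM

In this sketch, only the projection and attention weights are new parameters, which is consistent with the "minimal additional parameters" point above.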

New Multi-modal Instruction Dataset 🆕

In this project, we generate a dataset using GPT-3.5-Turbo by providing image or video captions as prompts. To create this dataset, we use captions from the MS COCO dataset for images and the Charades and AVSD datasets for videos. Our dataset consists of approximately 69K examples based on COCO image captions and 50K examples based on Charades and AVSD video captions. We currently focus on single-turn dialogues but plan to expand into multi-turn dialogues and diverse multi-modal content in the future, which will enrich the dataset and improve fine-tuning of large language models (LLMs).
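
For illustration, here is a hedged sketch of the caption-to-instruction generation step, assuming the pre-1.0 openai Python package; the prompt wording and function name are illustrative only, since the authors' actual prompt is not reproduced in this README.

import openai

def caption_to_instruction_example(caption: str) -> str:
    # Illustrative prompt; the prompt actually used to build the dataset is not shown here.
    prompt = (
        "Based on the following caption, write one instruction-style question "
        "about the scene and a natural, single-turn answer.\n\n"
        f"Caption: {caption}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

# e.g. caption_to_instruction_example("A man riding a wave on top of a surfboard.")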

Installation

To install Macaw-LLM, follow these steps:

# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git

# Change to the Macaw-LLM directory
cd Macaw-LLM

# Install required packages
pip install -r requirements.txt

# Install ffmpeg (yum on CentOS/RHEL; on Debian/Ubuntu use "apt-get install ffmpeg" instead)
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
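
After installation, a quick sanity check such as the following can confirm that the key dependencies resolve (a minimal sketch; the import names assume the openai-whisper and transformers packages from requirements.txt):

# sanity_check.py - verify that the core dependencies import correctly
import torch
import transformers
import whisper  # provided by the openai-whisper package

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("whisper models:", whisper.available_models()[:3])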

Usage 🚀

  1. Downloading dataset:

  2. Dataset preprocessing:

    • Place the data for the three modalities in the corresponding folders: data/text/, data/image/, and data/video/
    • Extract frames and audio from videos (a hedged sketch of this step follows the list):
      python preprocess_data.py
      
    • Transform supervised data to dataset:
      python preprocess_data_supervised.py
      
    • Transform unsupervised data to dataset:
      python preprocess_data_unsupervised.py
      
  3. Training:

    • Execute the training script (you can specify the training parameters inside):
      ./train.sh
      
  4. Inference:

    • Execute the inference script (you can give any customized inputs inside):
      ./inference.sh
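
As mentioned in step 2, here is a hedged sketch of what the frame/audio extraction stage conceptually does; it is not the repository's preprocess_data.py, and the function names are hypothetical. It samples a fixed number of frames with OpenCV and dumps a 16 kHz mono audio track with ffmpeg-python, both of which appear in the requirements.

import cv2
import ffmpeg  # ffmpeg-python

def extract_frames(video_path: str, n_frames: int = 6):
    """Sample n_frames roughly evenly spaced frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // n_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
        if len(frames) == n_frames:
            break
    cap.release()
    return frames

def extract_audio(video_path: str, audio_path: str):
    """Dump the audio track as 16 kHz mono WAV, matching what Whisper expects."""
    ffmpeg.input(video_path).output(audio_path, ac=1, ar=16000).overwrite_output().run(quiet=True)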
      

Examples

We present several examples that highlight the proficiency of Macaw-LLM in understanding and following multi-modal instructions. They showcase the system's ability to comprehend visual content from images and videos and to produce high-quality, fluent responses: it generates contextually relevant and informative answers to a variety of questions about the visual input, communicating about it naturally in conversation.


Future Work and Contributions 🚀

While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.

We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝

ToDo 👨‍💻

  • Evaluation: We show some examples showcasing the multi-modal ability of our Macaw-LLM. However, we acknowledge that these examples alone are not adequate to accurately and comprehensively demonstrate the model's capabilities. We aim to conduct an extensive evaluation of our system.

  • More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5. This will enable more robust and versatile processing and understanding of multi-modal data.

  • Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.

Acknowledgements 🙏

We would like to express our gratitude to the following open-source projects for their valuable contributions to Macaw-LLM:

  • Stanford Alpaca for providing the Alpaca dataset, which we used in our experiments.
  • Parrot for providing a helpful implementation of the training of LLaMA.
  • CLIP for providing a strong image and video encoding model.
  • Whisper for providing a strong audio encoding model.
  • LLaMA for providing a powerful LLM.

We would also like to thank the developers and maintainers of these projects for their dedication and hard work in making their projects open-source and accessible to the community.

Citation

@article{lyu2023macaw,
  title={Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration},
  author={Lyu, Chenyang and Wu, Minghao and Wang, Longyue and Huang, Xinting and Liu, Bingshuai and Du, Zefeng and Shi, Shuming and Tu, Zhaopeng},
  journal={arXiv preprint arXiv:2306.09093},
  year={2023}
}


macaw-llm's Issues

please update the demo code?

Hi, dear authors:
Thanks for sharing this great work. I noticed that you have uploaded the training and evaluation code, but not the demo code, e.g., for VQA. We would be grateful if you could release the demo code. Thank you.

Requirement Versions

Multiple requirements do not have their versions specified, which leads to problems during installation.

protobuf
scikit-learn
moviepy
ffmpeg-python
tqdm
pandas
opencv-python
clip
openai-whisper
appdirs
loralib
bitsandbytes
black
black[jupyter]
fire
gradio
peft
deepspeed

Question about setting pad token

Hi, may I know how to set the pad token?
In the previous version of the code, it was set to [32006]. I checked the LLaMA token files and 32006 is not used yet. Can I use any ID that has not been used before?

Different LLM backbones?

Hi, the README mentions several different LLM backbones, but the paper seems to reference only LLaMA and a brief code search didn't turn up any mentions of Vicuna or Bloom. Did you train this with other LLMs beyond LLaMA and if so, where can we find the trained weights for these?

Thank you!

Using pad_token, but it is not set yet.

Hi, when I run "preprocess_data_supervised.py" by using llama-7b-hf tokenizer, it shows "Using pad_token, but it is not set yet" and "Truncation was not explicitly activated but max_length is provided a specific value,...".

Is it ok?

Questions about Model

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It seems that your model is driven by text information (CLIP aligns images and text, while Whisper aligns audio and text), and the ultimate goal of the model appears to be more inclined towards multimodal QA and multimodal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, there may be missing modalities. Do you need to input information from all three modalities during the training/inference process of your model, or can you only input certain modalities?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards,
RitchieAlpha

Missing License File

Hello,
Thanks for sharing this work.

But the repo seems to be missing a LICENSE file and hence makes it difficult for people to decide if they can use this project in their work or not.

Has a decision regarding licensing been made?

Thanks!

missing file ”data/all_visual_names.json“

hi, thank you for making such great work open source.
However, I have encountered some issues:

  1. When I run inference.sh, there is a file-missing error for 'data/all_visual_names.json'. How can I get this file?
  2. Are there trained models we can use for inference directly?

Data filtering step

When processing the dataset, there is a filter criteria:

if 'caption' in e['instruction'] or 'caption' in e['response'] or ' no ' in e['response'] or 'not' in e['response']:
    continue

Why do we need such a filtering step?

Call for paper

Hi, I appreciate your great work! I am wondering whether there is any paper related to this project.

Always have same response

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template.format(instruction))
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k,v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    
    
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    
    print()
    print(text_embeddings.size())
        

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter what image I give to the model, it always replies "There are 5000 in the picture." for the same prompt. It seems the model just ignores the multi-modal inputs and replies based on the text alone.

Did I do anything wrong? Thank you.

What is the pad ID for tokenizer?

In the trainer file, I saw:

special_tokens = {
'': 32000,
'': 32001,
'': 32002,
'': 32003,
'': 32005,
}

But in the preprocessing files, I didn't see where these tokens are set. Instead, I printed the token IDs and found that the PAD token ID seems to be 32000. What is the potential problem? What is the pad_id for the tokenizer?

TypeError: string indices must be integers, not 'str'

preprocess_data_unsupervised.py", line 105, in preprocess_alpaca_to_tensor_dataset
texts = PROMPT_DICT['prompt_input'].format(e['instruction'], e['input']) if e['input'] != "" else PROMPT_DICT['prompt_no_input'].format(e['instruction'])

Resource problem?

I wonder, with three big models (CLIP, LLaMA, Whisper), at least how much VRAM will we need to host a demo? Is it possible to host them on a single 4090 GPU?

Performance of the model

Hello,
I tried to load the pre-trained model you provided and run the following example from AVSD data:

  {
        "instruction": "Is the woman already in the room?",
        "input": "",
        "output": "Yes ahe is already in the room",
        "image": null,
        "audio": null,
        "video": "7UPGT.mp4"
    },

Basically, to prepare the whisper model, clip model, and llama model, I used the following:

   # save whisper, clip, and llama models for future use.
from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")

clip_model.save_pretrained('pretrained_models/clip_model/')
whisper_model.save_pretrained('pretrained_models/whisper_model/')
llama7b_model.save_pretrained('pretrained_models/llama7b_model/')

To load the macaw model you provided, I used the following:

if __name__ == "__main__":
    clip_config = CLIPConfig.from_pretrained('pretrained_models/clip_model/')
    whisper_config = WhisperConfig.from_pretrained('pretrained_models/whisper_model/')
    llm_config = AutoConfig.from_pretrained('pretrained_models/llama7b_model/')
    tokenizer = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)
    llm_config.vocab_size = len(tokenizer)
    print("llm_config: ", llm_config)

    model_config = MM_LLMs_Config(
        n_frames=6,
        attention_heads=32,
        image_conv_kernel=48,
        image_conv_stride=36,
        video_conv_kernel=36,
        video_conv_stride=30,
        audio_conv_kernel=240,
        audio_conv_stride=220,
        clip_config=clip_config, whisper_config=whisper_config, llm_config=llm_config
    )

    macaw_model = MM_LLMs.from_pretrained(
        'pretrained_models/macaw/',
        config=model_config,
        # load_in_8bit=True,
        # torch_dtype=torch.float16,
        # device_map=device_map,
    )
    TOKENIZER = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)

I run the model by:

macaw_model.eval()
with torch.no_grad():
    generate_ids = macaw_model(data_item)
print("generate_ids: ", generate_ids)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("input_texts: ", input_texts)
print("generated_texts: ", generated_texts)

Then I tested the above AVSD example. What I get is:

input_texts: ['Below is an instruction that describes a task, with or without input. Write a response that appropriately completes the request.\n\n### Instruction:\nIs the woman already in the room?\n\n### Response:\n\n']
generated_texts: ['\n\n']

So you can see, the output is nonsense. I tried some other examples, and I also tried pure text input, but the results are not satisfying. May I ask what might be wrong?

which llama tokenizer to use?

In the preprocessing file, we have tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer'). This does not seem to fetch the LLaMA tokenizer from HF. Which LLaMA tokenizer should we use, given that there are several versions on HF? Thanks.

Questions about the files - which files to download

Thanks for the cool project! I have two questions:

  1. Which files exactly should we download? The COCO, VQA, etc. datasets contain many files, but I believe only some of them are needed. For example, I downloaded the following:

Stage 1:

1. Download the COCO image dataset (2014 Train images [83K/13GB]) from: https://cocodataset.org/#download, unzip to current folder (train2014/).

2. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_coco.json

3. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_avsd.json

4. Download the Charades video dataset (Data (scaled to 480p, 13 GB)) from: https://prior.allenai.org/projects/charades, unzip to current folder (Charades_v1_480/).

5. In the current folder, create a folder named "avsd/". In "./avsd/", create "./avsd/videos/", "./avsd/audios/", and "./avsd/images/". Move all the videos from "Charades_v1_480/" to "./avsd/videos/".

6. In the current folder, create a folder named "coco/". In "./coco/", create "./coco/images/". Move all the images from "train2014/" to "./coco/images/".

Stage 2:

1. From https://visualqa.org/download.html download "Training annotations 2017 v2.0*", "Validation annotations 2017 v2.0*", "Training questions 2017 v2.0*", "Validation questions 2017 v2.0*". Put them in "./vqa/" and unzip.

2. From https://video-dialog.com/ download AVSD Dataset (4 files), put them into "./avsd/".

But I'm not sure whether this is all we need.

  2. In combine_visual_and_audio_names() of the supervised preprocessing script, there is:

def add_image_names(dir=None):
    all_examples = json_load(dir)['annotations']

    for ind, e in enumerate(tqdm(all_examples)):

        _image_dir = e['image_path']
        if len(_image_dir.split('_')[-1].split('.')[0]) < 12:
            i_str = _image_dir.split('_')[-1].split('.')[0]
            n_str = '0' * (12 - len(i_str)) + i_str
            _image_dir = _image_dir.replace(i_str, n_str)
However, I can't find any "image_path" field in any of the above json files.

Looking forward to your answer. Thank you.

Paths for pretrained models

Hi, can you please provide huggingface paths for the following?

clip_config = CLIPConfig.from_pretrained('trained_models/clip_model')
whisper_config = WhisperConfig.from_pretrained('trained_models/whisper_model')

I tried with openai/clip-vit-base-patch16 and openai/whisper-base but there seems to be a mismatch in shapes upon loading the model.

Thanks

Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized:

Thank you very much for your outstanding work. I encountered the following problem when loading model weights. When I used torch.load to load pytorch_model.bin, I found that this part of the weights was indeed missing.
Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized: ['video_long_self_attention.in_proj_bias', 'video_long_self_attention.bias_v', 'video_long_self_attention.in_proj_weight', 'video_long_self_attention.out_proj.bias', 'video_long_self_attention.bias_k', 'video_long_self_attention.out_proj.weight']

How to get the whisper, clip, and llama model used by macaw?

I used the following code to get the pretrained models:

from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")
clip_model.save_pretrained('trained_models/clip_model/')
whisper_model.save_pretrained('trained_models/whisper_model/')
llama7b_model.save_pretrained('trained_models/llama7b_model/')

Is this correct?

GPU Memory Requirement

Thank you for your awesome work! I would like to know at least how much GPU memory is needed to run this project. Can it run on 2×3090 GPUs?
