
macaw-llm's Introduction


Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration


¹ ² Chenyang Lyu, ³ Minghao Wu, ¹ * Longyue Wang, ¹ Xinting Huang,

¹ Bingshuai Liu, ¹ Zefeng Du, ¹ Shuming Shi, ¹ Zhaopeng Tu

¹ Tencent AI Lab, ² Dublin City University, ³ Monash University

*Longyue Wang is the corresponding author: [email protected]

Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image🖼️, video📹, audio🎵, and text📝 data, built upon the foundations of CLIP, Whisper, and LLaMA.

📰 Paper 🏗️ Model (via dropbox) 🏗️ Model (via weiyun) 🗃️ Dataset 🧱 Code 🧐 Video 🧑‍💻 Demo


Introduction


In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, videos, audio, and text, remains a challenging task. Macaw-LLM is an early model of its kind, bringing together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.

Key Features 🔑

Macaw-LLM boasts the following unique features:

  1. Simple & Fast Alignment: Macaw-LLM enables seamless integration of multi-modal data through simple and fast alignment to LLM embeddings. This efficient process ensures quick adaptation of diverse data types.
  2. One-Stage Instruction Fine-Tuning: Our model streamlines the adaptation process through one-stage instruction fine-tuning, promoting a more efficient learning experience.
  3. New Multi-modal Instruction Dataset: We create a new multi-modal instruction dataset that covers diverse instructional tasks leveraging image and video modalities, which facilitates future work on multi-modal LLMs.

Architecture

Macaw-LLM is composed of three main components:

  1. CLIP: Responsible for encoding images and video frames.
  2. Whisper: Responsible for encoding audio data.
  3. LLM (LLaMA/Vicuna/Bloom): The language model that encodes instructions and generates responses.

The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
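
For concreteness, below is a minimal sketch of assembling these three components with Hugging Face transformers; the checkpoint identifiers follow the snippets quoted in the issues on this page and are assumptions rather than the exact checkpoints shipped with Macaw-LLM.

from transformers import CLIPModel, WhisperForConditionalGeneration, LlamaModel

# Visual encoder for images and video frames
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# Audio encoder
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
# Language model that encodes instructions and generates responses
llama_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")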

Alignment Strategy

Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves the following steps (a minimal sketch follows the list):

  1. Encoding multi-modal features with CLIP and Whisper.
  2. Feeding the encoded features into an attention function, wherein the multi-modal features serve as the query and the embedding matrix of LLaMA as the key and value.
  3. Injecting the outputs into the input sequence (before instruction tokens) of LLaMA, allowing for a streamlined alignment process with minimal additional parameters.
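
Below is a minimal, hedged sketch of this alignment step, not the repository's exact implementation: the module name, projection layer, and tensor shapes are illustrative assumptions. Multi-modal features from CLIP or Whisper are projected to the LLM hidden size, attended against the LLaMA embedding matrix, and the result is prepended to the instruction embeddings.

import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Illustrative bridge from one modality's encoder outputs to the LLM embedding space."""
    def __init__(self, feature_dim: int, llm_hidden_size: int, num_heads: int = 32):
        super().__init__()
        self.proj = nn.Linear(feature_dim, llm_hidden_size)   # map encoder features to the LLM hidden size
        self.attn = nn.MultiheadAttention(llm_hidden_size, num_heads, batch_first=True)

    def forward(self, mm_features: torch.Tensor, llm_embedding_weight: torch.Tensor) -> torch.Tensor:
        # mm_features: (batch, num_mm_tokens, feature_dim) from CLIP or Whisper
        # llm_embedding_weight: (vocab_size, hidden), e.g. llama.get_input_embeddings().weight
        query = self.proj(mm_features)                                        # (B, T, H)
        kv = llm_embedding_weight.unsqueeze(0).expand(query.size(0), -1, -1)  # (B, V, H)
        aligned, _ = self.attn(query, kv, kv)                                 # (B, T, H)
        return aligned

# Usage sketch: prepend the aligned tokens to the instruction embeddings
# text_embeds = llama.get_input_embeddings()(input_ids)      # (B, L, H)
# inputs_embeds = torch.cat([aligned, text_embeds], dim=1)   # fed to the LLM

In this sketch, only the projection and attention weights are new parameters, which is consistent with the "minimal additional parameters" point above.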

New Multi-modal Instruction Dataset 🆕

In this project, we generate a dataset using GPT-3.5-Turbo by providing image or video captions as prompts. To create this dataset, we use captions from the MS COCO dataset for images and the Charades and AVSD datasets for videos. Our dataset consists of approximately 69K examples based on COCO image captions and 50K examples based on Charades and AVSD video captions. We currently focus on single-turn dialogues but plan to expand into multi-turn dialogues and diverse multi-modal content in the future, which will enrich the dataset and improve fine-tuning of large language models (LLMs).
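
For illustration, here is a hedged sketch of the caption-to-instruction generation step, assuming the pre-1.0 openai Python package; the prompt wording and function name are illustrative only, since the authors' actual prompt is not reproduced in this README.

import openai

def caption_to_instruction_example(caption: str) -> str:
    # Illustrative prompt; the prompt actually used to build the dataset is not shown here.
    prompt = (
        "Based on the following caption, write one instruction-style question "
        "about the scene and a natural, single-turn answer.\n\n"
        f"Caption: {caption}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

# e.g. caption_to_instruction_example("A man riding a wave on top of a surfboard.")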

Installation

To install Macaw-LLM, follow these steps:

# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git

# Change to the Macaw-LLM directory
cd Macaw-LLM

# Install required packages
pip install -r requirements.txt

# Install ffmpeg (yum on CentOS/RHEL; on Debian/Ubuntu use "apt-get install ffmpeg" instead)
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
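
After installation, a quick sanity check such as the following can confirm that the key dependencies resolve (a minimal sketch; the import names assume the openai-whisper and transformers packages from requirements.txt):

# sanity_check.py - verify that the core dependencies import correctly
import torch
import transformers
import whisper  # provided by the openai-whisper package

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("whisper models:", whisper.available_models()[:3])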

Usage 🚀

  1. Downloading dataset:

  2. Dataset preprocessing:

    • Place the data for the three modalities in the corresponding folders: data/text/, data/image/, and data/video/
    • Extract frames and audio from videos (a hedged sketch of this step follows the list):
      python preprocess_data.py
      
    • Transform supervised data to dataset:
      python preprocess_data_supervised.py
      
    • Transform unsupervised data to dataset:
      python preprocess_data_unsupervised.py
      
  3. Training:

    • Execute the training script (you can specify the training parameters inside):
      ./train.sh
      
  4. Inference:

    • Execute the inference script (you can give any customized inputs inside):
      ./inference.sh
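
As mentioned in step 2, here is a hedged sketch of what the frame/audio extraction stage conceptually does; it is not the repository's preprocess_data.py, and the function names are hypothetical. It samples a fixed number of frames with OpenCV and dumps a 16 kHz mono audio track with ffmpeg-python, both of which appear in the requirements.

import cv2
import ffmpeg  # ffmpeg-python

def extract_frames(video_path: str, n_frames: int = 6):
    """Sample n_frames roughly evenly spaced frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // n_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
        if len(frames) == n_frames:
            break
    cap.release()
    return frames

def extract_audio(video_path: str, audio_path: str):
    """Dump the audio track as 16 kHz mono WAV, matching what Whisper expects."""
    ffmpeg.input(video_path).output(audio_path, ac=1, ar=16000).overwrite_output().run(quiet=True)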
      

Examples

We present several examples that highlight the proficiency of Macaw-LLM in understanding and following multi-modal instructions. They showcase the system's ability to comprehend visual content from images and videos and to produce high-quality, fluent responses: it generates contextually relevant and informative answers to a variety of questions about the visual input, communicating about it naturally in conversation.


Future Work and Contributions 🚀

While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.

We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝

ToDo 👨‍💻

  • Evaluation: We show some examples showcasing the multi-modal ability of our Macaw-LLM. However, we acknowledge that these examples alone are not adequate to accurately and comprehensively demonstrate the model's capabilities. We aim to conduct an extensive evaluation of our system.

  • More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5. This will enable more robust and versatile processing and understanding of multi-modal data.

  • Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.

Acknowledgements 🙏

We would like to express our gratitude to the following open-source projects for their valuable contributions to Macaw-LLM:

  • Stanford Alpaca for providing the Alpaca dataset, which we used in our experiments.
  • Parrot for providing a helpful implementation of the training of LLaMA.
  • CLIP for providing a strong image and video encoding model.
  • Whisper for providing a strong audio encoding model.
  • LLaMA for providing a powerful LLM.

We would also like to thank the developers and maintainers of these projects for their dedication and hard work in making their projects open-source and accessible to the community.

Citation

@article{lyu2023macaw,
  title={Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration},
  author={Lyu, Chenyang and Wu, Minghao and Wang, Longyue and Huang, Xinting and Liu, Bingshuai and Du, Zefeng and Shi, Shuming and Tu, Zhaopeng},
  journal={arXiv preprint arXiv:2306.09093},
  year={2023}
}


macaw-llm's Issues

please update the demo code?

Hi, dear authors:
Thanks for sharing this great work. I noticed that you have uploaded the training and evaluation code, but not the demo code, e.g., for VQA. We would be grateful if you could release the demo code. Thank you.

Requirement Versions

Multiple requirements do not have their versions specified, which leads to problems during installation.

protobuf
scikit-learn
moviepy
ffmpeg-python
tqdm
pandas
opencv-python
clip
openai-whisper
appdirs
loralib
bitsandbytes
black
black[jupyter]
fire
gradio
peft
deepspeed

Question about setting pad token

Hi, may I know how to set the pad token?
In the previous version of the code, it was set to [32006]. I checked the LLaMA token files and 32006 is not used yet. Can I use any ID that has not been used before?

Different LLM backbones?

Hi, the README mentions several different LLM backbones, but the paper seems to reference only LLaMA and a brief code search didn't turn up any mentions of Vicuna or Bloom. Did you train this with other LLMs beyond LLaMA and if so, where can we find the trained weights for these?

Thank you!

Using pad_token, but it is not set yet.

Hi, when I run "preprocess_data_supervised.py" by using llama-7b-hf tokenizer, it shows "Using pad_token, but it is not set yet" and "Truncation was not explicitly activated but max_length is provided a specific value,...".

Is it ok?

Questions about Model

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It seems that your model is driven by text information (CLIP aligns images and text, while Whisper aligns audio and text), and the ultimate goal of the model appears to be more inclined towards multimodal QA and multimodal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, there may be missing modalities. Do you need to input information from all three modalities during the training/inference process of your model, or can you only input certain modalities?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards,
RitchieAlpha

Missing License File

Hello,
Thanks for sharing this work.

But the repo seems to be missing a LICENSE file and hence makes it difficult for people to decide if they can use this project in their work or not.

Has a decision regarding licensing been made?

Thanks!

missing file ”data/all_visual_names.json“

hi, thank you for making such great work open source.
However, I have encountered some issues:

  1. When I run inference.sh, there is a file-missing error for 'data/all_visual_names.json'. How can I get this file?
  2. Are there trained models we can use for inference directly?

Data filtering step

When processing the dataset, there is a filter criteria:

if 'caption' in e['instruction'] or 'caption' in e['response'] or ' no ' in e['response'] or 'not' in e['response']:
    continue

Why do we need such a filtering step?

Call for paper

Hi, I appreciate your great work! I am wondering whether there is any paper related to this project.

Always have same response

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template.format(instruction))
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k,v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    
    
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    
    print()
    print(text_embeddings.size())
        

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter what image I give to the model, it always replies "There are 5000 in the picture." for the same prompt. It seems the model just ignores the multi-modal inputs and replies based on the text alone.

Did I do anything wrong? Thank you.

What is the pad ID for tokenizer?

In the trainer file, I saw:

special_tokens = {
'': 32000,
'': 32001,
'': 32002,
'': 32003,
'': 32005,
}

But in the preprocessing files, I didn't see where these tokens are set. Instead, I printed the token IDs and found that the PAD token ID seems to be 32000. What is the potential problem? What is the pad_id for the tokenizer?

TypeError: string indices must be integers, not 'str'

preprocess_data_unsupervised.py", line 105, in preprocess_alpaca_to_tensor_dataset
texts = PROMPT_DICT['prompt_input'].format(e['instruction'], e['input']) if e['input'] != "" else PROMPT_DICT['prompt_no_input'].format(e['instruction'])

Resource problem?

I wonder, with three big models (CLIP, LLaMA, Whisper), at least how much VRAM will we need to host a demo? Is it possible to host them on a single 4090 GPU?

Performance of the model

Hello,
I tried to load the pre-trained model you provided and run the following example from AVSD data:

  {
        "instruction": "Is the woman already in the room?",
        "input": "",
        "output": "Yes ahe is already in the room",
        "image": null,
        "audio": null,
        "video": "7UPGT.mp4"
    },

Basically, to prepare the whisper model, clip model, and llama model, I used the following:

   # save whisper, clip, and llama models for future use.
from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")

clip_model.save_pretrained('pretrained_models/clip_model/')
whisper_model.save_pretrained('pretrained_models/whisper_model/')
llama7b_model.save_pretrained('pretrained_models/llama7b_model/')

To load the macaw model you provided, I used the following:

if __name__ == "__main__":
    clip_config = CLIPConfig.from_pretrained('pretrained_models/clip_model/')
    whisper_config = WhisperConfig.from_pretrained('pretrained_models/whisper_model/')
    llm_config = AutoConfig.from_pretrained('pretrained_models/llama7b_model/')
    tokenizer = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)
    llm_config.vocab_size = len(tokenizer)
    print("llm_config: ", llm_config)

    model_config = MM_LLMs_Config(
        n_frames=6,
        attention_heads=32,
        image_conv_kernel=48,
        image_conv_stride=36,
        video_conv_kernel=36,
        video_conv_stride=30,
        audio_conv_kernel=240,
        audio_conv_stride=220,
        clip_config=clip_config, whisper_config=whisper_config, llm_config=llm_config
    )

    macaw_model = MM_LLMs.from_pretrained(
        'pretrained_models/macaw/',
        config=model_config,
        # load_in_8bit=True,
        # torch_dtype=torch.float16,
        # device_map=device_map,
    )
    TOKENIZER = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)

I run the model by:

macaw_model.eval()
with torch.no_grad():
    generate_ids = macaw_model(data_item)
print("generate_ids: ", generate_ids)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("input_texts: ", input_texts)
print("generated_texts: ", generated_texts)

Then I tested the above AVSD example. What I get is:

input_texts: ['Below is an instruction that describes a task, with or without input. Write a response that appropriately completes the request.\n\n### Instruction:\nIs the woman already in the room?\n\n### Response:\n\n']
generated_texts: ['\n\n']

So you can see, the output is nonsense. I tried some other examples, and I also tried pure text input, but the results are not satisfying. May I ask what might be wrong?

which llama tokenizer to use?

In the preprocessing file, we have tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer'). This does not seem to fetch the LLaMA tokenizer from HF. Which LLaMA tokenizer should we use, given that there are several versions on HF? Thanks.

Questions about the files - which files to download

Thanks for the cool project! I have two questions:

  1. Which files exactly should we download? The COCO, VQA, etc. datasets contain many files, but I believe only some of them are needed. For example, I downloaded the following:

Stage 1:

1. Download the COCO image dataset (2014 Train images [83K/13GB]) from: https://cocodataset.org/#download, unzip to current folder (train2014/).

2. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_coco.json

3. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_avsd.json

4. Download the Charades video dataset (Data (scaled to 480p, 13 GB)) from: https://prior.allenai.org/projects/charades, unzip to current folder (Charades_v1_480/).

5. In the current folder, create a folder named "avsd/". In "./avsd/", create "./avsd/videos/", "./avsd/audios/", and "./avsd/images/". Move all the videos from "Charades_v1_480/" to "./avsd/videos/".

6. In the current folder, create a folder named "coco/". In "./coco/", create "./coco/images/". Move all the images from "train2014/" to "./coco/images/".

Stage 2:

1. From https://visualqa.org/download.html download "Training annotations 2017 v2.0*", "Validation annotations 2017 v2.0*", "Training questions 2017 v2.0*", "Validation questions 2017 v2.0*". Put them in "./vqa/" and unzip.

2. From https://video-dialog.com/ download AVSD Dataset (4 files), put them into "./avsd/".

But I'm not sure whether this is all we need.

  2. In combine_visual_and_audio_names() of the supervised preprocessing script, there is:

def add_image_names(dir=None):
    all_examples = json_load(dir)['annotations']

    for ind, e in enumerate(tqdm(all_examples)):

        _image_dir = e['image_path']
        if len(_image_dir.split('_')[-1].split('.')[0]) < 12:
            i_str = _image_dir.split('_')[-1].split('.')[0]
            n_str = '0' * (12 - len(i_str)) + i_str
            _image_dir = _image_dir.replace(i_str, n_str)
However, I can't find any "image_path" field in any of the above json files.

Looking forward to your answer. Thank you.

Paths for pretrained models

Hi, can you please provide huggingface paths for the following?

clip_config = CLIPConfig.from_pretrained('trained_models/clip_model')
whisper_config = WhisperConfig.from_pretrained('trained_models/whisper_model')

I tried with openai/clip-vit-base-patch16 and openai/whisper-base but there seems to be a mismatch in shapes upon loading the model.

Thanks

Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized:

Thank you very much for your outstanding work. I encountered the following problem when loading model weights. When I used torch.load to load pytorch_model.bin, I found that this part of the weights was indeed missing.
Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized: ['video_long_self_attention.in_proj_bias', 'video_long_self_attention.bias_v', 'video_long_self_attention.in_proj_weight', 'video_long_self_attention.out_proj.bias', 'video_long_self_attention.bias_k', 'video_long_self_attention.out_proj.weight']

How to get the whisper, clip, and llama model used by macaw?

I used the following code to get the pretrained models:

from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")
clip_model.save_pretrained('trained_models/clip_model/')
whisper_model.save_pretrained('trained_models/whisper_model/')
llama7b_model.save_pretrained('trained_models/llama7b_model/')

Is this correct?

GPU Memory Requirement

Thank you for your awesome work! I would like to know at least how much GPU memory is needed to run this project. Can it run on 2×3090 GPUs?
