
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration 🌐🖼️📹🎵📝


¹ Chenyang Lyu, ² Bingshuai Liu, ³ Minghao Wu, ⁴ Zefeng Du,

⁵ Xinting Huang, ⁵ Zhaopeng Tu, ⁵ Shuming Shi, ⁵ Longyue Wang

¹ Dublin City University, ² Xiamen University, ³ Monash University, ⁴ University of Macau, ⁵ Tencent AI Lab

Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image, video, audio, and text data, built upon the foundations of CLIP, Whisper, and LLaMA.

Table of Contents 📚

Introduction 📖


In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, videos, audio, and text, remains a challenging task. Macaw-LLM is an early model of its kind, bringing together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.

Key Features 🔑

Macaw-LLM boasts the following unique features:

  1. Simple & Fast Alignment: Macaw-LLM enables seamless integration of multi-modal data through simple and fast alignment to LLM embeddings. This efficient process ensures quick adaptation of diverse data types.
  2. One-Stage Instruction Fine-Tuning: Our model streamlines the adaptation process through one-stage instruction fine-tuning, promoting a more efficient learning experience.

Architecture 🔧

Macaw-LLM is composed of three main components:

  1. CLIP: Responsible for encoding images and video frames.
  2. Whisper: Responsible for encoding audio data.
  3. LLM (LLaMA/Vicuna/BLOOM): The language model that encodes instructions and generates responses.

The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.

Alignment Strategy 📏

Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves:

  1. Encoding multi-modal features with CLIP and Whisper.
  2. Feeding the encoded features into an attention function, wherein the multi-modal features serve as the query and the embedding matrix of LLaMA as the key and value.
  3. Injecting the outputs into the input sequence (before instruction tokens) of LLaMA, allowing for a streamlined alignment process with minimal additional parameters.
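The attention step above can be sketched as follows. This is a minimal NumPy sketch, not the repository's actual implementation: the function name is illustrative, and it assumes the multi-modal features have already been projected to the embedding dimension (the real model uses learned projections).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_to_embedding_space(mm_feats, embed_matrix):
    """Attention with multi-modal features as the query and the LLM's
    token-embedding matrix as both key and value.

    mm_feats:     (m, d) encoded CLIP/Whisper features, assumed already
                  projected to the embedding dimension d
    embed_matrix: (vocab, d) LLaMA token-embedding matrix
    returns:      (m, d) aligned features, each a convex combination of
                  token embeddings, ready to prepend to the input sequence
    """
    d = embed_matrix.shape[-1]
    scores = mm_feats @ embed_matrix.T / np.sqrt(d)  # (m, vocab) similarities
    weights = softmax(scores, axis=-1)               # soft selection over the vocabulary
    return weights @ embed_matrix                    # map back into embedding space
```

Because each output row is a convex combination of token embeddings, the aligned features live in the same space as the instruction tokens they precede, which is what allows alignment with minimal additional parameters.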

Installation 💻

To install Macaw-LLM, follow these steps:

# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git

# Change to the Macaw-LLM directory
cd Macaw-LLM

# Install required packages
pip install -r requirements.txt

# Install ffmpeg (yum shown here; use apt-get or your distro's package manager as appropriate)
yum install ffmpeg -y

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
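After the steps above, a quick sanity check can confirm the prerequisites are visible to Python. This is a small helper sketch, not part of the repository; the default binary and module names are assumptions you can adjust.

```python
import importlib.util
import shutil

def missing_prerequisites(binaries=("ffmpeg",), modules=("torch", "apex")):
    """Return the names of required command-line binaries and Python
    modules that are not available in the current environment."""
    missing = [b for b in binaries if shutil.which(b) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing

# An empty list means everything was found
print(missing_prerequisites())
```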

Usage 🚀

  1. Downloading dataset:

  2. Dataset preprocessing:

    • Place the data for the three modalities in their respective folders: data/text/, data/image/, data/video/
    • Extract frames and audio from videos:
      python preprocess_data.py
      
    • Transform supervised data to dataset:
      python preprocess_data_supervised.py
      
    • Transform unsupervised data to dataset:
      python preprocess_data_unsupervised.py
      
  3. Training:

    • Execute the training script (you can specify the training parameters inside):
      ./train.sh
      
  4. Inference:

    • Execute the inference script (you can give any customized inputs inside):
      ./inference.sh
      

Future Work and Contributions 🚀

While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.

We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝

ToDo 👨‍💻

  • More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5, enabling more robust and versatile processing and understanding of multi-modal data.

  • Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.

Citation

@misc{Macaw-LLM,
  author = {Chenyang Lyu and Bingshuai Liu and Minghao Wu and Zefeng Du and Longyue Wang},
  title = {Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/lyuchenyang/Macaw-LLM}},
}
