
๐ŸŒ PALO: A Polyglot Large Multimodal Model for 5B People



Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu



📢 Latest Updates

  • Mar-25-24: PALO training and evaluation code and pretrained checkpoints are released. 🔥🔥
  • Mar-03-24: The PALO multi-lingual evaluation dataset is released. Check it out at MBZUAI/multilingual-llava-bench-in-the-wild. 🔥🔥
  • Feb-27-24: The PALO multi-lingual training dataset is released. Check it out at MBZUAI/palo_multilingual_dataset. 🔥🔥
  • Feb-23-24: The PALO paper and online demo are released. Code, pretrained models, and training/evaluation scripts are coming soon!

Overview

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population).

Palo Results

๐Ÿ† Contributions

  1. We develop PALO, the first multilingual Large Multimodal Model (LMM), capable of generating responses in 10 languages.
  2. We create an extensive multilingual instruction-tuning dataset (~2.1M instructions) by translating LLaVA-Instruct-150K.
  3. We train models at three distinct scales (1.7B, 7B, and 13B parameters) to demonstrate the scalability of our training pipeline. The models achieve good performance on low-resource languages, e.g., Hindi, Arabic, Bengali, and Urdu, without compromising their performance on high-resource languages, e.g., English, Chinese, French, and Spanish.

📂 PALO Multi-Lingual Dataset Access

We develop a diverse instruction set (~2.1M instructions) comprising conversations in ten languages. Specifically, 665K instructions from LLaVA-Instruct-665K are used for English, and approximately 150K conversations from LLaVA-Instruct-150K are translated into Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali, and Urdu using our proposed semi-automated translation pipeline.
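For illustration, the automated step of such a pipeline might look like the sketch below. This is a minimal sketch only: it assumes an OpenAI chat model as the translator, the prompt wording is illustrative, and the human-correction loop that makes the pipeline semi-automated is not shown; the authors' exact setup may differ.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def translate_instruction(text: str, target_language: str) -> str:
    # Translate one conversation turn; the prompt here is illustrative.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": f"Translate the following text into {target_language}. "
                        "Preserve the meaning and formatting; output only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# e.g. translate_instruction("What is unusual about this image?", "Urdu")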

📥 Download the Training Dataset: Access our multi-lingual dataset on Hugging Face: MBZUAI/palo_multilingual_dataset.
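If you use the Hugging Face Hub client, the annotations can also be fetched programmatically; a minimal sketch (the local path is arbitrary):

from huggingface_hub import snapshot_download

# Download the annotation files for the training set.
snapshot_download(
    repo_id="MBZUAI/palo_multilingual_dataset",
    repo_type="dataset",
    local_dir="data/palo_multilingual_dataset",
)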

We also develop a multi-lingual evaluation set to conduct a comprehensive evaluation across various languages. This set is constructed by translating the LLaVA-Bench into all target languages using GPT-4-Turbo, with particular attention to preserving linguistic authenticity and mitigating common issues of automated translations through careful human correction.

📥 Download the Evaluation Dataset: Access our multi-lingual evaluation dataset on Hugging Face: MBZUAI/multilingual-llava-bench-in-the-wild.

🧠 Model Zoo

Model Name      | HuggingFace Link
--------------- | ----------------
MobilePALO-1.7B | MBZUAI/MobilePALO-1.7B
PALO-7B         | MBZUAI/PALO-7B
PALO-13B        | MBZUAI/PALO-13B
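PALO follows the LLaVA codebase, so the checkpoints should load through a LLaVA-style builder. The import path and signature below mirror LLaVA's llava.model.builder and are an assumption about this repository; verify against the code under palo/model/:

# Assumption: PALO mirrors LLaVA's builder API (palo/model/builder.py).
from palo.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="MBZUAI/PALO-7B",  # any entry from the table above
    model_base=None,
    model_name="PALO-7B",
)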

🔧 Installation

We recommend setting up a conda environment for the project:

conda create --name=palo python=3.10
conda activate palo

git clone https://github.com/mbzuai-oryx/PALO
cd PALO

pip install -r requirements.txt
pip install flash-attn==2.3.2

export PYTHONPATH="./:$PYTHONPATH"

💿 Running Demo Offline

Please follow the instructions below to run the PALO demo on your local GPU machine.

1. Launch a controller

python palo/serve/controller.py --host 0.0.0.0 --port 10000

2. Launch a Gradio web server

python palo/serve/gradio_web_server.py --controller http://localhost:10000 --model-list-mode reload

3. Launch a model worker

python palo/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/PALO-13B

You can launch as many workers as you want and compare different model checkpoints in the same Gradio interface. Keep the --controller address the same, and change --port and --worker to a different port number for each worker.

🚋 Training

1. Prepare data

Please download the annotations from MBZUAI/palo_multilingual_dataset and the images from their respective sources (COCO train2017, GQA, OCR-VQA, TextVQA, and Visual Genome).

After downloading all of them, organize the data as follows in ./playground/data:

data
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── palo_multilingual_dataset
        └── palo_multilingual_dataset.json

Please note that all images should be in the .jpg format.
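A quick sanity check over the layout above can flag offending files; a minimal sketch (directory names taken from the tree above):

from pathlib import Path

# Flag any image file that is not .jpg under the expected directories.
data_root = Path("./playground/data")
image_dirs = ["coco/train2017", "gqa/images", "ocr_vqa/images",
              "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2"]
for d in image_dirs:
    for p in (data_root / d).rglob("*"):
        if p.is_file() and p.suffix.lower() != ".jpg":
            print("non-jpg file:", p)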

2. Download Pretrained Projection Weights

Model Name      | Projector Weights
--------------- | -----------------
MobilePALO-1.7B | MBZUAI/palo_1.7B_stage1_mm_projector
PALO-7B         | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
PALO-13B        | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5
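The projector weights can be fetched from the Hub before training; a minimal sketch, assuming the projector file is named mm_projector.bin as in LLaVA's pretrain repositories (verify the filename on the model page):

from huggingface_hub import hf_hub_download

# Assumption: the repo stores the projector as mm_projector.bin.
projector_path = hf_hub_download(
    repo_id="liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5",
    filename="mm_projector.bin",
)
print(projector_path)  # pass this path to the training script below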

3. Run Training

# For MobilePALO-1.7B
bash scripts/train/finetune_palo.sh "mtgv/MobileLLaMA-1.4B-Chat" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to palo_1.7B_stage1_mm_projector.bin> "ldpnet" "results/PALO-1.7B" "2" "2e-5"

# For PALO-7B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-7b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5.bin> "mlp2x_gelu" "results/PALO-7B" "3" "2e-4"

# For PALO-13B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-13b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5.bin> "mlp2x_gelu" "results/PALO-13B" "3" "2e-4"

📊 Quantitative Evaluation

Please download the PALO multi-lingual evaluation data from MBZUAI/multilingual-llava-bench-in-the-wild and arrange it as follows:

data
    └── multilingual-llava-bench-in-the-wild
        ├── arabic
        │   ├── question.jsonl
        │   ├── answers.jsonl
        │   └── context.jsonl
        ├── bengali
        │   ├── question.jsonl
        │   ├── answers.jsonl
        │   └── context.jsonl
        ...
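Each language split is plain JSONL and can be inspected directly; an illustrative peek (record fields depend on the dataset):

import json
from pathlib import Path

# Load one language split of the evaluation set and print a sample record.
split = Path("data/multilingual-llava-bench-in-the-wild/arabic")
with open(split / "question.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f]
print(len(questions), "questions; first record:", questions[0])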

Use the following script to run the evaluation:

bash scripts/eval/eval_all_languages.sh <path to the trained model> <Output file name> <OpenAI API Key>

Palo Results

📚 Qualitative Examples of Multilingual Capabilities

Palo Sample

Palo Sample

📜 Citation

@article{PALO2024,
  title={Palo: A Large Multilingual Multimodal Language Model},
  author={Maaz, Muhammad and Rasheed, Hanoona and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
  journal={arXiv preprint arXiv:2402.14818},
  year={2024},
  url={https://arxiv.org/abs/2402.14818}
}
