
🌍 PALO: A Polyglot Large Multimodal Model for 5B People


* Equally contributing first authors

Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu

Demo | Paper | Dataset


📢 Latest Updates

  • Mar-25-24: PALO training and evaluation code and pretrained checkpoints are released. 🔥🔥
  • Mar-03-24: PALO multi-lingual evaluation dataset is released. Check it out at MBZUAI/multilingual-llava-bench-in-the-wild. 🔥🔥
  • Feb-27-24: PALO multi-lingual training dataset is released. Check it out at MBZUAI/palo_multilingual_dataset. 🔥🔥
  • Feb-23-24: PALO paper and online demo are released. Code, pretrained models and training/evaluation scripts are coming soon!

Overview

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population).

(Figure: PALO results)

πŸ† Contributions

  1. We develop Palo: the first multilingual Large Multimodal Model (LMM), capable of generating responses in 10 languages.
  2. We created an extensive multilingual instruction-tuning dataset (~2.1M instructions) by translating LLaVA-Instruct-150K.
  3. We train models across three distinct scales i.e., 1.7B, 7B, and 13B parameters to demonstrate the scalability of our training pipeline. The models demonstrate good performance on low-resource languages, e.g., Hindi, Arabic, Bengali, and Urdu, without compromising its high-performance on high-resource languages e.g., English, Chinese, French, and Spanish.

📂 PALO Multi-Lingual Dataset Access

We develop a diverse instruction set (~2.1M instructions) comprising conversations in ten languages. Specifically, 665K instructions from LLaVA-Instruct-665K are used for English, and approximately 150K conversations from LLaVA-Instruct-150K are translated into Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu using our proposed semi-automated translation pipeline.

📥 Download the Training Dataset: Access our multi-lingual dataset on Hugging Face: MBZUAI/palo_multilingual_dataset.
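The annotation file appears to follow the LLaVA-Instruct conversation format, given the LLaVA-derived training pipeline; the exact schema of palo_multilingual_dataset.json and the field names below are an assumption. A minimal sketch of what one record looks like and how a dataloader might read it:

```python
import json, os, tempfile

# Illustrative record in the LLaVA-style conversation format (assumption:
# the real palo_multilingual_dataset.json schema may differ).
sample = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A red bus driving down a city street."},
    ],
}

# Round-trip through a JSON file the way a dataloader would read it.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    print(rec["image"], len(rec["conversations"]))
```

The `image` field is a path relative to the data root, which is why the directory layout described in the Training section below matters.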

We also develop a multi-lingual evaluation set to conduct a comprehensive evaluation across various languages. This set is constructed by translating the LLaVA-Bench into all target languages using GPT-4-Turbo, with particular attention to preserving linguistic authenticity and mitigating common issues of automated translations through careful human correction.

📥 Download the Evaluation Dataset: Access our multi-lingual evaluation dataset on Hugging Face: MBZUAI/multilingual-llava-bench-in-the-wild.

🧠 Model Zoo

| Model Name      | HuggingFace Link       |
|-----------------|------------------------|
| MobilePALO-1.7B | MBZUAI/MobilePALO-1.7B |
| PALO-7B         | MBZUAI/PALO-7B         |
| PALO-13B        | MBZUAI/PALO-13B        |

🔧 Installation

We recommend setting up a conda environment for the project:

conda create --name=palo python=3.10
conda activate palo

git clone https://github.com/mbzuai-oryx/PALO
cd PALO

pip install -r requirements.txt
pip install flash-attn==2.3.2

export PYTHONPATH="./:$PYTHONPATH"

💿 Running Demo Offline

Please follow the instructions below to run the PALO demo on your local GPU machine.

1. Launch a controller

python palo/serve/controller.py --host 0.0.0.0 --port 10000

2. Launch a Gradio web server

python palo/serve/gradio_web_server.py --controller http://localhost:10000 --model-list-mode reload

3. Launch a model worker

python palo/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/PALO-13B

You can launch as many workers as you want and compare different model checkpoints in the same Gradio interface. Keep --controller the same, and use a different --port and --worker for each worker.
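For example, the worker command can be templated over a port so that each extra checkpoint gets a unique --port/--worker pair. This sketch only prints the commands it would run (remove the echo to actually launch; model paths are the ones from the Model Zoo):

```shell
#!/bin/sh
# Build (and here, just print) a model_worker launch command for one
# checkpoint. --controller stays fixed; --port/--worker must be unique.
worker_cmd() {
    port="$1"
    model="$2"
    echo "python palo/serve/model_worker.py --host 0.0.0.0" \
         "--controller http://localhost:10000" \
         "--port $port --worker http://localhost:$port" \
         "--model-path $model"
}

worker_cmd 40000 MBZUAI/PALO-13B
worker_cmd 40001 MBZUAI/PALO-7B
```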

🚋 Training

1. Prepare data

Please download the annotations from MBZUAI/palo_multilingual_dataset and all images from the links below.

After downloading all of them, organize the data as follows in ./playground/data,

data
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── palo_multilingual_dataset
    └── palo_multilingual_dataset.json

Please note that all images should be in the .jpg format.
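A quick way to catch a misplaced folder before starting a long training run is to check the layout against the tree above. This is a small hypothetical helper, not part of the PALO codebase; the data root is assumed to be ./playground/data:

```python
from pathlib import Path

# Expected sub-paths under the data root, taken from the tree above.
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "palo_multilingual_dataset/palo_multilingual_dataset.json",
]

def missing_paths(root):
    """Return every expected sub-path that does not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    print(missing_paths("./playground/data"))
```

An empty return value means the layout matches; otherwise the listed folders still need to be downloaded or moved.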

2. Download Pretrained Projection Weights

| Model Name      | Projector Weights                                          |
|-----------------|------------------------------------------------------------|
| MobilePALO-1.7B | MBZUAI/palo_1.7B_stage1_mm_projector                       |
| PALO-7B         | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5  |
| PALO-13B        | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 |

3. Run Training

# For MobilePALO-1.7B
bash scripts/train/finetune_palo.sh "mtgv/MobileLLaMA-1.4B-Chat" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to palo_1.7B_stage1_mm_projector.bin> "ldpnet" "results/PALO-1.7B" "2" "2e-5"

# For PALO-7B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-7b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5.bin> "mlp2x_gelu" "results/PALO-7B" "3" "2e-4"

# For PALO-13B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-13b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5.bin> "mlp2x_gelu" "results/PALO-13B" "3" "2e-4"

📊 Quantitative Evaluation

Please download the PALO multi-lingual evaluation data from MBZUAI/multilingual-llava-bench-in-the-wild and arrange it as follows:

data
└── multilingual-llava-bench-in-the-wild
    ├── arabic
    │   ├── question.jsonl
    │   ├── answers.jsonl
    │   └── context.jsonl
    ├── bengali
    │   ├── question.jsonl
    │   ├── answers.jsonl
    │   └── context.jsonl
    ...

Use the following script to perform the evaluation:

bash scripts/eval/eval_all_languages.sh <path to the trained model> <Output file name> <OpenAI API Key>
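eval_all_languages.sh uses GPT-4 as a judge, and LLaVA-Bench-style reports express the model's judged score relative to GPT-4's score on the same questions. A minimal sketch of that arithmetic (an assumption based on the LLaVA-Bench protocol; the exact scoring code in palo/eval may differ):

```python
def relative_score(model_score, gpt4_score):
    """LLaVA-Bench-style relative score: model vs. the GPT-4 reference, in %."""
    return round(model_score / gpt4_score * 100, 1)

# Illustrative values only (not official PALO numbers):
print(relative_score(46.0, 85.2))  # -> 54.0
```

This is why a summary row like "all 54.0 85.2 46.0" pairs a relative score (54.0) with a GPT-4 score (85.2) and a model score (46.0).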

(Figure: quantitative evaluation results)

📚 Qualitative Examples of Multilingual Capabilities

(Figures: qualitative multilingual examples)

📜 Citation

@article{PALO2024,
  title={Palo: A Large Multilingual Multimodal Language Model},
  author={Maaz, Muhammad and Rasheed, Hanoona and Shaker, Abdelrahman and Khan, Salman and Cholakal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
  journal={arXiv:2402.14818},
  year={2024},
  url={https://arxiv.org/abs/2402.14818}
}

Contributors

mmaaz60, ival-mbzuai, hanoonar


Issues

Dataset release

Hi,
Will you be releasing the dataset? I am especially looking for the Bengali one.

Plan of open-sourcing models and datasets

Dear authors,
Thanks for your great work! I am very interested in the multilingual abilities of LMMs. Do you have any plans to release the dataset and checkpoints from the paper? They would be really helpful to me!

loading pretrained models using transformers library

Hi,

I'm really excited to try your pretrained models, but it seems they haven't been integrated into the transformers library yet.
I tried loading MBZUAI/PALO-7B using both the "image-to-text" pipeline and AutoModel with the latest version of transformers (v4.39.3) and got this error:
The checkpoint you are trying to load has model type palo but Transformers does not recognize this architecture

Am I missing something?
Thanks!

About evaluation result

Hi @mmaaz60,
Thanks for your great work and open sourcing!
I am trying to evaluate PALO-7B (loaded from transformers) on the multilingual-llava-in-the-wild, but I find the performance is much lower than the reported numbers. Here are the results I got:

| Model                | English | Chinese |
|----------------------|---------|---------|
| PALO-7B (paper)      | 64.2    | 55.7    |
| PALO-7B (my results) | 54.0    | 43.0    |

Here are the generated content files:
PALO-7B_English_content.json
PALO-7B_Chinese_content.json

Here are the evaluation files with scores:
PALO-7B_English.json
PALO-7B_Chinese.json

Summaries produced by palo/eval/summarize_gpt_review.py

PALO-7B_English
all 54.0 85.2 46.0
llava_bench_complex 62.8 82.5 51.8
llava_bench_conv 52.4 86.5 45.3
llava_bench_detail 40.6 88.7 36.0

PALO-7B_Chinese
all 43.0 86.0 37.0
llava_bench_complex 55.6 82.9 46.1
llava_bench_conv 27.2 88.8 24.1
llava_bench_detail 39.1 88.7 34.7

Is there a significant discrepancy between the content I generated and yours, or are there issues in the evaluation? Do you have any idea about this, or could you share the generated result files with me?
