
🌍 PALO: A Polyglot Large Multimodal Model for 5B People


* Equally contributing first authors

Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu

Demo | Paper | Dataset


📢 Latest Updates

  • Mar-25-24: PALO training and evaluation code and pretrained checkpoints are released. 🔥🔥
  • Mar-03-24: PALO multi-lingual evaluation dataset is released. Check it out at MBZUAI/multilingual-llava-bench-in-the-wild. 🔥🔥
  • Feb-27-24: PALO multi-lingual training dataset is released. Check it out at MBZUAI/palo_multilingual_dataset. 🔥🔥
  • Feb-23-24: PALO paper and online demo are released. Code, pretrained models and training/evaluation scripts are coming soon!

Overview

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population).

(Figure: PALO results)

πŸ† Contributions

  1. We develop Palo: the first multilingual Large Multimodal Model (LMM), capable of generating responses in 10 languages.
  2. We created an extensive multilingual instruction-tuning dataset (~2.1M instructions) by translating LLaVA-Instruct-150K.
  3. We train models across three distinct scales i.e., 1.7B, 7B, and 13B parameters to demonstrate the scalability of our training pipeline. The models demonstrate good performance on low-resource languages, e.g., Hindi, Arabic, Bengali, and Urdu, without compromising its high-performance on high-resource languages e.g., English, Chinese, French, and Spanish.

📂 PALO Multi-Lingual Dataset Access

We develop a diverse instruction set (~2.1M instructions) comprising conversations in ten languages. Specifically, 665K instructions from LLaVA-Instruct-665K are used for English, and approximately 150K conversations from LLaVA-Instruct-150K are translated into Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu using our proposed semi-automated translation pipeline.

📥 Download the Training Dataset: Access our multi-lingual dataset on Hugging Face: MBZUAI/palo_multilingual_dataset.
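The annotation file appears to follow the LLaVA-Instruct conversation format, given the LLaVA-derived training pipeline; the exact schema of palo_multilingual_dataset.json and the field names below are an assumption. A minimal sketch of what one record looks like and how a dataloader might read it:

```python
import json, os, tempfile

# Illustrative record in the LLaVA-style conversation format (assumption:
# the real palo_multilingual_dataset.json schema may differ).
sample = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
        {"from": "gpt", "value": "A red bus driving down a city street."},
    ],
}

# Round-trip through a JSON file the way a dataloader would read it.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    print(rec["image"], len(rec["conversations"]))
```

The `image` field is a path relative to the data root, which is why the directory layout described in the Training section below matters.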

We also develop a multi-lingual evaluation set to conduct a comprehensive evaluation across various languages. This set is constructed by translating the LLaVA-Bench into all target languages using GPT-4-Turbo, with particular attention to preserving linguistic authenticity and mitigating common issues of automated translations through careful human correction.

📥 Download the Evaluation Dataset: Access our multi-lingual evaluation dataset on Hugging Face: MBZUAI/multilingual-llava-bench-in-the-wild.

🧠 Model Zoo

| Model Name      | HuggingFace Link       |
|-----------------|------------------------|
| MobilePALO-1.7B | MBZUAI/MobilePALO-1.7B |
| PALO-7B         | MBZUAI/PALO-7B         |
| PALO-13B        | MBZUAI/PALO-13B        |

🔧 Installation

We recommend setting up a conda environment for the project:

conda create --name=palo python=3.10
conda activate palo

git clone https://github.com/mbzuai-oryx/PALO
cd PALO

pip install -r requirements.txt
pip install flash-attn==2.3.2

export PYTHONPATH="./:$PYTHONPATH"

💿 Running Demo Offline

Please follow the instructions below to run the PALO demo on your local GPU machine.

1. Launch a controller

python palo/serve/controller.py --host 0.0.0.0 --port 10000

2. Launch a Gradio web server

python palo/serve/gradio_web_server.py --controller http://localhost:10000 --model-list-mode reload

3. Launch a model worker

python palo/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/PALO-13B

You can launch as many workers as you want and compare different model checkpoints in the same Gradio interface. Keep --controller the same, and use a different --port and --worker for each worker.
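For example, the worker command can be templated over a port so that each extra checkpoint gets a unique --port/--worker pair. This sketch only prints the commands it would run (remove the echo to actually launch; model paths are the ones from the Model Zoo):

```shell
#!/bin/sh
# Build (and here, just print) a model_worker launch command for one
# checkpoint. --controller stays fixed; --port/--worker must be unique.
worker_cmd() {
    port="$1"
    model="$2"
    echo "python palo/serve/model_worker.py --host 0.0.0.0" \
         "--controller http://localhost:10000" \
         "--port $port --worker http://localhost:$port" \
         "--model-path $model"
}

worker_cmd 40000 MBZUAI/PALO-13B
worker_cmd 40001 MBZUAI/PALO-7B
```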

🚋 Training

1. Prepare data

Please download the annotations from MBZUAI/palo_multilingual_dataset and all images from the links below.

After downloading all of them, organize the data as follows in ./playground/data,

data
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
├── vg
│   ├── VG_100K
│   └── VG_100K_2
└── palo_multilingual_dataset
    └── palo_multilingual_dataset.json

Please note that all images should be in the .jpg format.
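A quick way to catch a misplaced folder before starting a long training run is to check the layout against the tree above. This is a small hypothetical helper, not part of the PALO codebase; the data root is assumed to be ./playground/data:

```python
from pathlib import Path

# Expected sub-paths under the data root, taken from the tree above.
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
    "palo_multilingual_dataset/palo_multilingual_dataset.json",
]

def missing_paths(root):
    """Return every expected sub-path that does not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    print(missing_paths("./playground/data"))
```

An empty return value means the layout matches; otherwise the listed folders still need to be downloaded or moved.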

2. Download Pretrained Projection Weights

| Model Name      | Projector Weights                                          |
|-----------------|------------------------------------------------------------|
| MobilePALO-1.7B | MBZUAI/palo_1.7B_stage1_mm_projector                       |
| PALO-7B         | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5  |
| PALO-13B        | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 |

3. Run Training

# For MobilePALO-1.7B
bash scripts/train/finetune_palo.sh "mtgv/MobileLLaMA-1.4B-Chat" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to palo_1.7B_stage1_mm_projector.bin> "ldpnet" "results/PALO-1.7B" "2" "2e-5"

# For PALO-7B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-7b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5.bin> "mlp2x_gelu" "results/PALO-7B" "3" "2e-4"

# For PALO-13B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-13b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5.bin> "mlp2x_gelu" "results/PALO-13B" "3" "2e-4"

📊 Quantitative Evaluation

Please download the PALO multi-lingual evaluation data from MBZUAI/multilingual-llava-bench-in-the-wild and arrange it as follows:

data
└── multilingual-llava-bench-in-the-wild
    ├── arabic
    │   ├── question.jsonl
    │   ├── answers.jsonl
    │   └── context.jsonl
    ├── bengali
    │   ├── question.jsonl
    │   ├── answers.jsonl
    │   └── context.jsonl
    ...

Use the following script to perform the evaluation:

bash scripts/eval/eval_all_languages.sh <path to the trained model> <Output file name> <OpenAI API Key>
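eval_all_languages.sh uses GPT-4 as a judge, and LLaVA-Bench-style reports express the model's judged score relative to GPT-4's score on the same questions. A minimal sketch of that arithmetic (an assumption based on the LLaVA-Bench protocol; the exact scoring code in palo/eval may differ):

```python
def relative_score(model_score, gpt4_score):
    """LLaVA-Bench-style relative score: model vs. the GPT-4 reference, in %."""
    return round(model_score / gpt4_score * 100, 1)

# Illustrative values only (not official PALO numbers):
print(relative_score(46.0, 85.2))  # -> 54.0
```

This is why a summary row like "all 54.0 85.2 46.0" pairs a relative score (54.0) with a GPT-4 score (85.2) and a model score (46.0).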

(Figure: quantitative evaluation results)

📚 Qualitative Examples of Multilingual Capabilities

(Figures: qualitative multilingual examples)

📜 Citation

@article{PALO2024,
  title={Palo: A Large Multilingual Multimodal Language Model},
  author={Maaz, Muhammad and Rasheed, Hanoona and Shaker, Abdelrahman and Khan, Salman and Cholakal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
  journal={arXiv:2402.14818},
  year={2024},
  url={https://arxiv.org/abs/2402.14818}
}

Contributors

mmaaz60, ival-mbzuai, hanoonar


Issues

Dataset release

Hi,
Will you be releasing the dataset? I am especially looking for the Bengali one.

Plan of open-sourcing models and datasets

Dear authors,
Thanks for your great work! I am very interested in the multilingual abilities of LMMs. Do you have any plans to release the dataset and checkpoints from the paper? They would be really helpful to me!

loading pretrained models using transformers library

Hi,

I'm really excited to try your pretrained models, but it seems they haven't been integrated into the transformers library yet.
I tried loading MBZUAI/PALO-7B using both the "image-to-text" pipeline and AutoModel with the latest version of transformers (v4.39.3) and got this error:
The checkpoint you are trying to load has model type palo but Transformers does not recognize this architecture

Am I missing something?
Thanks!

About evaluation result

Hi @mmaaz60,
Thanks for your great work and open sourcing!
I am trying to evaluate PALO-7B (loaded from transformers) on the multilingual-llava-in-the-wild, but I find the performance is much lower than the reported numbers. Here are the results I got:

| Model                | English | Chinese |
|----------------------|---------|---------|
| PALO-7B (paper)      | 64.2    | 55.7    |
| PALO-7B (my results) | 54.0    | 43.0    |

Here are the generated content files:
PALO-7B_English_content.json
PALO-7B_Chinese_content.json

Here are the evaluation files with scores:
PALO-7B_English.json
PALO-7B_Chinese.json

Summaries produced by palo/eval/summarize_gpt_review.py

PALO-7B_English
all 54.0 85.2 46.0
llava_bench_complex 62.8 82.5 51.8
llava_bench_conv 52.4 86.5 45.3
llava_bench_detail 40.6 88.7 36.0

PALO-7B_Chinese
all 43.0 86.0 37.0
llava_bench_complex 55.6 82.9 46.1
llava_bench_conv 27.2 88.8 24.1
llava_bench_detail 39.1 88.7 34.7

Is there a significant discrepancy between the content I generated and yours, or are there issues in the evaluation? Do you have any idea about this, or could you share the generated result files with me?
