
Obsidian: Multimodal LLM for Everyone


Obsidian is a joint work between Nous Research and Virtual Interactive. Special thanks to LDJ and qnguyen3 for making this work possible.

Easiest way to try it out: Colab. After opening the Gradio interface, give the model about two minutes to load, then refresh the page.

Usage

  1. Install Obsidian
  • Clone this project and navigate to the Obsidian folder
git clone https://github.com/NousResearch/Obsidian.git
cd Obsidian
  • Download the multimodal projector from Huggingface
sh script/download_mm_projector.sh
  • Install packages
conda create -n obsidian python=3.10 -y
conda activate obsidian
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  • Install additional packages required for training
pip install ninja
pip install flash-attn --no-build-isolation
  • Install the latest version of transformers
pip install --upgrade transformers
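
Before moving on, it can be worth confirming that the environment imports cleanly. The snippet below is a minimal sanity check; it only assumes the packages installed above (flash-attn is optional unless you plan to train).

# sanity_check.py: verify that the installed packages import cleanly
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)

try:
    import flash_attn  # only needed for training
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (fine for inference-only use)")

import llava  # installed by `pip install -e .`
print("llava package imported OK")
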
  2. Run the Demo UI

Launch a controller

python -m llava.serve.controller --host 0.0.0.0 --port 10000

Launch a Gradio web server.

python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload

You have just launched the Gradio web interface; open it via the URL printed on the screen. You may notice that the model list is empty. Do not worry, as no model worker has been launched yet; the list updates automatically when you launch one.
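
If you want to check what the controller sees without opening the UI, a small script can query it directly. This is a sketch that assumes the controller keeps LLaVA's /refresh_all_workers and /list_models POST endpoints; adjust if this fork changes them.

# list_models.py: ask the controller which model workers are registered
# (assumes LLaVA's /refresh_all_workers and /list_models endpoints are unchanged)
import requests

CONTROLLER = "http://localhost:10000"

requests.post(CONTROLLER + "/refresh_all_workers")   # ask workers to re-register
resp = requests.post(CONTROLLER + "/list_models")    # returns {"models": [...]}
print("Registered models:", resp.json().get("models", []))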

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path NousResearch/Obsidian-3B-V0.5

Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
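
If you prefer to skip the controller/worker stack and run a single query in-process, the codebase follows LLaVA's layout, so something along these lines should work. Treat it as a sketch: eval_model and its argument names are inherited from LLaVA and may differ slightly in this fork, and the image path is a placeholder.

# run_obsidian.py: sketch of direct, server-free inference, mirroring LLaVA's quick start
# (assumes this fork keeps LLaVA's eval_model helper and argument names)
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "NousResearch/Obsidian-3B-V0.5"

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "Describe this image.",
    "conv_mode": None,           # let the loader pick a default conversation template
    "image_file": "view.jpg",    # placeholder: path or URL of a local test image
    "sep": ",",
    "temperature": 0.2,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)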

Training

1. Pretraining

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.

Pretraining takes around 2.5 hours for Obsidian-3B-V0.5 on 4x A100 (80GB), with the vision module at 336px.

Training script with DeepSpeed ZeRO-2: pretrain.sh.

  • --mm_projector_type mlp2x_gelu: the two-layer MLP vision-language connector (sketched below).
  • --vision_tower openai/clip-vit-large-patch14-336: CLIP ViT-L/14 336px.
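
For reference, an mlp2x_gelu connector in LLaVA-style code is just two linear layers with a GELU in between, projecting CLIP image features into the language model's hidden size. A minimal PyTorch sketch (the 2560 output width is illustrative, not necessarily Obsidian-3B's actual hidden size):

import torch
import torch.nn as nn

# Sketch of the mlp2x_gelu vision-language connector: Linear -> GELU -> Linear,
# mapping CLIP ViT-L/14-336 patch features (1024-d) into the LLM hidden size.
class MLP2xGELUProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2560):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(image_features)

# A 336px image with 14px patches yields 24 x 24 = 576 patch features.
feats = torch.randn(1, 576, 1024)
print(MLP2xGELUProjector()(feats).shape)  # torch.Size([1, 576, 2560])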

2. Instruction Finetuning

Please download the annotation file for the final mixture of our instruction tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets (COCO train2017, GQA, OCR-VQA, TextVQA, and Visual Genome).

After downloading all of them, organize the data as follows under ./playground/data:

├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
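
Before launching finetuning, a quick check that the folders above are in place can save a failed run. A simple sketch that only verifies the directories listed above exist:

# check_data_layout.py: verify the instruction-tuning image folders are present
from pathlib import Path

ROOT = Path("./playground/data")
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

missing = [d for d in EXPECTED if not (ROOT / d).is_dir()]
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All expected image folders are present.")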

Evaluation

GPT-assisted Evaluation

Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.

  1. Generate LLaVA responses
python model_vqa.py \
    --model-path ./checkpoints/LLaVA-13B-v0 \
    --question-file \
    playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder \
    /path/to/coco2014_val \
    --answers-file \
    /path/to/answer-file-our.jsonl
  2. Evaluate the generated responses. In our case, answer-file-ref.jsonl contains the responses generated by text-only GPT-4 (0314), with the context captions/boxes provided.
OPENAI_API_KEY="sk-***********************************" python llava/eval/eval_gpt_review_visual.py \
    --question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --context llava/eval/table/caps_boxes_coco2014_val_80.jsonl \
    --answer-list \
    /path/to/answer-file-ref.jsonl \
    /path/to/answer-file-our.jsonl \
    --rule llava/eval/table/rule.json \
    --output /path/to/review.json
  3. Summarize the evaluation results
python summarize_gpt_review.py
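
As a rough illustration of what the summary computes: each GPT review assigns a pair of scores, one for the reference (GPT-4) answer and one for ours, and the headline number is the ratio of the two averages. The sketch below assumes each line of review.json stores that pair in a "tuple" field; the actual summarize_gpt_review.py may use different field names and also reports per-category breakdowns.

# summarize_sketch.py: illustration of the relative-score metric over GPT reviews
# (assumes each JSONL line has "tuple": [reference_score, our_score]; adjust to the real schema)
import json

ref_scores, our_scores = [], []
with open("review.json") as f:
    for line in f:
        review = json.loads(line)
        ref, ours = review["tuple"]
        ref_scores.append(ref)
        our_scores.append(ours)

relative = 100.0 * sum(our_scores) / sum(ref_scores)
print(f"Relative score vs. GPT-4 reference: {relative:.1f}%")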

ScienceQA

Please check out the documentation here.

Acknowledgement

  • Original paper and links: Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)
