
lisa's Introduction

Online Demo: Gradio | OpenXLab

LISA: Reasoning Segmentation via Large Language Model

LISA: Large Language Instructed Segmentation Assistant

Example inputs and outputs:

- Input: "Who was the president of the US in this image? Please output segmentation mask and explain the reason."
  Output: "Sure, the segmentation result is [SEG]. The President of the United States in the image is President Obama."
- Input: "Who was the president of the US in this image? Please output segmentation mask and explain why."
  Output: "Sure, [SEG]. In the image, the President of the United States is President Trump."
- Input: "What can make the woman stand higher? Please output segmentation mask and explain why."
  Output: "Sure, [SEG]. The woman is standing higher by using a ladder..."
- Input: "Can you segment the camera lens that is more suitable for photographing nearby objects in this image?"
  Output: "Sure, [SEG]."
- Input: "Can you segment the unusual part in this image and explain why."
  Output: "Sure, it is [SEG]. In the image, the unusual part is the dog wearing a reindeer antler headband..."
- Input: "Where to wash hands in this image? Please output segmentation mask."
  Output: "Sure, [SEG]."
- Input: "Can you segment the founder of Alibaba in this image and explain why?"
  Output: "Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is Jack Ma, the co-founder of Alibaba Group..."
- Input: "Please segment Lisa in this figure."
  Output: "Sure, [SEG]."

News

LISA: Reasoning Segmentation via Large Language Model [Paper]
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia

Abstract

In this work, we propose a new segmentation task --- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large-language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. For more details, please refer to the paper.

Highlights

LISA unlocks new segmentation capabilities for multi-modal LLMs, and can handle cases involving:

  1. complex reasoning;
  2. world knowledge;
  3. explanatory answers;
  4. multi-turn conversation.

LISA also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement.

Experimental results

Installation

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Training

Training Data Preparation

The training data consists of 4 types of data:

  1. Semantic segmentation datasets: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images

    Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part in PACO. COCO Images should be put into the dataset/coco/ directory.

  2. Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12)

    Note: the original download links for the refCOCO series are down, so we have updated them with new ones. If the download speed is very slow or unstable, we also provide a OneDrive link. You must still follow the rules required by the original datasets.

  3. Visual Question Answering dataset: LLaVA-Instruct-150k

  4. Reasoning segmentation dataset: ReasonSeg

Download them from the above links, and organize them as follows.

├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── reason_seg
│   │   └── ReasonSeg
│   │       ├── train
│   │       ├── val
│   │       └── explanatory
│   ├── refer_seg
│   │   ├── images
│   │   |   ├── saiapr_tc-12 
│   │   |   └── mscoco
│   │   |       └── images
│   │   |           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
│   └── vlpart
│       ├── paco
│       │   └── annotations
│       └── pascal_part
│           ├── train.json
│           └── VOCdevkit
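
Before launching training, it can help to sanity-check that the layout above is in place. Below is a minimal sketch of such a check (not part of the repository; the entries simply mirror the tree shown above):

```python
import os

# Expected entries under ./dataset, mirroring the tree above (illustrative check only).
EXPECTED = [
    "ade20k/annotations", "ade20k/images",
    "coco/train2017", "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary/config_v2.0.json",
    "reason_seg/ReasonSeg/train", "reason_seg/ReasonSeg/val",
    "refer_seg/images/saiapr_tc-12", "refer_seg/images/mscoco/images/train2014",
    "vlpart/paco/annotations", "vlpart/pascal_part/train.json",
]

def check_layout(dataset_dir: str = "./dataset") -> None:
    missing = [p for p in EXPECTED if not os.path.exists(os.path.join(dataset_dir, p))]
    if missing:
        print("Missing dataset entries:\n  " + "\n  ".join(missing))
    else:
        print("Dataset layout looks complete.")

if __name__ == "__main__":
    check_layout()
```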

Pre-trained weights

LLaVA

To train LISA-7B or 13B, you need to follow the instructions to merge the LLaVA delta weights. Typically, we use the final weights LLaVA-Lightning-7B-v1-1 and LLaVA-13B-v1-1, merged from liuhaotian/LLaVA-Lightning-7B-delta-v1-1 and liuhaotian/LLaVA-13b-delta-v1-1, respectively. For Llama2, we can directly use the LLaVA full weights liuhaotian/llava-llama-2-13b-chat-lightning-preview.

SAM ViT-H weights

Download SAM ViT-H pre-trained weights from the link.

Training

deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --dataset="sem_seg||refer_seg||vqa||reason_seg" \
  --sample_rates="9,3,3,1" \
  --exp_name="lisa-7b"

When training is finished, run the following to obtain the full model weights:

cd ./runs/lisa-7b/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin

Merge LoRA Weight

Merge the LoRA weights contained in pytorch_model.bin and save the resulting model to your desired path in the Hugging Face format:

CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
  --version="PATH_TO_LLaVA" \
  --weight="PATH_TO_pytorch_model.bin" \
  --save_path="PATH_TO_SAVED_MODEL"

For example:

CUDA_VISIBLE_DEVICES="" python3 merge_lora_weights_and_save_hf_model.py \
  --version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
  --weight="lisa-7b/pytorch_model.bin" \
  --save_path="./LISA-7B"

Validation

deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LISA_HF_Model_Directory" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --exp_name="lisa-7b" \
  --eval_only

Note: the v1 model is trained using both the train and val sets, so please use the v0 model to reproduce the validation results. (To use the v0 models, please first check out the legacy version of the repo with git checkout 0e26916.)

Inference

To chat with LISA-13B-llama2-v1 or LISA-13B-llama2-v1-explanatory: (Note that chat.py currently does not support the v0 models, i.e., LISA-13B-llama2-v0 and LISA-13B-llama2-v0-explanatory; to use the v0 models, please first check out the legacy version of the repo with git checkout 0e26916.)

CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1-explanatory'

To use the bf16 or fp16 data types for inference:

CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='bf16'

To use 8-bit or 4-bit quantization for inference (this enables running the 13B model on a single 24 GB or 12 GB GPU, at some cost to generation quality):

CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_8bit
CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1' --precision='fp16' --load_in_4bit

Hint: for the 13B model, 16-bit inference consumes 30 GB of VRAM on a single GPU, 8-bit inference 16 GB, and 4-bit inference 9 GB.

After that, input the text prompt and then the image path. For example,

- Please input your prompt: Where can the driver see the car speed in this image? Please output segmentation mask.
- Please input the image path: imgs/example1.jpg

- Please input your prompt: Can you segment the food that tastes spicy and hot?
- Please input the image path: imgs/example2.jpg

The model then outputs the text response together with the segmentation result.

Deployment

CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1' --load_in_4bit
CUDA_VISIBLE_DEVICES=0 python app.py --version='xinlai/LISA-13B-llama2-v1-explanatory' --load_in_4bit

By default, we use 4-bit quantization. Feel free to delete the --load_in_4bit argument for 16-bit inference, or replace it with --load_in_8bit for 8-bit inference.
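
For reference, the --load_in_8bit / --load_in_4bit flags correspond to the standard bitsandbytes quantization path in transformers. A generic sketch of 4-bit loading (not necessarily what chat.py or app.py do internally; the model path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit loading with bitsandbytes; "PATH_TO_MODEL" is a placeholder.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    "PATH_TO_MODEL",
    quantization_config=quant_config,
    device_map="auto",
)
```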

Dataset

In ReasonSeg, we have collected 1218 images (239 train, 200 val, and 779 test). The training and validation sets can be downloaded from this link.

Each image is provided with an annotation JSON file:

image_1.jpg, image_1.json
image_2.jpg, image_2.json
...
image_n.jpg, image_n.json

Important keys contained in JSON files:

- "text": text instructions.
- "is_sentence": whether the text instructions are long sentences.
- "shapes": target polygons.

The elements of "shapes" fall into two categories, namely "target" and "ignore". The former is used for evaluation, while the latter marks ambiguous regions and is therefore disregarded during evaluation.
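
For illustration, here is a hedged sketch of how these fields might be turned into masks, assuming each entry of "shapes" stores its polygon under a "points" key and its category under a "label" key (labelme-style); the reference implementation is the script below:

```python
import json
import numpy as np
import cv2

def load_reasonseg_annotation(json_path: str, height: int, width: int):
    """Illustrative parser for a ReasonSeg annotation (assumed keys, not the repo's script)."""
    with open(json_path) as f:
        ann = json.load(f)
    texts = ann["text"]               # one or more text instructions
    is_sentence = ann["is_sentence"]  # whether the instructions are long sentences
    target = np.zeros((height, width), dtype=np.uint8)
    ignore = np.zeros((height, width), dtype=np.uint8)
    for shape in ann["shapes"]:
        pts = np.asarray(shape["points"], dtype=np.int32)  # assumed polygon key
        if shape["label"] == "target":
            cv2.fillPoly(target, [pts], 1)   # region used for evaluation
        elif shape["label"] == "ignore":
            cv2.fillPoly(ignore, [pts], 1)   # ambiguous region, excluded from evaluation
    return texts, is_sentence, target, ignore
```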

We provide a script that demonstrates how to process the annotations:

python3 utils/data_processing.py

Besides, we leveraged GPT-3.5 to rephrase instructions, so images in the training set may have more than one instruction (but fewer than six) in the "text" field. During training, users may randomly select one of them as the text query, which yields a better model.
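
A one-line way to do that selection in a data loader (sketch only; the annotation path is hypothetical):

```python
import json
import random

# Illustrative: pick one of the rephrased instructions as the training query.
with open("image_1.json") as f:          # hypothetical annotation file
    texts = json.load(f)["text"]
query = random.choice(texts) if isinstance(texts, list) else texts
```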

Citation

If you find this project useful in your research, please consider citing:

@article{lai2023lisa,
  title={LISA: Reasoning Segmentation via Large Language Model},
  author={Lai, Xin and Tian, Zhuotao and Chen, Yukang and Li, Yanwei and Yuan, Yuhui and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2308.00692},
  year={2023}
}
@article{yang2023improved,
  title={An Improved Baseline for Reasoning Segmentation with Large Language Model},
  author={Yang, Senqiao and Qu, Tianyuan and Lai, Xin and Tian, Zhuotao and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2312.17240},
  year={2023}
}

Acknowledgement

  • This work is built upon LLaVA and SAM.

lisa's People

Contributors

chongruo, deepaicrazy, eltociear, enderfga, robert-zwr, tianzhuotao, x-lai, xbkaishui, yukang2017


lisa's Issues

loading checkpoint error

I can load the checkpoint correctly if I run train_ds.py directly, but when I launch it with deepspeed as in the given example, this error occurs. Can you tell me how to fix it?

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Loading checkpoint shards: 50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:10<00:10, 10.36s/it][2023-08-22 11:46:22,908] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3210
[2023-08-22 11:46:23,766] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3211
[2023-08-22 11:46:24,696] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3212
Loading checkpoint shards: 50%|███████████████████████████████████████████████████████████▌ | 1/2 [00:18<00:18, 18.49s/it][2023-08-22 11:46:25,962] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3213
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3214
[2023-08-22 11:46:26,758] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3215
[2023-08-22 11:46:27,597] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3216
[2023-08-22 11:46:28,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3218
[2023-08-22 11:46:28,827] [ERROR] [launch.py:321:sigkill_handler] ['/home/TianYunjie/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py', '--local_rank=7', '--version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1', '--dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/', '--vision_pretrained=sam_vit_h_4b8939.pth', '--dataset=sem_seg||refer_seg||vqa||reason_seg', '--sample_rates=9,3,3,1', '--exp_name=lisa-7b'] exits with return code = -9

and here is the related information:

[2023-08-22 11:45:50,020] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:52,164] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-22 11:45:52,224] [INFO] [runner.py:567:main] cmd = /home/TianYunjie/anaconda3/envs/lisa/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=24999 --enable_each_rank_log=None train_ds.py --version=liuhaotian/LLaVA-Lightning-7B-delta-v1-1 --dataset_dir=/home/ubuntu/Workspace/TianYunjie/datasets/LISA_datasets/ --vision_pretrained=sam_vit_h_4b8939.pth --dataset=sem_seg||refer_seg||vqa||reason_seg --sample_rates=9,3,3,1 --exp_name=lisa-7b
[2023-08-22 11:45:54,525] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-22 11:45:56,379] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}

About Multi-turn Conversation

Hi~ Nice work!
The paper mentioned that LISA has the capability of multi-turn conversation.
I would like to know how LISA acquires this capability.
Does the training instruction data contain multi-turn conversations? Are the previous inputs and outputs fed into the MLLM again in subsequent turns?

Error while training

Hello

I have downloaded the datasets as specified in the README

I have run the following command for training:

deepspeed --master_port=24999 train_ds.py \
  --version="PATH_TO_LLaVA" \
  --dataset_dir='./dataset' \
  --vision_pretrained="PATH_TO_SAM" \
  --exp_name="lisa-7b" \
  --weight='PATH_TO_pytorch_model.bin' \
  --eval_only

However, I receive the following error:

Traceback (most recent call last): File "/home/ameenali/anaconda3/envs/ameen/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop data = fetcher.fetch(index) File "/home/ameenali/anaconda3/envs/ameen/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/ameenali/anaconda3/envs/ameen/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp> data = [self.dataset[idx] for idx in possibly_batched_index] File "/home/ameenali/LISA/utils/dataset.py", line 244, in __getitem__ return *data[0], inference File "/home/ameenali/LISA/utils/sem_seg_dataset.py", line 284, in __getitem__ classes = [self.data2classes[ds][class_id - 1] for class_id in unique_label] File "/home/ameenali/LISA/utils/sem_seg_dataset.py", line 284, in <listcomp> classes = [self.data2classes[ds][class_id - 1] for class_id in unique_label] IndexError: index 182 is out of bounds for axis 0 with size 182

Any idea why this is happening ?

Failed to download saiapr_tc-12 dataset

Hi

Thank you for your impressive work. I am wondering if it is possible to share the saiapr_tc-12 dataset via something like Google Drive. The download speed of the web archive is extremely slow and unstable. It took me many hours and I still failed to download it.

Availability of ReasonSeg Testing Set

Hi

Thank you for your impressive work. In order to reproduce the results of your paper and do further research, could you let me know whether the testing set of ReasonSeg and the testing script will be released? Thank you in advance!

Kind regards

About the inference model

when I use "CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='fp16' --load_in_4bit

", there is an error:

self.mlp_output_mp(mp_replace, reversed_dim=reversed_dim)
File "/home/user/miniconda3/envs/python_3.9/lib/python3.9/site-packages/deepspeed/module_inject/containers/base.py", line 257, in mlp_output_mp
self.module.mlp.output_w = mp_replace.copy(self.module.mlp.output_w, self._4hh_w, int8=reversed_dim)
File "/home/user/miniconda3/envs/python_3.9/lib/python3.9/site-packages/deepspeed/module_inject/auto_tp.py", line 97, in copy
self.merge_assert(src_shape[inner_dim], dst_shape[self.in_dim])
File "/home/user/miniconda3/envs/python_3.9/lib/python3.9/site-packages/deepspeed/module_inject/auto_tp.py", line 31, in merge_assert
assert dim1 > dim2,
AssertionError: Merging tensors is not allowed here! Please use deepspeed load_checkpoint for merging your checkpoints before replacing the transformer layer with inference-kernels

Training Log

May I ask if you could provide the training log information? Additionally, we have noticed slower training speeds on 8 V100 GPUs. We are unsure of the reason for this.

DeepSpeed Load Problem

When I try to load a model using DeepSpeed with the following command:

CUDA_VISIBLE_DEVICES=0 python3 chat.py --version='xinlai/LISA-13B-llama2-v0' --precision='fp16' --load_in_4bit

it results in:

[2023-08-07 14:45:14,038] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:19<00:00,  6.41s/it]
[2023-08-07 14:45:46,446] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.10.0, git-hash=unknown, git-branch=unknown
[2023-08-07 14:45:46,449] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2023-08-07 14:45:46,450] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /home/dml/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/dml/.cache/torch_extensions/py39_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.5588159561157227 seconds
[2023-08-07 14:45:47,705] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 1, 'intermediate_size': 35389440, 'heads': 40, 'num_hidden_layers': -1, 'dtype': torch.float16, 'pre_layer_norm': True, 'norm_type': <NormType.RMSNorm: 3>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': 0, 'rotate_half': True, 'rotate_every_two': False, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GATED_SILU: 4>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False}
Using /home/dml/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.08758425712585449 seconds
Traceback (most recent call last):
  File "/home/dml/yhq/LISA/chat.py", line 169, in <module>
    main(sys.argv[1:])
  File "/home/dml/yhq/LISA/chat.py", line 86, in main
    model_engine = deepspeed.init_inference(model=model, 
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 523, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 766, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 823, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 500, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 354, in replace_with_policy
    _container.apply_tensor_parallelism(mp_replace)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/containers/features/hybrid_engine.py", line 94, in apply_tensor_parallelism
    self.mlp_output_mp(mp_replace, reversed_dim=reversed_dim)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/containers/base.py", line 257, in mlp_output_mp
    self.module.mlp.output_w = mp_replace.copy(self.module.mlp.output_w, self._4hh_w, int8=reversed_dim)
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 109, in copy
    self.merge_assert(src_shape[inner_dim], dst_shape[self.in_dim])
  File "/home/dml/anaconda3/envs/lisa/lib/python3.9/site-packages/deepspeed/module_inject/replace_module.py", line 43, in merge_assert
    assert dim1 > dim2, \
AssertionError: Merging tensors is not allowed here! Please use deepspeed load_checkpoint            for merging your checkpoints before replacing the transformer layer with            inference-kernels

Memory Usage

Hi,
I tried to debug on a single 3090 (24 GB) with the default settings, but it reports that CUDA is out of memory.
How can I run this on a single GPU?

Further questions to the previous issue

Thanks for helping me out with the previous issue (issues/41).
Is v0 the version before the code rewrite on August 23rd? I actually use v0, so if I want to reproduce the non-ft results in the paper, I have to remove the ReasonSeg dataset from the training set, right?
To reproduce the results of the ft version in the paper, how should I configure the fine-tuning on ReasonSeg, including the number of training steps and the learning rate?

The number of pred_mask is inconsistent with the number of gt_mask

File "/home/LISA/model/LISA.py", line 336, in model_forward
gt_mask.shape[0] == pred_mask.shape[0]
AssertionError: gt_mask.shape: torch.Size([3, 427, 640]), pred_mask.shape: torch.Size([6, 427, 640])

I found that the cause of this problem is (in train_ds.py, line 129):
args.seg_token_idx = tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
since the token ids for [SEG] should be 29871, 32000, but only 29871 is used here.

However, the token ids for <im_end> are 29871, 32002, which also begin with 29871. This causes the model to output twice as many masks as there are GT masks.

I get this error by running the following code:
deepspeed --master_port=24999 train_ds.py \
  --version=/home/LLaVA/checkpoints/llava-7b-llama-2-7b-chat \
  --dataset_dir='dataset/' \
  --vision_pretrained=./pretrain_weights/sam_vit_h_4b8939.pth \
  --dataset=refer_seg \
  --refer_seg_data=refcoco \
  --sample_rates=1 \
  --conv_type=llava_llama_2 \
  --exp_name=lisa-7b \
  --load_in_4bit

The LLaVA model has been tested and is correct.
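
A quick, hedged way to inspect the tokenization claim above for your own LLaVA tokenizer (illustrative only, not an official fix; the model path is a placeholder):

```python
from transformers import AutoTokenizer

# Check how "[SEG]" is tokenized once it has been added to the vocabulary (illustrative).
tokenizer = AutoTokenizer.from_pretrained("PATH_TO_LLaVA", use_fast=False)
tokenizer.add_tokens("[SEG]")
ids = tokenizer("[SEG]", add_special_tokens=False).input_ids
print(ids)       # e.g. [29871, 32000]: a leading SentencePiece whitespace id plus the new token id
print(ids[-1])   # the last id corresponds to the actual [SEG] token
```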

train project

Command used: deepspeed --master_port=24999 train_ds.py --version='./llava_path' --dataset_dir='./dataset' --vision_pretrained='sam_weights' --dataset='ade20k' --sample_rates="9,3,3,1" --exp_name="lisa-7b" --load_in_8bit

training data size/volume

Hello! I would like to know the approximate amount of data in the semantic segmentation, referring segmentation, and VQA datasets, i.e., the number of images, masks, and instructions. Thanks!

For Llama2, how should we use "liuhaotian/llava-llama-2-7b-chat-lightning-lora-preview"?

We want to use Llama2, so following the source code we tried "liuhaotian/llava-llama-2-7b-chat-lightning-lora-preview", but the following problem arose. How can I solve it?

Exception raised: KeyError
'LlavaConfig'
File "/mnt/21T/zhangyupeng/code/LISA/train_ds.py", line 121, in main
tokenizer = transformers.AutoTokenizer.from_pretrained(
File "/mnt/21T/zhangyupeng/code/LISA/train_ds.py", line 585, in
main(sys.argv[1:])
KeyError: 'LlavaConfig'

Compute requirement

Hello, excellent work by the team. May I know what sort of computational resources (number of GPUs, RAM, etc.) I will need both to reproduce LISA and to fine-tune it further? Thanks.

ReasonSeg Instructions

Great work!!!
I was wondering if you could shed some light on how you annotated the reasoning segmentation image-instruction pairs.
Section 3.2 didn't elaborate much on the formulation of the text instructions.

Image preprocess

Hi, I notice that the image input for the LLM goes through the CLIP preprocessor, which involves a center-crop operation. Some visual information for the LLM might therefore be discarded. Would this affect the segmentation result if the target is not contained, or only partially contained, in the center-crop region?

Token indices sequence length is longer than the specified maximum sequence length

When we run the following command

deepspeed --master_port=24999 train_ds.py \
    --version LLaVA/LLaVA-Lightning-7B-v1-1 \
    --dataset_dir dataset \
    --vision_pretrained dataset/sam/sam_vit_h_4b8939.pth \
    --dataset "sem_seg||refer_seg||vqa||reason_seg" \
    --sample_rates 9,3,3,1 \
    --refer_seg_data "refcoco||refcoco+||refcocog" \
    --exp_name lisa-7b

The following warning is printed repeatedly

Token indices sequence length is longer than the specified maximum sequence length for this model (564 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (738 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (781 > 512). Running this sequence through the model will result in indexing errors
Epoch: [0][  1/500]     Time 112.567 (124.016)  Loss 3.0156 (6.4026)    CeLoss 3.0156 (6.4026)  MaskLoss 1.4305 (1.5772)        MaskBCELoss 0.9352 (1.1417)     MaskDICELoss 0.4953 (0.4355)
Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors
Epoch: [0][  2/500]     Time 113.307 (113.307)  Loss 9.5625 (5.7506)    CeLoss 9.5625 (5.7506)  MaskLoss 2.5348 (1.3935)        MaskBCELoss 2.0431 (0.9253)     MaskDICELoss 0.4916 (0.4682)
Token indices sequence length is longer than the specified maximum sequence length for this model (626 > 512). Running this sequence through the model will result in indexing errors

We cannot find the location that prints the warning. Will it be a problem? Any help would be appreciated.

pascal_part train.json

Thanks for sharing this work!

One question: where can I get the train.json file for pascal_part? And is VOCdevkit the one for PASCAL 2010 or 2012?

Thanks!

load model error

When I try to load a model from a local path, using this local model: https://huggingface.co/xinlai/LISA-13B-llama2-v0/tree/main

LlavaLlamaForCausalLM.from_pretrained(pretrained_model_name_or_path="xxx/xxx/LISA-13B-llama2-v0")

it results in:

Loading checkpoint shards: 0%
*** OSError: Unable to load weights from pytorch checkpoint file for
"xxx/xxx/LISA-13B-llama2-v0/pytorch_model-00001-of-00003.bin"
at "xxx/xxx/LISA-13B-llama2-v0/pytorch_model-00001-of-00003.bin".
If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Environment: torch 2.0.1, torchvision 0.15.2, CUDA 11.7, Python 3.8, and DeepSpeed 0.9.0.

The model "llava-7b-llama-2-7b-chat" merged by myself had problems during training.

Hello, we merged the model "zhangyupeng/llava-7b-llama-2-7b-chat" ourselves. Two 3090 GPUs are used for training, with batch_size=2 and grad_accumulation_steps=40. The following problem appears during training. Is this caused by our self-merged model?

Traceback (most recent call last):
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/zhangyupeng/anaconda3/envs/lisa/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/mnt/21T/zhangyupeng/code/LISA/utils/dataset.py", line 135, in collate_fn
    assert cur_len == total_len
AssertionError

[2023-09-05 20:58:14,118] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77018
[2023-09-05 20:58:14,119] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 77019
[2023-09-05 20:58:15,023] [ERROR] [launch.py:321:sigkill_handler] ['/home/zhangyupeng/anaconda3/envs/lisa/bin/python', '-u', 'train_ds.py',
'--local_rank=1'] exits with return code = 1

How to run inference with multiple GPUs?

I tried to use the following command for inference:

CUDA_VISIBLE_DEVICES=0 python chat.py --version='xinlai/LISA-13B-llama2-v1'

But it shows CUDA out of memory.

How can I use multiple GPUs for inference?
I tried "CUDA_VISIBLE_DEVICES=4" and also "CUDA_VISIBLE_DEVICES=0,1,2,3", but neither approach works.

training time

Hello! You said in your paper that "training LISA-7B requires only 10,000 training steps on 8 NVIDIA 24G 3090 GPUs". I wonder how many days/hours it took to train the LISA-7B model? Thanks!

Number of <seg> in the sequence

Hi, I'm curious about the number of <seg> tokens that can be predicted in a sequence.
Based on the test demo, it appears that only one <seg> can be present in the sequence.

And can LISA segment multiple instances in a single round?

Is there batch evaluation code?

The chat script requires interactive use for inference. Did your team use a batch evaluation script during training, and if so, could it be made public? Thanks!

Some questions about reproducing the results

Hello,
I'm attempting to reproduce the performance data from the paper. I currently have only four GPUs and I have a few questions:

  1. How much of a performance difference is there between SAM's "huge" model and the "large" model?
  2. Approximately how much data should I run to achieve convergence, and how many training hours does it take?
  3. Would it be possible for you to share the training logs from your experiments?

Thanks

The necessity of the VQA dataset.

Nice work!!
In Table 5, why is the VQA dataset not ablated?
The paper explains that "To preserve the original Visual Question Answering (VQA) ability of the multi-modal LLM, we also include the VQA dataset during training."
In other words, when there is no need to explain the reason for the result and only the target object needs to be located, is the VQA dataset still required?

Confidence for Masks?

Is there any confidence value I can get for the object found?
For example, in the case of multiple objects detected for a given prompt, how can I choose the most probable one?

The training script and the description in the paper are different.

Nice work! I notice that you present two versions of results in Table 1 of the main paper. One version is called (ft) and is described as training without any ReasonSeg data first and then fine-tuning only on the 239 ReasonSeg training samples. But I notice that the training script in this repo uses ReasonSeg by default. The result I got using this script is between the (ft) version and the non-(ft) version. Is this result inconsistent with the paper?


The online demo does not seem to work well

Thank you for open-sourcing your work!
I tested three examples, but LISA didn't provide the correct responses. These examples weren't cherry-picked to find the weaknesses of LISA; I just randomly picked them from the internet.
