mbzuai-oryx / geochat
[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
Home Page: https://mbzuai-oryx.github.io/GeoChat
I ran into the following error:
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:11<00:11, 11.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  7.90s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:17<00:00,  8.51s/it]
Traceback (most recent call last):
File "/xxxxx/Documents/code/GeoChat/geochat/train/train_mem.py", line 13, in <module>
train()
File "/xxxxx/Documents/code/GeoChat/geochat/train/train.py", line 828, in train
model = GeoChatLlamaForCausalLM.from_pretrained(
File "/xxxxx/anaconda3/envs/geochat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
) = cls._load_pretrained_model(
File "/xxxxx/anaconda3/envs/geochat/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3310, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for GeoChatLlamaForCausalLM:
size mismatch for model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([577, 1024]) from checkpoint, the shape in current model is torch.Size([1297, 1024]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
The script for finetuning:
srun --jobid $SLURM_JOBID \
bash -c "python -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $SLURM_NNODES \
--node_rank $SLURM_PROCID \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
geochat/train/train_mem.py \
--lora_enable True \
--model_name_or_path $CODE_DIR/llava-v1.5-7b/ \
--version $PROMPT_VERSION \
--data_path $DATASET_DIR/GeoChat_Instruct.json \
--image_folder $DATASET_DIR/share/softwares/kartik/GeoChat_finetuning/final_images_llava/ \
--vision_tower openai/clip-vit-large-patch14-336/ \
--mm_projector_type mlp2x_gelu \
--pretrain_mm_mlp_adapter $CODE_DIR/llava-v1.5-7b/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir $OUTPUT_DIR \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy 'no' \
--save_strategy 'epoch' \
--save_steps 10000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type 'cosine' \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--dataloader_num_workers 16 \
--report_to wandb \
--deepspeed ./scripts/zero2.json"
Please note that I use the latest commit.
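For context, here is my reading of where the 577-vs-1297 mismatch comes from (a sketch of the idea, not the repository's actual clip_interpolate_embeddings): GeoChat enlarges the CLIP ViT-L/14-336 input from 336x336 to 504x504, so the 24x24 patch grid becomes 36x36 and the position embedding must grow from 24*24+1 = 577 rows to 36*36+1 = 1297 rows, e.g. by bicubic interpolation of the patch grid:

import torch
import torch.nn.functional as F

def interpolate_clip_pos_embedding(pos_embedding, image_size=504, patch_size=14):
    # pos_embedding: (577, 1024) = 1 CLS row + 24*24 patch rows
    cls_row, grid = pos_embedding[:1], pos_embedding[1:]
    old_side = int(grid.shape[0] ** 0.5)   # 24
    new_side = image_size // patch_size    # 36
    # (576, 1024) -> (1, 1024, 24, 24) so F.interpolate can resize the grid
    grid = grid.reshape(old_side, old_side, -1).permute(2, 0, 1).unsqueeze(0)
    grid = F.interpolate(grid, size=(new_side, new_side), mode="bicubic", align_corners=False)
    # back to (1296, 1024), then prepend the CLS row -> (1297, 1024)
    grid = grid.squeeze(0).permute(1, 2, 0).reshape(new_side * new_side, -1)
    return torch.cat([cls_row, grid], dim=0)

The checkpoint being loaded still carries the original 577-row table while the freshly built model already expects the 1297-row one, hence the size mismatch inside from_pretrained.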
I am using finetune_lora.sh with zero3_offload.json to train (context below) and get the following error.
Traceback (most recent call last):
  File "/deep/u/emily712/GeoChat/geochat/train/train_mem.py", line 13, in <module>
    train()
  File "/deep/u/emily712/GeoChat/geochat/train/train.py", line 886, in train
    model.get_model().initialize_vision_modules(
  File "/deep/u/emily712/GeoChat/geochat/model/geochat_arch.py", line 62, in initialize_vision_modules
    vision_tower.load_model()
  File "/deep/u/emily712/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 103, in load_model
    self.clip_interpolate_embeddings(image_size=504, patch_size=14)
  File "/deep/u/emily712/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 30, in clip_interpolate_embeddings
    n, seq_length, hidden_dim = pos_embedding.shape
ValueError: not enough values to unpack (expected 3, got 2)
Further examination shows that the issue is that the CLIP weights are not loaded at the time of positional interpolation. When I load CLIP via CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336") within deepspeed, none of the model weights are loaded (i.e., they are tensors of size zero). Running
from transformers import CLIPVisionModel
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
state_dict = vision_tower.vision_model.embeddings.position_embedding.state_dict()
pos_embedding = state_dict['weight']
print("pos embedding shape: ", pos_embedding.shape)
under deepspeed, inside the CLIPVisionTower.load_model() method, prints torch.Size([0]), whereas running the same lines in a program without deepspeed, or in a Python shell, yields torch.Size([577, 1024]), which is the correct size.
Expected behavior
pos_embedding should have shape torch.Size([577, 1024]).
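A plausible workaround (my own sketch, untested against the repo): under ZeRO-3 / zero3_offload, DeepSpeed partitions parameters and exposes them as zero-size tensors until they are gathered, which is why pos_embedding shows up as torch.Size([0]). Gathering the weight around the interpolation should restore the full view:

import deepspeed

pos_emb = vision_tower.vision_model.embeddings.position_embedding.weight
# Temporarily materialize the partitioned parameter on this rank;
# modifier_rank=0 lets rank 0 modify it and broadcasts the result.
with deepspeed.zero.GatheredParameters([pos_emb], modifier_rank=0):
    print(pos_emb.shape)  # torch.Size([577, 1024]) once gathered
    # ... run the positional-embedding interpolation here ...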
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/deep/group/aicc-bootcamp/packages/miniconda3/envs/vllava/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/deep/group/aicc-bootcamp/packages/miniconda3/envs/vllava/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.13.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 251.77 GB
Hi, I use the script to evaluate on the grounding task, but the prediction jsonl file contains obviously wrong answers. For example, the first row is:
{"question_id": "fast_6217", "image_id": "train_5007_0017", "answer": "{<89><47><97><55>|<58>}{<50><24><54><28>|<58>}{<48><16><52><20>|<58>}", "ground_truth": [[[584.0, 337.0], [619.0, 313.0], [601.0, 282.0], [565.0, 304.0]], [[553.0, 287.0], [592.0, 262.0], [573.0, 229.0], [534.0, 254.0]], [[517.0, 237.0], [555.0, 214.0], [534.0, 181.0], [498.0, 204.0]]], "question": "3 airplanes at the right", "type": "ref", "dataset": "FAST", "obj_ids": [1, 2, 3], "size_group": "small"}
The difference between the answer and the ground truth is too large. Is this normal?
Thanks!
I see that both the eval code and the demo code seem to accept just one image.
https://github.com/mbzuai-oryx/GeoChat/blob/main/geochat/conversation.py
https://github.com/mbzuai-oryx/GeoChat/blob/main/geochat/eval/batch_geochat_grounding.py
Is there any method to accept two images and one question and get a single response (for example, finding the differences between two images)?
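As far as I can tell this is not supported out of the box. One crude workaround (a sketch of an idea, not a repo feature) is to paste the two images side by side and ask about the combined image through the existing single-image API:

from PIL import Image

def hstack(path_a, path_b):
    # Paste two images side by side on a shared canvas.
    a, b = Image.open(path_a), Image.open(path_b)
    canvas = Image.new("RGB", (a.width + b.width, max(a.height, b.height)))
    canvas.paste(a, (0, 0))
    canvas.paste(b, (a.width, 0))
    return canvas

hstack("before.png", "after.png").save("pair.png")  # then ask about pair.png

The model was not trained on such composites, so answers about cross-image differences may be unreliable.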
When I use ZeRO-2 for training, I meet the following issue. Does anyone face the same issue, and has anyone been able to resolve it?
wandb: Currently logged in as: huduu. Use 'wandb login --relogin' to force relogin.
wandb: Network error (ReadTimeout), entering retry loop.
wandb: ERROR Run initialization has timed out after 90.0 sec.
wandb: ERROR Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
Traceback (most recent call last):
File "/media/kou/Data2/ty/GeoChat/geochat/train/train.py", line 960, in
train()
File "/media/kou/Data2/ty/GeoChat/geochat/train/train.py", line 938, in train
trainer.train()
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/trainer.py", line 1752, in _inner_training_loop
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/integrations.py", line 760, in on_train_begin
self.setup(args, state, model, **kwargs)
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/transformers/integrations.py", line 734, in setup
self._wandb.init(
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1185, in init
wandb._sentry.reraise(e)
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 155, in reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1171, in init
return wi.init()
File "/home/junjie/.conda/envs/ty_GeoChat/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 776, in init
raise error
wandb.errors.CommError: Run initialization has timed out after 90.0 sec.
Please refer to the documentation for additional information: https://docs.wandb.ai/guides/track/tracking-faq#initstarterror-error-communicating-with-wandb-process-
I'm trying to reproduce it for scene categorization, but it reports an error in batch_geochat_scene.py.
Here is the reported error:
Traceback (most recent call last):
File "/opt/data/private/RS-MLLM/GeoChat/./geochat/eval/batch_geochat_scene.py", line 149, in
eval_model(args)
File "/opt/data/private/RS-MLLM/GeoChat/./geochat/eval/batch_geochat_scene.py", line 102, in eval_model
image_tensor_batch = image_processor.preprocess(image_folder,crop_size ={'height': 504, 'width': 504},size = {'shortest_edge': 504}, return_tensors='pt')['pixel_values']
AttributeError: 'NoneType' object has no attribute 'preprocess'
I checked image_processor, and it is assigned None in builder.py.
Thank you for your excellent work. I encountered some problems when trying to use the geochat_demo.py file. Although I tried changing the conda environment many times and switching computers and servers, my problem is still not solved, so I'm looking for your help here. My problems are:
In the "Install" section you wrote "Clone this repository and navigate to LLaVA folder". What is the "LLaVA folder"? Do I need to create a new folder myself? If so, what is the directory structure of this folder relative to the other files?
When I configure the environment according to your instructions and execute "python geochat_demo.py --model-path ./models", I encounter the following error:
(geochat) root@Y9000K:/mnt/d/GeoChat# python geochat_demo.py --model-path ./models
/root/anaconda3/envs/geochat/lib/python3.8/site-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Parallel'>: No known documentation group for module 'gradio.mix'
warnings.warn(f"Could not get documentation group for {cls}: {exc}")
/root/anaconda3/envs/geochat/lib/python3.8/site-packages/gradio_client/documentation.py:103: UserWarning: Could not get documentation group for <class 'gradio.mix.Series'>: No known documentation group for module 'gradio.mix'
warnings.warn(f"Could not get documentation group for {cls}: {exc}")
Initializing Chat
Loading checkpoint shards:  50%|█████     | 1/2 [02:06<02:06, 126.92s/it]
Traceback (most recent call last):
File "geochat_demo.py", line 53, in
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
File "/mnt/d/GeoChat/geochat/model/builder.py", line 125, in load_pretrained_model
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/root/anaconda3/envs/geochat/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/root/anaconda3/envs/geochat/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
) = cls._load_pretrained_model(
File "/root/anaconda3/envs/geochat/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3260, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/root/anaconda3/envs/geochat/lib/python3.8/site-packages/transformers/modeling_utils.py", line 717, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/root/anaconda3/envs/geochat/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([577, 1024]) in "weight" (which has shape torch.Size([1297, 1024])), this look incorrect.
If I do not use the GeoChat-7B model but directly use "facebook/opt-350m", the interface loads, but nothing happens when I click the "send" button.
Sorry to bother you; I haven't been able to solve this problem even though I have tried many times. I hope I can get your help.
Hi, thanks for your great work!
I had some issues when launching the demo, as no image_processor was loaded by default (the same bug as a comment mentioned under the YouTube demo video, IIRC).
I found a workaround by renaming the model downloaded from HF to "llava" (instead of geochat-7B) and by adding two lines of code to clip_encoder.py, around line 86:
else:
    self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)  ## EDIT
    self.vision_tower.requires_grad_(False)  ## EDIT
    self.clip_interpolate_embeddings(image_size=504, patch_size=14)
There may be a simpler fix, but this worked for me and I could play with the demo.
code please T^T
Hi authors,
Nice work.
When running the evaluation code, I find that the output is a JSON file.
My question: how are the metrics in Tables 7, 8, and 9 calculated?
Would you be willing to provide the code for computing the metrics?
Thank you,
ywsun
python geochat/eval/batch_geochat_grounding.py \
    --model-path /path/to/model \
    --question-file path/to/jsonl/file \
    --answer-file path/to/output/jsonl/file \
    --image_folder path/to/image/folder/
python geochat/eval/batch_geochat_referring.py \
    --model-path /path/to/model \
    --question-file path/to/jsonl/file \
    --answer-file path/to/output/jsonl/file \
    --image_folder path/to/image/folder/
When I use the model for visual grounding, it responds with "answer": "{<91><46><94><50>|<40>}". What do these numbers mean?
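For what it's worth, my reading of the format (an assumption based on the paper's box representation, not official code) is {<x_min><y_min><x_max><y_max>|<angle>}: a rotated bounding box whose corners are normalized to [0, 100] of the image width/height, plus a quantized rotation angle. A small decoder sketch:

import re

def decode_geochat_boxes(answer, img_w, img_h):
    # Parse every "{<x1><y1><x2><y2>|<a>}" group and rescale corners to pixels.
    boxes = []
    for x1, y1, x2, y2, angle in re.findall(
            r"\{<(\d+)><(\d+)><(\d+)><(\d+)>\|<(\d+)>\}", answer):
        boxes.append({
            "box_px": (int(x1) / 100 * img_w, int(y1) / 100 * img_h,
                       int(x2) / 100 * img_w, int(y2) / 100 * img_h),
            "angle": int(angle),
        })
    return boxes

print(decode_geochat_boxes("{<91><46><94><50>|<40>}", img_w=1024, img_h=1024))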
Hi, authors, thanks for your great work!
What is the minimum memory required during the training process? Is it possible to use 4x 4090 GPUs?
Hello, when I want to use LoRA for fine-tuning, no matter how I lower the parameters, the following error is reported. I am using 8x A40:
(geochat) root@2170f15b1d25:/home/GeoChat/scripts# bash finetune_lora.sh
[2024-03-20 09:49:44,405] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:46,776] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-03-20 09:49:46,776] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/geochat/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=21205 --enable_each_rank_log=None /home/GeoChat/geochat/train/train_mem.py --deepspeed /home/GeoChat/scripts/zero2.json --lora_enable True --model_name_or_path /home/LLaVA/llava-v1.5-7b --version v1 --data_path /home/LLaVA-HR/NEWrailwaytrain.json --image_folder /home/LLaVA/data --vision_tower /home/LLaVA/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --pretrain_mm_mlp_adapter /home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --bf16 True --output_dir /home/GeoChat/checkpoints_dir --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy epoch --save_steps 7000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --lazy_preprocess True --dataloader_num_workers 4 --report_to wandb
[2024-03-20 09:49:48,460] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.17.1-1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2024-03-20 09:49:50,889] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1
[2024-03-20 09:49:50,890] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-03-20 09:49:50,890] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-03-20 09:49:50,890] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-03-20 09:49:50,890] [INFO] [launch.py:163:main] dist_world_size=8
[2024-03-20 09:49:50,890] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-03-20 09:49:54,389] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,650] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,717] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,718] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,741] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:54,752] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-20 09:49:55,040] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,040] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,284] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,284] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,336] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,336] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-03-20 09:49:55,340] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,340] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,349] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,349] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,361] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,361] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,362] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,362] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-03-20 09:49:55,411] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-03-20 09:49:55,411] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type geochat. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████| 2/2 [02:24<00:00, 72.26s/it]
Adding LoRA adapters...
Formatting inputs...Skip in lazy mode
[2024-03-20 09:57:14,398] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7453
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7454
[2024-03-20 09:57:14,401] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7455
[2024-03-20 09:57:15,060] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7456
[2024-03-20 09:57:15,062] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7457
[2024-03-20 09:57:15,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7458
[2024-03-20 09:57:15,885] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7459
[2024-03-20 09:57:15,887] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7460
[2024-03-20 09:57:15,888] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/geochat/bin/python', '-u', '/home/GeoChat/geochat/train/train_mem.py', '--local_rank=7', '--deepspeed', '/home/GeoChat/scripts/zero2.json', '--lora_enable', 'True', '--model_name_or_path', '/home/LLaVA/llava-v1.5-7b', '--version', 'v1', '--data_path', '/home/LLaVA-HR/NEWrailwaytrain.json', '--image_folder', '/home/LLaVA/data', '--vision_tower', '/home/LLaVA/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--pretrain_mm_mlp_adapter', '/home/LLaVA/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--bf16', 'True', '--output_dir', '/home/GeoChat/checkpoints_dir', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'epoch', '--save_steps', '7000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4', '--report_to', 'wandb'] exits with return code = -7
Thanks for the amazing work. I am trying to explore GeoChat on various tasks and I am stuck on the demo. I would appreciate it a lot if you could help me debug the error.
There is no error message in the terminal, and I have followed the instructions in the README to set everything up (updated to the latest and correct versions of transformers, etc.).
Hello, you provide a demo built with Gradio in your project. Now I want to load the weights directly through transformers and use GeoChat, but loading with AutoModelForCausalLM failed. After analysis, I found that you may have written a custom GeoChatMetaForCausalLM class.
Can you provide a simple example of using GeoChat with transformers?
Thank you very much!
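A minimal sketch of what tends to work for LLaVA-style forks (an assumption, not documented usage): AutoModelForCausalLM does not know the custom "geochat" model type, so use the repo's own loader, the one visible in the geochat_demo.py traceback earlier in this thread:

from geochat.model.builder import load_pretrained_model
from geochat.mm_utils import get_model_name_from_path  # assumed to mirror LLaVA's helper

model_path = "MBZUAI/geochat-7B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path)
)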
Hello, I can't unzip the data; it always shows data corruption. Can you fix it? Thank you very much.
When will the code and dataset be made public? I want to study them against the paper. Thank you so much.
The normalization and quantization of the width, height, center x, and center y are described in the paper.
However, the angle is not.
How did you normalize and quantize the angle?
Hello! I use finetune_lora.sh and set the following:
--model_name_or_path liuhaotian/llava-v1.5-7b \
--vision_tower openai/clip-vit-large-patch14-336 \
and got these:
File "/home/wucz/remote-sensing/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 97, in init
self.clip_interpolate_embeddings(image_size=504, patch_size=14)
File "/home/wucz/remote-sensing/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 34, in clip_interpolate_embeddings
n, seq_length, hidden_dim = pos_embedding.shape
ValueError: not enough values to unpack (expected 3, got 2)
pos_embedding = state_dict['weight']
print(pos_embedding.shape)  # torch.Size([0])
pos_embedding = pos_embedding.unsqueeze(0)
print(pos_embedding.shape)  # torch.Size([1, 0])
n, seq_length, hidden_dim = pos_embedding.shape
Which setting did I get wrong that prevents the model weights from being read?
I'm trying to reproduce its scene categorization, but there is a name in the code that is not defined.
Below is the reported error:
Traceback (most recent call last):
File "/opt/data/private/RS-MLLM/GeoChat/./geochat/eval/batch_geochat_scene.py", line 139, in
eval_model(args)
File "/opt/data/private/RS-MLLM/GeoChat/./geochat/eval/batch_geochat_scene.py", line 49, in eval_model
questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
NameError: name 'get_chunk' is not defined
It's an interesting study, but hard to reproduce without code.
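A likely fix (my own sketch; these helpers match the chunking convention used by LLaVA-style eval scripts, which batch_geochat_scene.py appears to assume) is to define the missing functions at the top of the script:

import math

def split_list(lst, n):
    # Split lst into n roughly equal-sized chunks.
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    # Return the k-th of n chunks.
    return split_list(lst, n)[k]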
Greetings,
Thank you very much for open-sourcing this outstanding work!
I run the demo locally using the instructions in LoRA.md, but I'm experiencing a bug: when I select an image and enter a question, I get an error:
AttributeError: 'NoneType' object has no attribute 'image_mean'
Hi,
Given that the LoveDA dataset was used to train your model on classes like buildings and roads, can your model still be used for commercial purposes? If not, what other options are available? Thanks in advance!
I have followed the instructions of finetune_lora.sh and got the trained model.
This is my finetune_lora.sh:
#!/bin/bash
################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-v1.5-7b"
gpu_ids=0,1,2,3
################## VICUNA ##################
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:$gpu_ids geochat/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--lora_enable True \
--model_name_or_path pretrained_weights/llavav1.5-7b \
--version $PROMPT_VERSION \
--data_path ~/datasets/GeoChat_Instruct.json \
--image_folder ~/datasets/GeoChat_finetuning/final_images_llava \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--pretrain_mm_mlp_adapter pretrained_weights/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir /nfs/geochat_output/checkpoints_dir \
--num_train_epochs 1 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 1000 \
--save_total_limit 5 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--dataloader_num_workers 0 \
--report_to wandb
Here is the saved LoRA fine-tuned model:
(base) ➜ checkpoints_dir tree
.
├── adapter_config.json
├── adapter_model.bin
├── checkpoint-3217
│   ├── adapter_config.json
│   ├── adapter_model.bin
│   ├── global_step3217
│   │   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
│   │   ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   └── mp_rank_00_model_states.pt
│   ├── latest
│   ├── README.md
│   ├── rng_state_0.pth
│   ├── rng_state_1.pth
│   ├── rng_state_2.pth
│   ├── rng_state_3.pth
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   ├── tokenizer.model
│   ├── trainer_state.json
│   ├── training_args.bin
│   └── zero_to_fp32.py
├── config.json
├── non_lora_trainables.bin
├── README.md
└── trainer_state.json
I don't know how to load this model; I didn't find it in the README.md. Can anyone help me? Thank you!
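A loading sketch, assuming GeoChat's builder follows LLaVA's LoRA convention (a model_name containing "lora" plus a model_base triggers loading the adapter on top of the base weights); this is my guess, not documented behavior:

from geochat.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="/nfs/geochat_output/checkpoints_dir",  # LoRA output dir from the script above
    model_base="pretrained_weights/llavav1.5-7b",      # base model used for fine-tuning
    model_name="geochat-lora",                         # "lora" in the name selects the adapter path
)

Alternatively, merging the adapter into a standalone checkpoint first (as a later post in this thread does with scripts/merge_lora_weights.py) avoids passing model_base at inference time.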
I try to run the demo code but get a CUDA error from
streamer = chat.stream_answer(conv=chat_state,
img_list=img_list,
temperature=temperature,
max_new_tokens=500,
max_length=2000)
This is the error:
...
File "venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 385, in forward
hidden_states, attn_weights = self.self_attn(
^^^^^^^^^^^^^^^
File "venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "venv/lib/python3.11/site-packages/transformers/models/clip/modeling_clip.py", line 324, in forward
attn_output = torch.bmm(attn_probs, value_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
I assume that this error is related to the CUDA and/or torch version. These are the relevant packages and versions I installed (torch 2.0.1 with CUDA 11.7):
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
torch 2.0.1
torchvision 0.15.2
transformers 4.31.0
Can you share the versions you are using? I tested three different versions but always got errors.
Thank you for your excellent work. Could you please disclose how the metrics for each task are calculated?
Below are my code and results for evaluating region-caption performance using the geochat-7B weights, but the results are quite different from Table 10 in the paper. Where is the problem? Thank you.
Hello, I'm wondering about the minimum GPU memory required for training. Could you provide some information on this?
Thank you for your excellent work.
However, when I tested MiniGPT-v2 on LR, the results I got were not consistent with those in your paper, being 10% higher.
May I ask how you performed the test with MiniGPT-v2?
The data downloaded from Hugging Face is corrupted. After decompression there are only 109332 images, which does not match GeoChat_Instruct.json.
I use finetune_lora.sh to train, with the following context:
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:0,1 geochat/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--lora_enable True \
--model_name_or_path /checkpoints/llava-v1.5-7b \
--version $PROMPT_VERSION \
--data_path /geochat/GeoChat_Instruct.json \
--image_folder /Dataset/geochat/final_images_llava \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--pretrain_mm_mlp_adapter /checkpoints/llava-v1.5-7b/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir ./out_checkpoints/geochat \
--num_train_epochs 1 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 7000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--dataloader_num_workers 16
and get the following error:
[ File "/project/GeoChat/geochat/model/geochat_arch.py", line 96, in encode_images
image_features = self.encode_images(images)
File "/project/GeoChat/geochat/model/geochat_arch.py", line 96, in encode_images
image_features = self.get_model().get_vision_tower()(images)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
image_features = self.get_model().get_vision_tower()(images)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
...
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 866, in forward
hidden_states = self.embeddings(pixel_values)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
hidden_states = self.embeddings(pixel_values)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 200, in forward
return forward_call(*args, **kwargs)
File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 200, in forward
embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (577) must match the size of tensor b (1297) at non-singleton dimension 1
embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (577) must match the size of tensor b (1297) at non-singleton dimension 1]
Will the referring and grounding question JSONL files for evaluation be released?
Initializing Chat
Traceback (most recent call last):
  File "/disk_sda/**/llava_project/GeoChat/geochat_demo.py", line 53, in <module>
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, args.load_8bit, args.load_4bit, device=args.device)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/builder.py", line 124, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/home/**/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/home/**/anaconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/language_model/geochat_llama.py", line 46, in __init__
    self.model = GeoChatLlamaModel(config)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/language_model/geochat_llama.py", line 38, in __init__
    super(GeoChatLlamaModel, self).__init__(config)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/geochat_arch.py", line 33, in __init__
    self.vision_tower = build_vision_tower(config, delay_load=True)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/multimodal_encoder/builder.py", line 9, in build_vision_tower
    return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 88, in __init__
    self.clip_interpolate_embeddings(image_size=504, patch_size=14)
  File "/disk_sda/**/llava_project/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 25, in clip_interpolate_embeddings
    state_dict = self.vision_tower.vision_model.embeddings.position_embedding.state_dict()
  File "/home/**/anaconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CLIPVisionTower' object has no attribute 'vision_tower'. Did you mean: 'vision_tower_name'?
Traceback (most recent call last):
File "/content/GeoChat/geochat/eval/batch_geochat_scene.py", line 139, in
eval_model(args)
File "/content/GeoChat/geochat/eval/batch_geochat_scene.py", line 49, in eval_model
questions = get_chunk(questions, args.num_chunks, args.chunk_idx)
NameError: name 'get_chunk' is not defined
Dear @salman-h-khan,
Thanks for your fantastic work on GeoChat; I am really interested in it, and the checkpoint you provided works for me.
However, when I tried to reproduce it as a beginner with LLMs, it turned out to be a bit confusing to conduct all the training/fine-tuning steps.
Could you please point out where I went wrong? I set finetune_lora.sh as follows and ran it:
################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-v1.5-7b"
gpu_ids=0,1,2,3
################## VICUNA ##################
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:$gpu_ids geochat/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--lora_enable True \
--model_name_or_path /data/.../geochat/llava-v1.5-7b \
--version $PROMPT_VERSION \
--data_path /data/.../geochat/GeoChat_Instruct.json \
--image_folder /data/.../geochat/final_images_llava \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_projector_type mlp2x_gelu \
--pretrain_mm_mlp_adapter /data/.../geochat/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--bf16 True \
--output_dir /data/.../outckpts/geochat_reproduce \
--num_train_epochs 1 \
--per_device_train_batch_size 18 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 7000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--dataloader_num_workers 16 \
--report_to wandb
Then I got the output folder geochat_reproduce with the following files:
python scripts/merge_lora_weights.py \
--model-path /data/.../geochat/outckpts/geochat_reproduce \
--model-base /data/.../geochat/llava-v1.5-7b \
--save-model-path /data/.../geochat/outckpts/merged
python geochat_demo.py \
--model-path /data/.../geochat/outckpts/merged
It then produces lots of errors, as follows:
To this end, I would like to ask if there are some mistakes in my reproduction or if some other steps are missing.
It would be super nice to receive some guidance from you.
Best regards and have a nice day,
Thank you very much for your work!
I discovered that the quantity of the open-source training data does not match the paper: using a global batch size of 144, I trained for 2144 iterations, while the paper indicates 2400.
I evaluated the VQA and scene classification tasks on the model fine-tuned with GeoChat_Instruct, and the results are pretty close to the metrics reported in the paper; however, the region captioning result is rather far from the paper's.
The official evaluation result:
My result:
Note that:
I wonder whether I did something wrong or whether the metric gap is caused by the stage-2 fine-tuning?
How do I increase the number of epochs during training?
"num_train_epoch" seems to be invalid.
Hello!
The Gradio interface linked in the README (https://3fa767b988e4136cd8.gradio.live/) is not running.
from datasets import load_dataset
dataset = load_dataset("MBZUAI/GeoChat_Instruct", split="train", streaming=True)
print(next(iter(dataset)))
root@donggeun-selfsup-747b74575d-sj9n6:/nas/k8s/dev/mlops/donggeun/tools/hf_dataset# python3 test.py
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
pa_table = paj.read_json(
File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/nas/k8s/dev/mlops/donggeun/tools/hf_dataset/test.py", line 4, in <module>
print(next(iter(dataset)))
File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 1384, in __iter__
for key, example in ex_iterable:
File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 282, in __iter__
for key, pa_table in self.generate_tables_fn(**self.kwargs):
File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
pa_table = pa.Table.from_pydict(mapping)
File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
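A possible workaround (a sketch; the exact filename inside the dataset repo is my assumption): GeoChat_Instruct.json is a LLaVA-style list of records with nested "conversations", which the datasets JSON builder sometimes fails to type-infer, hence the pyarrow errors. Downloading the raw file and parsing it with the standard json module sidesteps schema inference:

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="MBZUAI/GeoChat_Instruct",
    filename="GeoChat_Instruct.json",  # hypothetical filename within the repo
    repo_type="dataset",
)
with open(path) as f:
    records = json.load(f)
print(len(records), records[0].keys())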
Hi, thanks for your great work on GeoChat, which makes great progress for the RS community. We are looking forward to the open-sourced data and codebase. What is the schedule for the data release?
My question is as in the title:
Do we have to first train the model in order to run it accurately?
In finetune_lora.sh, the argument --mm_use_im_start_end is set to False.
However, based on the paper (see figure below), it should be True.
Furthermore, when I change this argument to True, the following error occurs:
Traceback (most recent call last):
File "/xxx/Documents/code/geochat/geochat/train/train_mem.py", line 13, in <module>
train()
File "/xxx/Documents/code/geochat/geochat/train/train.py", line 952, in train
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
File "/xxx/Documents/code/geochat/geochat/model/geochat_arch.py", line 343, in initialize_vision_tokenizer
embed_tokens_weight = mm_projector_weights["model.embed_tokens.weight"]
KeyError: 'model.embed_tokens.weight'
[2024-03-20 16:15:45,873] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
config.json: 100%|██████████| 4.76k/4.76k [00:00<00:00, 11.9MB/s]
Traceback (most recent call last):
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 125, in <module>
eval_model(args)
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 32, in eval_model
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name)
File "/workspace/GeoChat/geochat/model/builder.py", line 124, in load_pretrained_model
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 46, in __init__
self.model = GeoChatLlamaModel(config)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 38, in __init__
super(GeoChatLlamaModel, self).__init__(config)
File "/workspace/GeoChat/geochat/model/geochat_arch.py", line 33, in __init__
self.vision_tower = build_vision_tower(config, delay_load=True)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/builder.py", line 9, in build_vision_tower
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 88, in __init__
self.clip_interpolate_embeddings(image_size=504, patch_size=14)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 25, in clip_interpolate_embeddings
state_dict = self.vision_tower.vision_model.embeddings.position_embedding.state_dict()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CLIPVisionTower' object has no attribute 'vision_tower'. Did you mean: 'vision_tower_name'?
root@dfbbfcc8de85:/workspace/GeoChat# sh /workspace/GeoChat/scripts/LR.sh
[2024-03-20 16:25:28,364] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 125, in <module>
eval_model(args)
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 32, in eval_model
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name)
File "/workspace/GeoChat/geochat/model/builder.py", line 124, in load_pretrained_model
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 46, in __init__
self.model = GeoChatLlamaModel(config)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 38, in __init__
super(GeoChatLlamaModel, self).__init__(config)
File "/workspace/GeoChat/geochat/model/geochat_arch.py", line 33, in __init__
self.vision_tower = build_vision_tower(config, delay_load=True)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/builder.py", line 9, in build_vision_tower
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 88, in __init__
self.clip_interpolate_embeddings(image_size=504, patch_size=14)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 25, in clip_interpolate_embeddings
state_dict = self.vision_tower_name.vision_model.embeddings.position_embedding.state_dict()
AttributeError: 'str' object has no attribute 'vision_model'
root@dfbbfcc8de85:/workspace/GeoChat# sh /workspace/GeoChat/scripts/LR.sh
[2024-03-20 16:26:28,295] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 125, in <module>
eval_model(args)
File "/workspace/GeoChat/geochat/eval/batch_geochat_vqa.py", line 32, in eval_model
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name)
File "/workspace/GeoChat/geochat/model/builder.py", line 124, in load_pretrained_model
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 46, in __init__
self.model = GeoChatLlamaModel(config)
File "/workspace/GeoChat/geochat/model/language_model/geochat_llama.py", line 38, in __init__
super(GeoChatLlamaModel, self).__init__(config)
File "/workspace/GeoChat/geochat/model/geochat_arch.py", line 33, in __init__
self.vision_tower = build_vision_tower(config, delay_load=True)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/builder.py", line 9, in build_vision_tower
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 88, in __init__
self.clip_interpolate_embeddings(image_size=504, patch_size=14)
File "/workspace/GeoChat/geochat/model/multimodal_encoder/clip_encoder.py", line 25, in clip_interpolate_embeddings
state_dict = self.vision_tower.vision_model.embeddings.position_embedding.state_dict()
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CLIPVisionTower' object has no attribute 'vision_tower'. Did you mean: 'vision_tower_name'?
The code:
python geochat/eval/batch_geochat_vqa.py \
--model-path /workspace/GeoChat/geochat-7B \
--question-file /workspace/GeoChat/eva/LR/LR_split_test_questions.json \
--answers-file /workspace/GeoChat/eva/LR/result/LR_Geochat.jsonl \
--image-folder /workspace/GeoChat/eva/LR/Imaages_LR/