
chat-univi's Introduction

If you like our project, please give us a star ⭐ on GitHub for the latest updates.


📣 News

  • [2024/04/05] We've revised the temporal evaluation performance of video understanding, resulting in an actual model performance of 47.9 instead of the previously stated 57.8. We sincerely apologize for any inconvenience our oversight may have caused you.
  • [2024/04/05] Chat-UniVi has been selected as a Highlight paper at CVPR 2024! (Top 3% of 11532 submissions).
  • [2024/02/27] Our Chat-UniVi has been accepted by CVPR 2024!
  • [2024/01/05] We enhance the video loading code by introducing support for variable-length videos. This improvement involves eliminating the previous zero-filling operation on the video. We find that this updated video loading method significantly boosts performance (Results).
  • [2023/12/05] The visualization script is available at VISUALIZATION.md.
  • [2023/11/22] ⚡ The online demo is available at Hugging Face Demo. Welcome to try!
  • [2023/11/22] The processed data is available at DATA.md.
  • [2023/11/21] 💡 We release Chat-UniVi-13B. Our proposed unified visual representation framework greatly reduces the number of visual tokens, so you can train a 13B unified image and video understanding model with full-parameter tuning directly on 8 A100 GPUs within 3 days. Chat-UniVi-13B achieves better performance (Results). The training code for Chat-UniVi-13B has been updated (TRAIN_AND_VALIDATE.md).
  • [2023/11/21] We provide inference code for video understanding and image understanding.
  • [2023/11/21] We enhance the video loading code by introducing support for variable-length videos. This improvement involves eliminating the previous zero-filling operation on the video. We find that this updated video loading method significantly boosts performance.
  • [2023/11/15] Code is available now! Welcome to watch 👀 this repository for the latest updates.

😮 Highlights

💡 Unified visual representation for image and video

We employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos.
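
As a rough illustration of the idea (not the repository's actual implementation), the sketch below merges redundant patch tokens into a small set of dynamic visual tokens using a plain k-means-style clustering; merge_tokens and num_clusters are illustrative names introduced here.

import torch


def merge_tokens(patch_features: torch.Tensor, num_clusters: int = 64, iters: int = 10) -> torch.Tensor:
    """Cluster (N, D) patch embeddings into (num_clusters, D) merged tokens (assumes N >= num_clusters)."""
    n, _ = patch_features.shape
    # Initialize cluster centers from randomly chosen patches.
    centers = patch_features[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each patch token to its nearest center.
        assign = torch.cdist(patch_features, centers).argmin(dim=1)
        # Move each center to the mean of its assigned patches.
        for k in range(num_clusters):
            mask = assign == k
            if mask.any():
                centers[k] = patch_features[mask].mean(dim=0)
    return centers  # far fewer tokens than the original N patches

In practice, the same merging interface can be applied within each frame for spatial detail and then across frames for temporal context, which is what lets a limited token budget cover both images and videos.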

🔥 Joint training strategy, making LLMs understand both image and video

Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications.
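
For intuition only, the joint training setup can be pictured as a single dataset that mixes image samples (one frame) and video samples (many frames) behind the same interface; the class below is a hypothetical sketch, not the repository's data pipeline.

import random
from torch.utils.data import Dataset


class MixedImageVideoDataset(Dataset):
    """Hypothetical mixed dataset: images are treated as single-frame videos."""

    def __init__(self, image_samples, video_samples):
        # Each sample is assumed to be a dict with "frames" (a list of frames,
        # length 1 for images, length T for videos) and a "conversation" field.
        self.samples = list(image_samples) + list(video_samples)
        random.shuffle(self.samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        # Both modalities share the same representation, so the model sees one
        # unified token interface whether the sample is an image or a video.
        return sample["frames"], sample["conversation"]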

🤗 High performance, complementary learning with image and video

Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.

⚡ Demo

Please first change the model path on line 15 of the demo script you are running. Then run the demo:

# For Chat-UniVi-7B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_7B:app --host 0.0.0.0 --port 8888

# For Chat-UniVi-13B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_13B:app --host 0.0.0.0 --port 8888

A conversation with both image and video

A conversation includes multiple videos

A conversation includes multiple images

A conversation includes the video

A conversation in Chinese

With a translation API, our model can also support Chinese conversations. We will add code to support Chinese conversations in future updates; a rough sketch of the idea is shown below.
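
A minimal sketch of the translation-wrapper idea, assuming you plug in your own translation service; translate and model_answer_fn below are hypothetical placeholders, not part of this repository.

def translate(text: str, target_lang: str) -> str:
    # Placeholder: call your preferred translation API here.
    raise NotImplementedError


def chat_in_chinese(model_answer_fn, question_zh: str) -> str:
    question_en = translate(question_zh, target_lang="en")  # Chinese -> English
    answer_en = model_answer_fn(question_en)                # query Chat-UniVi in English
    return translate(answer_en, target_lang="zh")           # English -> Chinese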

🚀 Main Results

Image understanding

Following LLaVA, we report the relative scores to GPT-4 for instruction-following questions.

| Methods        | LLM        | Conversation | Detail Description | Complex Reasoning | All  |
|----------------|------------|--------------|--------------------|-------------------|------|
| Chat-UniVi-7B  | Vicuna-7B  | 84.1         | 74.2               | 93.7              | 84.2 |
| Chat-UniVi-13B | Vicuna-13B | 84.1         | 79.4               | 94.7              | 86.1 |
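
For clarity, the relative score compares the GPT-4 judge's 1-10 rating of the model's answer with its rating of GPT-4's own answer to the same question; the snippet below is a minimal sketch of one way such ratings are aggregated, assuming the per-question judge scores are already collected.

def relative_score(pairs):
    """pairs: list of (model_rating, gpt4_rating) tuples from the GPT-4 judge."""
    return 100.0 * sum(m for m, _ in pairs) / sum(g for _, g in pairs)


print(round(relative_score([(7, 8), (9, 10), (8, 9)]), 1))  # 88.9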

Video understanding

Following Video-ChatGPT, we report the relative scores between the output of the model and the ground truth, with the assistance of GPT. It is worth noting that the results reported in Video-ChatGPT span a range from 0 to 5. To standardize the metrics, we normalize all scores to a scale of 0 to 100.

| Methods        | LLM        | Correct | Detail | Context | Temporal | Consistency |
|----------------|------------|---------|--------|---------|----------|-------------|
| Chat-UniVi-7B  | Vicuna-7B  | 57.8    | 58.2   | 69.2    | 47.9     | 56.2        |
| Chat-UniVi-13B | Vicuna-13B | 59.4    | 59.8   | 70.5    | -        | 60.6        |
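
Since the judge scores in the Video-ChatGPT protocol lie in [0, 5], normalizing to the 0-100 scale used above is simply a multiplication by 20:

def normalize_score(raw_score: float) -> float:
    """Map a 0-5 GPT judge score onto the 0-100 scale used in this table."""
    return raw_score / 5.0 * 100.0


print(normalize_score(3.0))  # 60.0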

ScienceQA

We report both zero-shot and fine-tuning results on the ScienceQA test set.

| Methods        | LLM        | Average | NAT   | SOC   | LAN   | TXT   | IMG   | NO    | G1-6  | G7-12 |
|----------------|------------|---------|-------|-------|-------|-------|-------|-------|-------|-------|
| Chat-UniVi-7B  | Vicuna-7B  | 88.78   | 88.50 | 93.03 | 85.91 | 88.51 | 85.97 | 88.15 | 88.88 | 88.60 |
| Chat-UniVi-13B | Vicuna-13B | 90.99   | 90.41 | 95.05 | 88.91 | 89.64 | 88.05 | 90.94 | 91.19 | 90.64 |

(Subject: NAT, SOC, LAN; Context Modality: TXT, IMG, NO; Grade: G1-6, G7-12.)

VideoQA

We follow the evaluation protocol in Video-ChatGPT, i.e., employing GPT-assisted evaluation to assess the capabilities of models.

| Methods                                | LLM Size | MSRVTT-QA Acc. | MSRVTT-QA Score | MSVD-QA Acc. | MSVD-QA Score | TGIF-QA Acc. | TGIF-QA Score | ActivityNet-QA Acc. | ActivityNet-QA Score |
|----------------------------------------|----------|----------------|-----------------|--------------|---------------|--------------|---------------|---------------------|----------------------|
| Video-LLaMA                            | 7B       | 29.6           | 1.8             | 51.6         | 2.5           | -            | -             | 12.4                | 1.1                  |
| LLaMA-Adapter                          | 7B       | 43.8           | 2.7             | 54.9         | 3.1           | -            | -             | 34.2                | 2.7                  |
| VideoChat                              | 7B       | 45.0           | 2.5             | 56.3         | 2.8           | 34.4         | 2.3           | 26.5                | 2.2                  |
| Video-ChatGPT                          | 7B       | 49.3           | 2.8             | 64.9         | 3.3           | 51.4         | 3.0           | 35.2                | 2.7                  |
| Video-LLaVA                            | 7B       | 59.2           | 3.5             | 70.7         | 3.9           | 70.0         | 4.0           | 45.3                | 3.3                  |
| Chat-UniVi-7B                          | 7B       | 54.6           | 3.1             | 65.0         | 3.6           | 60.3         | 3.4           | 45.8                | 3.2                  |
| Chat-UniVi-7B (new video loading code) | 7B       | 55.0           | 3.1             | 69.3         | 3.7           | 69.0         | 3.8           | 46.1                | 3.3                  |
| Chat-UniVi-7B v1.5                     | 7B       | 57.5           | 3.2             | 68.8         | 3.7           | 70.0         | 3.8           | 47.2                | 3.3                  |
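
The GPT-assisted protocol asks a judge model to return, for each question, a yes/no correctness verdict and a 0-5 quality score; the small sketch below shows how the Accuracy and Score columns can then be aggregated, assuming the judge outputs have already been collected into a list of dicts.

def aggregate(results):
    """results: e.g. [{"pred": "yes", "score": 4}, {"pred": "no", "score": 2}, ...]"""
    accuracy = 100.0 * sum(r["pred"].lower() == "yes" for r in results) / len(results)
    avg_score = sum(r["score"] for r in results) / len(results)
    return accuracy, avg_score


acc, score = aggregate([{"pred": "yes", "score": 4}, {"pred": "no", "score": 2}])
print(f"Accuracy: {acc:.1f}  Score: {score:.1f}")  # Accuracy: 50.0  Score: 3.0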

Hallucination Evaluation (POPE)

Our model also achieves impressive results in the object hallucination benchmark.

| Methods            | LLM Size | Random Acc. | Random F1 | Random Yes | Popular Acc. | Popular F1 | Popular Yes | Adversarial Acc. | Adversarial F1 | Adversarial Yes |
|--------------------|----------|-------------|-----------|------------|--------------|------------|-------------|------------------|----------------|-----------------|
| LLaVA              | 7B       | 72.16       | 78.22     | 76.29      | 61.37        | 71.52      | 85.63       | 58.67            | 70.12          | 88.33           |
| Video-LLaVA        | 7B       | 86.2        | 85.2      | 42.0       | 85.3         | 84.0       | 42.1        | 81.6             | 80.8           | 45.8            |
| Chat-UniVi-7B      | 7B       | 85.19       | 86.05     | 54.67      | 69.50        | 74.39      | 69.10       | 64.97            | 71.54          | 73.10           |
| Chat-UniVi-7B v1.5 | 7B       | 87.01       | 86.09     | 41.86      | 85.87        | 84.76      | 42.73       | 83.23            | 82.31          | 44.77           |
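
POPE poses binary yes/no questions about object presence, so the Accuracy, F1-Score, and Yes (fraction of "yes" answers) columns can be computed directly from predictions and ground-truth labels; the helper below is an illustrative sketch, not the official evaluation script.

def pope_metrics(preds, golds):
    """preds, golds: equal-length lists of "yes"/"no" strings."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(preds, golds))
    fp = sum(p == "yes" and g == "no" for p, g in zip(preds, golds))
    fn = sum(p == "no" and g == "yes" for p, g in zip(preds, golds))
    accuracy = sum(p == g for p, g in zip(preds, golds)) / len(preds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    yes_ratio = preds.count("yes") / len(preds)
    return 100 * accuracy, 100 * f1, 100 * yes_ratio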

😍 Visualization

Visualization for the image inputs

Visualization for the video inputs

🛠️ Requirements and Installation

Attention! If you are using a Windows system, please make sure to comment out deepspeed in pyproject.toml (line 20), as installing deepspeed may result in errors on Windows (see Link). Keep in mind that deepspeed is needed only for training; if you are only running inference, it is recommended to comment it out.

  • Python >= 3.10
  • Install required packages:
git clone https://github.com/PKU-YuanGroup/Chat-UniVi
cd Chat-UniVi
conda create -n chatunivi python=3.10 -y
conda activate chatunivi
pip install --upgrade pip
pip install -e .
# pip install ninja  # not needed if you only intend to perform inference
# pip install flash-attn --no-build-isolation  # not needed if you only intend to perform inference
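
After installation, an optional sanity check is to confirm that PyTorch sees your GPU and that the package imports; this is just a convenience snippet, not part of the official setup.

# Optional sanity check after installation.
import torch

print("CUDA available:", torch.cuda.is_available())

try:
    import ChatUniVi  # installed by `pip install -e .`
    print("ChatUniVi imported successfully")
except ImportError as e:
    print("ChatUniVi import failed:", e)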

🤖 API

We open-source the preprocessing code for all modalities. If you want to load the model from the Hugging Face model hub or from a local path, you can use the following code snippets.

Inference for Video Understanding

import torch
import os
from ChatUniVi.constants import *
from ChatUniVi.conversation import conv_templates, SeparatorStyle
from ChatUniVi.model.builder import load_pretrained_model
from ChatUniVi.utils import disable_torch_init
from ChatUniVi.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image
from decord import VideoReader, cpu
import numpy as np


def _get_rawvideo_dec(video_path, image_processor, max_frames=MAX_IMAGE_LENGTH, image_resolution=224, video_framerate=1, s=None, e=None):
    # speed up video decode via decord.

    if s is None:
        start_time, end_time = None, None
    else:
        start_time = int(s)
        end_time = int(e)
        start_time = start_time if start_time >= 0. else 0.
        end_time = end_time if end_time >= 0. else 0.
        if start_time > end_time:
            start_time, end_time = end_time, start_time
        elif start_time == end_time:
            end_time = start_time + 1

    if os.path.exists(video_path):
        vreader = VideoReader(video_path, ctx=cpu(0))
    else:
        print(video_path)
        raise FileNotFoundError

    fps = vreader.get_avg_fps()
    f_start = 0 if start_time is None else int(start_time * fps)
    f_end = int(min(1000000000 if end_time is None else end_time * fps, len(vreader) - 1))
    num_frames = f_end - f_start + 1
    if num_frames > 0:
        # T x 3 x H x W
        sample_fps = int(video_framerate)
        t_stride = int(round(float(fps) / sample_fps))

        all_pos = list(range(f_start, f_end + 1, t_stride))
        if len(all_pos) > max_frames:
            sample_pos = [all_pos[_] for _ in np.linspace(0, len(all_pos) - 1, num=max_frames, dtype=int)]
        else:
            sample_pos = all_pos

        patch_images = [Image.fromarray(f) for f in vreader.get_batch(sample_pos).asnumpy()]

        patch_images = torch.stack([image_processor.preprocess(img, return_tensors='pt')['pixel_values'][0] for img in patch_images])
        slice_len = patch_images.shape[0]

        return patch_images, slice_len
    else:
        print("video path: {} error.".format(video_path))


if __name__ == '__main__':
    # Model Parameter
    model_path = "Chat-UniVi/Chat-UniVi"  # or "Chat-UniVi/Chat-UniVi-13B"
    video_path = ${video_path}  # replace with the path to your video file

    # The number of visual tokens varies with the length of the video. "max_frames" is the maximum number of frames.
    # When the video is long, we uniformly downsample it so that the number of sampled frames does not exceed "max_frames".
    max_frames = 100

    # The number of frames retained per second in the video.
    video_framerate = 1

    # Input Text
    qs = "Describe the video."

    # Sampling Parameter
    conv_mode = "simple"
    temperature = 0.2
    top_p = None
    num_beams = 1

    disable_torch_init()
    model_path = os.path.expanduser(model_path)
    model_name = "ChatUniVi"
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)

    mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
    mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
    if mm_use_im_patch_token:
        tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
    if mm_use_im_start_end:
        tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    vision_tower = model.get_vision_tower()
    if not vision_tower.is_loaded:
        vision_tower.load_model()
    image_processor = vision_tower.image_processor

    if model.config.config["use_cluster"]:
        for n, m in model.named_modules():
            m = m.to(dtype=torch.bfloat16)

    # Check if the video exists
    if video_path is not None:
        video_frames, slice_len = _get_rawvideo_dec(video_path, image_processor, max_frames=max_frames, video_framerate=video_framerate)

        cur_prompt = qs
        if model.config.mm_use_im_start_end:
            qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN * slice_len + DEFAULT_IM_END_TOKEN + '\n' + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN * slice_len + '\n' + qs

        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(
            0).cuda()

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=video_frames.half().cuda(),
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                num_beams=num_beams,
                output_scores=True,
                return_dict_in_generate=True,
                max_new_tokens=1024,
                use_cache=True,
                stopping_criteria=[stopping_criteria])

        output_ids = output_ids.sequences
        input_token_len = input_ids.shape[1]
        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
        if n_diff_input_output > 0:
            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
        outputs = outputs.strip()
        if outputs.endswith(stop_str):
            outputs = outputs[:-len(stop_str)]
        outputs = outputs.strip()
        print(outputs)

Inference for Image Understanding

import torch
import os
from ChatUniVi.constants import *
from ChatUniVi.conversation import conv_templates, SeparatorStyle
from ChatUniVi.model.builder import load_pretrained_model
from ChatUniVi.utils import disable_torch_init
from ChatUniVi.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image


if __name__ == '__main__':
    # Model Parameter
    model_path = "Chat-UniVi/Chat-UniVi"  # or "Chat-UniVi/Chat-UniVi-13B"
    image_path = ${image_path}  # replace with the path to your image file

    # Input Text
    qs = "Describe the image."

    # Sampling Parameter
    conv_mode = "simple"
    temperature = 0.2
    top_p = None
    num_beams = 1

    disable_torch_init()
    model_path = os.path.expanduser(model_path)
    model_name = "ChatUniVi"
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)

    mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
    mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
    if mm_use_im_patch_token:
        tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
    if mm_use_im_start_end:
        tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    vision_tower = model.get_vision_tower()
    if not vision_tower.is_loaded:
        vision_tower.load_model()
    image_processor = vision_tower.image_processor

    # Check if the image exists
    if image_path is not None:
        cur_prompt = qs
        if model.config.mm_use_im_start_end:
            qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + '\n' + qs

        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

        image = Image.open(image_path)
        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=image_tensor.unsqueeze(0).half().cuda(),
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                num_beams=num_beams,
                max_new_tokens=1024,
                use_cache=True,
                stopping_criteria=[stopping_criteria])

        input_token_len = input_ids.shape[1]
        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
        if n_diff_input_output > 0:
            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
        outputs = outputs.strip()
        if outputs.endswith(stop_str):
            outputs = outputs[:-len(stop_str)]
        outputs = outputs.strip()
        print(outputs)

🗝️ Training & Validating

Please refer to TRAIN_AND_VALIDATE.md for training and validation instructions.

👍 Acknowledgement

  • LLaVA: the codebase we built upon; an efficient large language and vision assistant.
  • Video-ChatGPT: great work contributing the evaluation code and dataset.

🤝 Related Projects

  • Video-LLaVA: this framework exhibits remarkable interactive capabilities between images and videos.

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violations.

✏️ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@article{jin2023chatunivi,
  title={Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding}, 
  author={Peng Jin and Ryuichi Takanobu and Caiwan Zhang and Xiaochun Cao and Li Yuan},
  journal={arXiv preprint arXiv:2311.08046},
  year={2023}
}

✨ Contributors

eltociear, jpthu17

chat-univi's Issues

inference

Thank you very much for your work! Why does this problem occur during the inference process?
[Screenshot 2024-06-05 175153]

Replace Vicuna-7B with meta-llama/Meta-Llama-3-8B-Instruct

Thank you very much for your great work 🙏

I want to replace Vicuna-7B with meta-llama/Meta-Llama-3-8B-Instruct. I heard that Llama 3 performs better, so there may be some gains after the replacement.

But I encountered some image-dimension-related errors during the replacement process, such as indexSelectLargeIndex.
[screenshot]

My pretraining command is:

LLM_path is set to the Meta-Llama-3-8B-Instruct path.

deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=12345 \
ChatUniVi/train/train_mem.py \
--deepspeed scripts/zero3.json \
--model_name_or_path ${LLM_path} \
--version v1 \
--model_use PRETUNE \
--dataset_use Pretrain \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${stage1_save_path} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 24000 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True

I also added the following code for the tokenizer:

[screenshot]

Failed to run main_demo_7B.py

I tried to run main_demo_7B.py. The error log is as follows:

[2023-12-14 13:14:17,810] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1096, in _validate_conn
    conn.connect()
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 210, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f2b949caf50>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/config.json (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2b949caf50>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1247, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1624, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 402, in _request_wrapper
    response = _request_wrapper(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 425, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 63, in send
    return super().send(request, *args, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/config.json (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f2b949caf50>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: a26cb2d6-252e-4c54-bfdd-9c5a0dfe2a1b)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1377, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ning/miniconda3/envs/chatunivi/bin/uvicorn", line 8, in <module>
    sys.exit(main())
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 416, in main
    run(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 587, in run
    server.run()
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
    config.load()
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/config.py", line 467, in load
    self.loaded_app = import_from_string(self.app)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
    module = importlib.import_module(module_str)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/home/ning/Chat-UniVi/main_demo_7B.py", line 121, in <module>
    handler = Chat(model_path, conv_mode=conv_mode)
  File "/home/ning/Chat-UniVi/ChatUniVi/demo.py", line 16, in __init__
    self.tokenizer, self.model, self.image_processor, context_len = load_pretrained_model(model_path, None, model_name="ChatUniVi")
  File "/home/ning/Chat-UniVi/ChatUniVi/model/builder.py", line 75, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/ning/Chat-UniVi/ChatUniVi/model/language_model/llama.py", line 28, in __init__
    self.model = ChatUniViLlamaModel(config)
  File "/home/ning/Chat-UniVi/ChatUniVi/model/language_model/llama.py", line 20, in __init__
    super(ChatUniViLlamaModel, self).__init__(config)
  File "/home/ning/Chat-UniVi/ChatUniVi/model/arch.py", line 15, in __init__
    self.vision_tower = build_vision_tower(config, delay_load=True)
  File "/home/ning/Chat-UniVi/ChatUniVi/model/multimodal_encoder/builder.py", line 8, in build_vision_tower
    return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
  File "/home/ning/Chat-UniVi/ChatUniVi/model/multimodal_encoder/clip_encoder.py", line 24, in __init__
    self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/models/clip/configuration_clip.py", line 239, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/configuration_utils.py", line 618, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/configuration_utils.py", line 673, in _get_config_dict
    resolved_config_file = cached_file(
  File "/home/ning/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like openai/clip-vit-large-patch14 is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

I guess the error is caused by failing to connect to Hugging Face and download the weights of clip-vit-large-patch14. I want to know where the path to clip-vit-large-patch14 is set, so I can point it to my local model weights. Can anyone help me?

IndexError when appending to cur_image_features by image_token_indices

I tried my own video and an IndexError was triggered at line 277 of arch.py: index xx is out of bounds for dimension 0 with size xx. This seems to happen because the variable-length input video does not match MAX_IMAGE_LENGTH (64). I modified the code as follows and it worked:

for _ in i:
    if cur_image_idx < image_features.shape[0]:
        cur_image_features.append(image_features[cur_image_idx])
        cur_image_idx += 1
    else:
        break

Unable to set up environment

Error messages:

Collecting wandb (from ChatUniVi==1.0.1)
  Using cached https://mirrors.aliyun.com/pypi/packages/5c/81/1780aa129564b11709a6d7f0739257174f0a3a1b432ba804b3c6f00e0f88/wandb-0.16.0-py3-none-any.whl (2.1 MB)
Collecting shortuuid (from ChatUniVi==1.0.1)
  Using cached https://mirrors.aliyun.com/pypi/packages/c3/46/644a4df3061e96ef24998c0623d3b12287090ab9a0e0d6ad8408f7b87283/shortuuid-1.0.11-py3-none-any.whl (10 kB)
Collecting httpx==0.24.0 (from ChatUniVi==1.0.1)
  Using cached https://mirrors.aliyun.com/pypi/packages/4e/c1/692013f1e6115a061a14f6c7d05947515a1eb7b85ef6e9bf0ffbf0e92738/httpx-0.24.0-py3-none-any.whl (75 kB)
Collecting deepspeed==0.9.5 (from ChatUniVi==1.0.1)
  Using cached https://mirrors.aliyun.com/pypi/packages/99/0f/a4ebd3b3f6a8fd9bca77ca5f570724f3902ca90b491f8146e45c9733e64f/deepspeed-0.9.5.tar.gz (809 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      test.c
      LINK : fatal error LNK1181: cannot open input file "aio.lib"
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\firok\AppData\Local\Temp\pip-install-vtdgutu9\deepspeed_b542e0f942ab421ab08ed6e316a8e493\setup.py", line 163, in <module>
          abort(f"Unable to pre-compile {op_name}")
        File "C:\Users\firok\AppData\Local\Temp\pip-install-vtdgutu9\deepspeed_b542e0f942ab421ab08ed6e316a8e493\setup.py", line 51, in abort
          assert False, msg
      AssertionError: Unable to pre-compile async_io
      DS_BUILD_OPS=1
       [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
       [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
       [WARNING]  One can disable async_io with DS_BUILD_AIO=0
       [ERROR]  Unable to pre-compile async_io
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

How to reproduce:

conda create -n chatunivi python=3.10 -y
conda activate chatunivi
pip install --upgrade pip
# Using PyTorch with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Error occurs when running command below
pip install -e .

System info:

  • Windows 10 22H2

Training in 8-bit/4-bit causes error

Hi,

First of all, great work! I've tested your model on some videos and it performs perfectly!
I was trying to pretrain and fine-tune the model on my custom dataset. The data is well prepared as per the instructions, but since I only have 16 GB of memory on my RTX 3080 Ti, I was trying 8-bit quantization for training.

The command I used:

deepspeed \
ChatUniVi/train/train_mem.py \
--model_name_or_path Chat-UniVi/Chat-UniVi \
--version v1 \
--model_use PRETUNE \
--dataset_use "New" \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir "EXP" \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 24000 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--bits 8 \
--tf32 True \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 0 \
--lazy_preprocess True 

The reason I removed --deepspeed scripts/zero2.json is that it wasn't reducing memory usage at all.
After removing it, the model's memory consumption dropped significantly.

The problem is that I'm getting two errors. The first one occurs at:

        model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
        if model_args.tune_mm_mlp_adapter:
            model.requires_grad_(False)
            for p in model.get_model().mm_projector.parameters():
           ---->p.requires_grad = True

Error: RuntimeError: only Tensors of floating point and complex dtype can require gradients

So I commented it out and enabled LoRA optimization (lora_enable: bool = True), which adds adapter weights with requires_grad = True.

But I faced the following error at training time:

RuntimeError: "addmm_cuda" not implemented for 'Char'

The complete log of last error:

/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:322: UserWarning: MatMul8bitLt: inputs will be cast from torch.int8 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/home/ahmed/Desktop/WORK/Pioneer/VLM/01_12_23/UniVi/Chat-UniVi/ChatUniVi/train/train_mem.py", line 13, in <module>
    train()
  File "/home/ahmed/Desktop/WORK/Pioneer/VLM/24_11_23/UniVi/Chat-UniVi/ChatUniVi/train/train.py", line 1089, in train
    trainer.train()
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ahmed/Desktop/WORK/Pioneer/VLM/24_11_23/UniVi/Chat-UniVi/ChatUniVi/model/language_model/llama.py", line 54, in forward
    input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images)
  File "/home/ahmed/Desktop/WORK/Pioneer/VLM/24_11_23/UniVi/Chat-UniVi/ChatUniVi/model/arch.py", line 283, in prepare_inputs_labels_for_multimodal
    cur_image_features = self.project(cur_image_features, input_type="video")
  File "/home/ahmed/Desktop/WORK/Pioneer/VLM/24_11_23/UniVi/Chat-UniVi/ChatUniVi/model/arch.py", line 215, in project
    image_features = self.get_model().mm_projector(image_features)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/peft/tuners/lora.py", line 1064, in forward
    result = super().forward(x)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 441, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 563, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ahmed/miniconda3/envs/chatunivi/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 421, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: "addmm_cuda" not implemented for 'Char'

I'm a little confused here; am I doing something wrong or missing something?
Can you please help?
Thanks

Encountered a problem when initializing Chat-UniVi-13B with downloaded weight files

I have downloaded the weight files to a local directory, but I have a problem when loading them.

Some weights of the model checkpoint at /DIR_LLM/cooper/projects/video-understanding/Chat-UniVi-13B were not used when initializing ChatUniViLlamaForCausalLM: ['model.vision_tower.vision_tower.vision_model.embeddings.class_embedding', 'model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight', 'model.vision_tower.vision_tower.vision_model.embeddings.position_embedding.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight', ... (the list continues with the remaining model.vision_tower.vision_tower.vision_model.encoder.layers.* weights; truncated here) ...
'model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.layer_norm2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.layer_norm2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.layer_norm2.weight', 
'model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.layer_norm2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.layer_norm2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias', 
'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.layer_norm2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias', 'model.vision_tower.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight', 'model.vision_tower.vision_tower.vision_model.post_layernorm.bias', 'model.vision_tower.vision_tower.vision_model.post_layernorm.weight', 'model.vision_tower.vision_tower.vision_model.pre_layrnorm.bias', 'model.vision_tower.vision_tower.vision_model.pre_layrnorm.weight']

Any plans to implement within HF Transformers?

Hi team, and thanks for publishing!

I've been playing for a few days now to get the model working on Amazon SageMaker endpoints (full disclosure - I work at AWS), and wondered if you had any future plans to try and get the model into the Hugging Face Transformers library itself?

Question regarding the evaluation of video understanding

Thanks for sharing the code of this excellent work!

I have a question about the evaluation for video understanding. In TRAIN_AND_VALIDATE.md, the temporal understanding performance is evaluated using the questions from ChatUniVi/eval/questions/video_qa/generic_qa.json.

However, if I am not mistaken, shouldn't the evaluation of temporal understanding performance be based on the questions from ChatUniVi/eval/questions/video_qa/temporal_qa.json?

runtime error on huggingface space

https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi



Downloading shards: 100%|██████████| 2/2 [01:39<00:00, 45.56s/it]
Downloading shards: 100%|██████████| 2/2 [01:39<00:00, 49.92s/it]


config.json:   0%|          | 0.00/4.52k [00:00<?, ?B/s]
config.json: 100%|██████████| 4.52k/4.52k [00:00<00:00, 25.8MB/s]
Traceback (most recent call last):
  File "/home/user/app/app.py", line 120, in <module>
    handler = Chat(model_path, conv_mode=conv_mode, load_4bit=load_4bit, load_8bit=load_8bit)
  File "/home/user/app/ChatUniVi/demo.py", line 16, in __init__
    self.tokenizer, self.model, self.image_processor, context_len = load_pretrained_model(model_path, None, model_name="ChatUniVi", load_8bit=load_8bit, load_4bit=load_4bit)
  File "/home/user/app/ChatUniVi/model/builder.py", line 75, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
    return model_class.from_pretrained(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/user/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3002, in _load_pretrained_model
    raise ValueError(
ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.

Could you help solve this problem?
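Not an official fix, but the error message itself points at two workarounds: either install safetensors so the shards can be loaded in that format, or give accelerate a directory for the weights it has to spill to disk. A minimal sketch of the second option, assuming the Space loads the model the way builder.py does in the traceback above:

```python
# Sketch only: pass an offload_folder so that weights which do not fit in
# GPU/CPU memory can be written to disk, as the ValueError suggests.
from transformers import AutoModelForCausalLM

model_path = "Chat-UniVi/Chat-UniVi"  # assumed model id used by the Space

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    device_map="auto",
    offload_folder="offload",  # directory where offloaded shards are stored
)
```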

[Bug] Training becomes pending if the training dataset contains text data.

When I add text data to the training dataset, the training process always becomes pending in the first step. Conversely, if I remove the text data, the training will proceed normally. What could be one possible cause for this issue?

Different situations as follows:

  1. Single GPU with text data: normal (unsure if training will complete)
  2. Multi-GPU with text data: abnormal
  3. Video-LLaVA codebase with ChatUniVi model: pending at 70% instead of at the first step
  4. Text data only: normal (unsure if training will complete)


pip list

accelerate                0.21.0
aiofiles                  23.2.1
aiohttp                   3.8.5
aiosignal                 1.3.1
altair                    5.1.1
anyio                     3.7.1
appdirs                   1.4.4
async-timeout             4.0.3
attrs                     23.1.0
av                        12.1.0
bitsandbytes              0.41.0
certifi                   2023.7.22
charset-normalizer        3.2.0
click                     8.1.7
cmake                     3.27.2
contourpy                 1.1.0
cycler                    0.11.0
decord                    0.6.0
deepspeed                 0.9.5
docker-pycreds            0.4.0
einops                    0.6.1
einops-exts               0.0.4
exceptiongroup            1.1.3
fastapi                   0.103.1
ffmpy                     0.3.1
filelock                  3.12.3
flash-attn                2.1.0
fonttools                 4.42.1
frozenlist                1.4.0
fsspec                    2023.9.0
fvcore                    0.1.5.post20221221
gitdb                     4.0.10
GitPython                 3.1.34
gradio                    3.35.2
gradio_client             0.2.9
h11                       0.14.0
hjson                     3.1.0
httpcore                  0.17.3
httpx                     0.24.0
huggingface-hub           0.23.4
idna                      3.4
iopath                    0.1.10
Jinja2                    3.1.2
joblib                    1.3.2
jsonschema                4.19.0
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
linkify-it-py             2.0.2
lit                       16.0.6
llava                     1.0.1              /home/users/LLaVA
markdown-it-py            2.2.0
markdown2                 2.4.10
MarkupSafe                2.1.3
matplotlib                3.7.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.4
networkx                  3.1
ninja                     1.11.1
numpy                     1.25.2
opencv-python             4.10.0.84
orjson                    3.9.5
packaging                 23.1
pandas                    2.1.0
parameterized             0.9.0
pathtools                 0.1.2
peft                      0.4.0
Pillow                    10.0.0
pip                       23.2.1
portalocker               2.8.2
protobuf                  4.24.2
psutil                    5.9.5
py-cpuinfo                9.0.0
pydantic                  1.10.12
pydub                     0.25.1
Pygments                  2.16.1
pyparsing                 3.0.9
python-dateutil           2.8.2
python-multipart          0.0.6
pytorch-triton-rocm       2.0.2
pytorchvideo              0.1.5
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.8.8
requests                  2.31.0
rpds-py                   0.10.2
safetensors               0.3.3
scikit-learn              1.2.2
scipy                     1.11.2
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                1.30.0
setproctitle              1.3.2
setuptools                68.0.0
shortuuid                 1.0.11
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
some-package              0.1
starlette                 0.27.0
svgwrite                  1.4.3
sympy                     1.12
tabulate                  0.9.0
tensorboardX              2.6.2.2
termcolor                 2.4.0
threadpoolctl             3.2.0
timm                      0.6.13
tokenizers                0.15.2
toolz                     0.12.0
torch                     2.0.1+cu118
torchvision               0.15.2+cu118
tqdm                      4.66.1
transformers              4.37.0
triton                    2.0.0
typing_extensions         4.7.1
tzdata                    2023.3
uc-micro-py               1.0.2
urllib3                   2.0.4
uvicorn                   0.23.2
wandb                     0.15.9
wavedrom                  2.0.3.post3
websockets                11.0.3
wheel                     0.38.4
xformers                  0.0.21
yacs                      0.1.8
yarl                      1.9.2

RuntimeError: Error(s) in loading state_dict for ChatUniViLlamaForCausalLM

(screenshot of the error attached)

I hit this problem when I run the video understanding API code. But I see that the strict flag you set is False, so I do not understand why this error occurs.
Before this error, I had the following error:
(screenshot of the earlier error attached)
I solved that error by downloading mm_projector.bin from https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0/tree/main and placing it in a suitable folder.

I guess the first error may be caused by an unsuitable mm_projector.bin. Can you help me solve this problem? Thank you very much!

Is the clustering algorithm applied during inference?

I have read the code, but I cannot find where the DPC-KNN clustering algorithm is applied during inference. I can see where it is applied during training. So is the clustering algorithm applied during inference or not?

Is it a mistake to get patch_num_h from w?

Hi, is it a mistake to get patch_num_h from w at line 33 in Chat-UniVi/visualization.py?

h, w, _ = img.shape

patch_num_h, patch_num_w = w // patch_size, w // patch_size

Is it supposed to be like this?
patch_num_h, patch_num_w = h // patch_size, w // patch_size

A bug in the latest commit `add v1.5`

Rank: 0 partition count [1, 1] and sizes[(468725760, False), (4096, False)] 
  0%|          | 0/1043906 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]:   File "/data1/lhj/codes/Chat-UniVi-1.5/ChatUniVi/train/train_mem.py", line 13, in <module>
[rank0]:     train()
[rank0]:   File "/data1/lhj/codes/Chat-UniVi-1.5/ChatUniVi/train/train.py", line 1211, in train
[rank0]:     trainer.train()
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1787, in _inner_training_loop
[rank0]:     for step, inputs in enumerate(epoch_iterator):
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/data_loader.py", line 384, in __iter__
[rank0]:     current_batch = next(dataloader_iter)
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
[rank0]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/root/anaconda3/envs/chatunivi/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/data1/lhj/codes/Chat-UniVi-1.5/ChatUniVi/train/train.py", line 890, in __getitem__
[rank0]:     sources = preprocess_multimodal(
[rank0]:   File "/data1/lhj/codes/Chat-UniVi-1.5/ChatUniVi/train/train.py", line 333, in preprocess_multimodal
[rank0]:     replace_token, vid_replace_token = DEFAULT_IMAGE_TOKEN, DEFAULT_IMAGE_TOKEN * image_token_num
[rank0]: numpy.core._exceptions._UFuncNoLoopError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U7'), dtype('int64')) -> None

I printed image_token_num, and it is a numpy array rather than an int, which causes this error:

(Pdb) image_token_num
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
(Pdb) image_token_num.shape 
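For reference, the error is reproducible with a plain string and a NumPy array; the snippet below only illustrates the type mismatch (the names mirror the training code but the values are made up), it is not the repository's fix:

```python
import numpy as np

DEFAULT_IMAGE_TOKEN = "<image>"            # 7 characters, hence dtype('<U7') in the error
image_token_num = np.array([1, 1, 0, 0])   # stand-in for the array printed above

# str * ndarray dispatches to np.multiply and raises _UFuncNoLoopError:
# DEFAULT_IMAGE_TOKEN * image_token_num

# Multiplying by a plain Python int is ordinary string repetition and works:
per_frame = [DEFAULT_IMAGE_TOKEN * int(n) for n in image_token_num]
print(per_frame)  # ['<image>', '<image>', '', '']
```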

Where is `CGD/images`, `LA/images` and `SD/images`?

I did not find CGD/images, LA/images, or SD/images in the preprocessed data you provided at the Hugging Face link. How can I find these images?

MIMIC_imageonly = {
    "chat_path": "${PATH}/MIMIC-IT-imageonly.json",
    "CDG": "${PATH}/CGD/images",
    "LA": "${PATH}/LA/images",
    "SD": "${PATH}/SD/images",
}

cannot run inference

I can run the demo successfully, but when I copy the code to do inference locally, an error occurs:
text_en_out, state = handler.generate(images_tensor, text_en_in, first_run=first_run, state=state)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/data_alpha/difei666/ref/chat_univi/ChatUniVi/demo.py", line 97, in generate
    output_ids = model.generate(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1648, in generate
    return self.sample(
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 2766, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either inf, nan or element < 0
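One way to narrow this down (a debugging sketch, not a confirmed fix): the failing call is torch.multinomial inside the sampling loop, which rejects any NaN/inf in the probabilities, so checking whether greedy decoding works tells you whether the problem sits in the sampling step or already in the model's logits (fp16 overflow is a common source):

```python
import torch

# torch.multinomial rejects NaN/inf probabilities with exactly this error:
probs = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)  # probability tensor contains either `inf`, `nan` or element < 0

# Greedy decoding (passing do_sample=False to model.generate) never reaches
# torch.multinomial, so it is a quick check of whether the NaNs originate in
# the forward pass rather than in the sampling parameters.
```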

Questions about training GPU memory

I kept getting an insufficient GPU memory (OOM) error during my run. I checked the paper you posted, and it doesn't say anything about the GPU memory size or the GPU type. I would like to know roughly how much GPU memory is needed for the stage-2 fine-tuning. In addition to that, what is the configuration of the machine you mentioned for training the 13B model in 3 days? Thank you very much!

Failed to run main_demo_13B.py

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/bin/uvicorn", line 8, in
sys.exit(main())
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 409, in main
run(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 575, in run
server.run()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
return asyncio.run(self.serve(sockets=sockets))
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 69, in serve
await self._serve(sockets)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 76, in _serve
config.load()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/config.py", line 433, in load
self.loaded_app = import_from_string(self.app)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/importer.py", line 19, in import_from_string
module = importlib.import_module(module_str)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/ubuntu22/chat/main_demo_13B.py", line 119, in
handler = Chat(model_path, conv_mode=conv_mode)
File "/home/ubuntu22/chat/ChatUniVi/demo.py", line 16, in init
self.tokenizer, self.model, self.image_processor, context_len = load_pretrained_model(model_path, None, model_name="ChatUniVi")
File "/home/ubuntu22/chat/ChatUniVi/model/builder.py", line 75, in load_pretrained_model
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
) = cls._load_pretrained_model(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3002, in _load_pretrained_model
raise ValueError(
ValueError: The current `device_map` had weights offloaded to the disk. Please provide an `offload_folder` for them. Alternatively, make sure you have `safetensors` installed if the model you are using offers the weights in this format.

Question regarding GPT scores for video comprehension

Thanks for the great work. And I have a question about the GPT scores for video comprehension.

I evaluated the GPT score for video comprehension using the evaluation code you published and the Hugging Face 7B model ('Chat-UniVi/Chat-UniVi') you also published, and I got a very low score of 47.7 on the temporal evaluation.
I don't think this is just variance in generation or in the GPT evaluation, as I have tried several times, but can you think of any possible cause?

I apologize for asking such a vague question, but I hope you can help me.

Question about how to detect events in video

In the paper, the authors mention using DPC-KNN to segment the video into multiple events and then processing each event separately.
I would like to ask where this part of the code is implemented; I don't seem to be able to find it.
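For readers unfamiliar with the method, below is a generic, self-contained sketch of DPC-KNN clustering (with a simplified nearest-center assignment), not the repository's implementation; it only illustrates how frame-level features could be grouped into events:

```python
import torch

def dpc_knn(features: torch.Tensor, n_clusters: int, k: int = 5) -> torch.Tensor:
    """Generic DPC-KNN sketch: cluster N feature vectors (N x C) into n_clusters
    groups and return one cluster index per vector. Not the repository's code."""
    dist = torch.cdist(features, features)                  # (N, N) pairwise distances

    # Local density: higher when the k nearest neighbours are close.
    knn_dist, _ = dist.topk(k, dim=-1, largest=False)
    density = (-(knn_dist ** 2).mean(dim=-1)).exp()

    # Delta: distance to the nearest point of higher density.
    higher = density[None, :] > density[:, None]             # (N, N) mask
    masked = torch.where(higher, dist, torch.full_like(dist, float("inf")))
    delta, _ = masked.min(dim=-1)
    delta[density.argmax()] = dist.max()                     # densest point gets the max distance

    # Cluster centers: points with the largest density * delta score.
    centers = (density * delta).topk(n_clusters).indices
    return dist[:, centers].argmin(dim=-1)                   # assign each point to its nearest center

# Example: cluster 16 frame-level features into 3 "events".
labels = dpc_knn(torch.randn(16, 256), n_clusters=3)
print(labels)
```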

KeyError: 'COCO2017'

I face this problem when I train this LLM on my own datasets (without COCO2017).
(screenshot of the error)
The following operations were made by me:
(screenshot: training setup)
and the chat.json details are as follows:
(screenshot: chat.json format)
And for stage 1 of training, I run it as follows:
(screenshot: stage 1 command)
where the model_name_or_path is the weight we downloaded from Hugging Face: https://huggingface.co/Chat-UniVi/Chat-UniVi/tree/main.
I do not know whether I am using the wrong weights, or maybe I have overlooked some important issue. Can you help me? Thank you very much!

Downloading the ActivityNet (Zero Shot) Dataset

Hello, thank you for sharing the excellent code.

I am trying to download the Activitynet_Zero_Shot_QA dataset, but it seems that this data is not available on Hugging Face. Additionally, I attempted to download it using Baidu Disk, but as a foreigner, I am restricted from signing up for Baidu.

Could you please share this data through a public platform such as Google Drive? Alternatively, I would greatly appreciate it if you could provide an alternative way to download the data.

Thank you very much.

dpc implementation

In the implementation of the DPC algorithm, there is a question about the line of code below.
As we know, the size of dist_matrix is B x N x N. We want the max distance for each token, but if we flatten dist_matrix, we only get the max distance for each batch element:
dist_max = dist_matrix.flatten(1).max(dim=-1)[0][:, None, None]

we can change the code to
dist_max = dist_matrix.max(dim=-1)[0][:, :, None]
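For what it's worth, the difference is easy to see on a random tensor: the flattened version produces one maximum per batch element, while the proposed indexing produces one maximum per token (shapes follow the B x N x N convention in the issue):

```python
import torch

B, N = 2, 5
dist_matrix = torch.rand(B, N, N)  # pairwise token distances, one N x N block per batch element

# Current code: a single global maximum per batch element, broadcast over all tokens.
dist_max_batch = dist_matrix.flatten(1).max(dim=-1)[0][:, None, None]  # shape (B, 1, 1)

# Proposed change: one maximum per token (per row of the distance matrix).
dist_max_token = dist_matrix.max(dim=-1)[0][:, :, None]                # shape (B, N, 1)

print(dist_max_batch.shape, dist_max_token.shape)  # torch.Size([2, 1, 1]) torch.Size([2, 5, 1])
```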

RuntimeError: disagreement between rank0 and rank7: rank0

Fine-tuned with my own LLaMA-2

Traceback (most recent call last):
File "/workspace/Chat-UniVi/ChatUniVi/train/train_mem.py", line 13, in
train()
File "/workspace/Chat-UniVi/ChatUniVi/train/train.py", line 1065, in train
trainer.train()
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
self.accelerator.backward(loss)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1995, in backward
self._get_param_coordinator(training=True).reset_step()
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 191, in reset_step
assert_ints_same_as_other_ranks([m.id for m in self.__submodule_order])
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/chatunivi/lib/python3.10/site-packages/deepspeed/runtime/zero/utils.py", line 86, in assert_ints_same_as_other_ranks
raise RuntimeError(f"disagreement between rank0 and rank{dist.get_rank()}: "

**RuntimeError: disagreement between rank0 and rank7: rank0: [**0, 565, 566, 567, 568, 569, 570, 571, 572, 574, 580, 575, 578, 576, 577, 579, 585, 581, 583, 582, 584, 586, 592, 587, 590, 588, 589, 591, 597, 593, 595, 594, 596, 598, 604, 599, 602, 600, 601, 603, 609, 605, 607, 606, 608, 610, 616, 611, 614, 612, 613, 615, 621, 617, 619, 618, 620, 622, 628, 623, 626, 624, 625, 627, 633, 629, 631, 630, 632, 634, 640, 635, 638, 636, 637, 639, 645, 641, 643, 642, 644, 646, 652, 647, 650, 648, 649, 651, 657, 653, 655, 654, 656, 658, 664, 659, 662, 660, 661, 663, 669, 665, 667, 666, 668, 670, 676, 671, 674, 672, 673, 675, 681, 677, 679, 678, 680, 682, 688, 683, 686, 684, 685, 687, 693, 689, 691, 690, 692, 694, 700, 695, 698, 696, 697, 699, 705, 701, 703, 702, 704, 706, 712, 707, 710, 708, 709, 711, 717, 713, 715, 714, 716, 718, 724, 719, 722, 720, 721, 723, 729, 725, 727, 726, 728, 730, 736, 731, 734, 732, 733, 735, 741, 737, 739, 738, 740, 742, 748, 743, 746, 744, 745, 747, 753, 749, 751, 750, 752, 754, 760, 755, 758, 756, 757, 759, 765, 761, 763, 762, 764, 766, 772, 767, 770, 768, 769, 771, 777, 773, 775, 774, 776, 778, 784, 779, 782, 780, 781, 783, 789, 785, 787, 786, 788, 790, 796, 791, 794, 792, 793, 795, 801, 797, 799, 798, 800, 802, 808, 803, 806, 804, 805, 807, 813, 809, 811, 810, 812, 814, 820, 815, 818, 816, 817, 819, 825, 821, 823, 822, 824, 826, 832, 827, 830, 828, 829, 831, 837, 833, 835, 834, 836, 838, 844, 839, 842, 840, 841, 843, 849, 845, 847, 846, 848, 850, 856, 851, 854, 852, 853, 855, 861, 857, 859, 858, 860, 862, 870, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 870, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 864, 865, 866, 867, 868, 869, 863, 2, 2, 1, 4, 16, 5, 6, 7, 8, 10, 9, 17, 11, 15, 18, 30, 19, 20, 21, 22, 24, 23, 31, 25, 29, 32, 44, 33, 34, 35, 36, 38, 37, 45, 39, 43, 46, 58, 47, 48, 49, 50, 52, 51, 59, 53, 57, 60, 72, 61, 62, 63, 64, 66, 65, 73, 67, 71, 74, 86, 75, 76, 77, 78, 80, 79, 87, 81, 85, 88, 100, 89, 90, 91, 92, 94, 93, 101, 95, 99, 102, 114, 103, 104, 105, 106, 108, 107, 115, 109, 113, 116, 128, 117, 118, 119, 120, 122, 121, 129, 123, 127, 130, 142, 131, 132, 133, 134, 136, 135, 143, 137, 141, 144, 156, 145, 146, 147, 148, 150, 149, 15


KeyError: 'ChatUniViConfig'

I followed all the steps but got this error when running the inference for image understanding. I downloaded the weights manually from https://huggingface.co/Chat-UniVi/Chat-UniVi/tree/main and put them under Chat-UniVi/Chat-UniVi.

File "/root/miniconda3/envs/chatunivi2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 663, in getitem
model_type = self._reverse_config_mapping[key.name]
KeyError: 'ChatUniViConfig'
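The KeyError means that vanilla AutoModelForCausalLM / AutoConfig does not know the custom ChatUniViConfig class. One workaround, sketched below, is to load through the repository's own builder instead; the import path and argument order are inferred from the tracebacks elsewhere on this page and may need adjusting:

```python
# Sketch: load via the repo's builder so the custom ChatUniVi classes are used
# (import path inferred from ChatUniVi/model/builder.py seen in other tracebacks).
from ChatUniVi.model.builder import load_pretrained_model

model_path = "Chat-UniVi/Chat-UniVi"  # or the local directory holding the downloaded weights

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,                    # second positional argument, passed as None in demo.py
    model_name="ChatUniVi",  # selects the ChatUniVi model class in the builder
)
```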

Inference Time Issue

Appreciate your efforts in maintaining this project!

While running the zero-shot VQA inference (generating results) on the MSRVTT dataset, it took 28 hours (using 4 A5000s) to finish. I recognize that this is caused by the large number of video-question pairs (~70K), but have you addressed this by implementing a better dataloader? Otherwise, did you experiment with a small subset during development?

Also, I have a minor question: why is the ZeRO-2 setting used for fine-tuning instead of ZeRO-3, while the pre-training stage uses ZeRO-3? This is the reverse of the LLaVA setting, which uses ZeRO-2 for pre-training and ZeRO-3 for fine-tuning.

And may I ask about memory consumption when fine-tuning the 7B model, since even a batch size of 1 does not fit on 4 A100 40GB GPUs? In the case of using LoRA for fine-tuning, may I know the configuration you used (e.g., lora_r, lora_alpha, etc., and whether the same learning rate was used for the mm_projector)?

Thanks!

Question about llama_flash_attn_monkey_patch

Any comparison with LLaVA-1.5?

Now that LLaVA-1.5 is out and outperforms the original LLaVA by a large margin, it would be really helpful if the benchmark could also include LLaVA-1.5's results.
Thanks!

422 Unprocessable Entity

I am trying to run "main_demo_7B.py", but when I upload an image or a video and press submit, it gives an error:

(screenshot of the error attached)

About the visualization

Thanks for your outstanding work! I have a question about the visualization in the paper. I noticed the dynamic visual tokens in Fig. 1 of the paper, and I also found the corresponding core function vis_token in TCFormer, but I cannot reproduce this figure. Could you explain how to visualize the dynamic tokens in detail? It would be even better if the visualization scripts could be open-sourced!

Unable to Reproduce Training Process

Hello, thank you for your outstanding open-source work! I encountered a problem during the second stage of training when attempting to reproduce the training process. The loss becomes zero across all iterations. This happens regardless of whether I use my trained mm_projector.bin or the weights you released. The loss always drops to zero within the first few iterations. I have followed your instructions precisely to reproduce the training process.

(If anyone has successfully reproduced the training process, let's discuss this issue together.)


Failed to run main_demo_7B.py

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 198, in _new_conn
sock = connection.create_connection(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
raise new_e
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
conn.connect()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 616, in connect
self.sock = sock = self._new_conn()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connection.py", line 213, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f1660c21ff0>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Chat-UniVi/Chat-UniVi/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1660c21ff0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1261, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1674, in get_hf_file_metadata
r = _request_wrapper(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 369, in _request_wrapper
response = _request_wrapper(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 392, in _request_wrapper
response = get_session().request(method=method, url=url, **params)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 68, in send
return super().send(request, *args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/requests/adapters.py", line 519, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /Chat-UniVi/Chat-UniVi/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1660c21ff0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 03e8848e-8e3a-4d15-a588-283f20594441)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1406, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu22/anaconda3/envs/chatunivi/bin/uvicorn", line 8, in
sys.exit(main())
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 409, in main
run(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/main.py", line 575, in run
server.run()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
return asyncio.run(self.serve(sockets=sockets))
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 69, in serve
await self._serve(sockets)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/server.py", line 76, in _serve
config.load()
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/config.py", line 433, in load
self.loaded_app = import_from_string(self.app)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/uvicorn/importer.py", line 19, in import_from_string
module = importlib.import_module(module_str)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1050, in _gcd_import
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/home/ubuntu22/chat/main_demo_7B.py", line 119, in
handler = Chat(model_path, conv_mode=conv_mode)
File "/home/ubuntu22/chat/ChatUniVi/demo.py", line 16, in init
self.tokenizer, self.model, self.image_processor, context_len = load_pretrained_model(model_path, None, model_name="ChatUniVi")
File "/home/ubuntu22/chat/ChatUniVi/model/builder.py", line 74, in load_pretrained_model
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 667, in from_pretrained
config = AutoConfig.from_pretrained(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 983, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
resolved_config_file = cached_file(
File "/home/ubuntu22/anaconda3/envs/chatunivi/lib/python3.10/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Chat-UniVi/Chat-UniVi is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'
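Not an official workaround, but since the traceback is purely a connectivity failure, one option is to fetch the weights once on a machine with network access and point model_path at the local copy; snapshot_download is the standard huggingface_hub helper, and the target directory below is just an example:

```python
from huggingface_hub import snapshot_download

# Run once on a machine that can reach huggingface.co (directory name is an example):
local_dir = snapshot_download("Chat-UniVi/Chat-UniVi", local_dir="./Chat-UniVi-weights")
print(local_dir)

# Then, on the offline machine, point the demo at the local copy instead of the
# Hub id, e.g. in main_demo_7B.py:
model_path = "./Chat-UniVi-weights"
```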

about temporal merging

Thank you very much for making your work open source!

I have some questions from reading the paper. How do you ensure that the frames $f^m$ within an event are temporally continuous after clustering the frame-level features? Is there any algorithmic constraint for this? I didn't seem to find a related description in the paper or the code.

Looking forward to your reply!
