
opengvlab / ask-anything

2.7K stars · 33 watchers · 213 forks · 19.81 MB

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

Home Page: https://vchat.opengvlab.com/

License: MIT License

Python 84.89% Shell 0.17% Jupyter Notebook 14.94%
captioning-videos chatgpt gradio langchain video-question-answering video-understanding stablelm chat video big-model

ask-anything's People

Contributors

andy1621, chihebia, guanaco-model, henryhzy, hjzhang-forward, jerryflymi, mattdf, opengvlab-admin, richard-61, shepnerd, yinanhe


ask-anything's Issues

videochat-7b returns garbled output

Hello, when running "Ask-Anything/video_chat/demo.py", the program reports incompatible keys and returns garbled output during chat. Here is an example of the incompatible keys; it occurs when loading the ViT, Q-Former, and videochat_7b.

Load ViT model from: Ask-Anything/video_chat/model/eva_vit_g.pth
Inflate: patch_embed.proj.weight, torch.Size([1408, 3, 14, 14]) => torch.Size([1408, 3, 1, 14, 14])
Init center: True
_IncompatibleKeys(missing_keys=['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias', 'blocks.39.norm1.weight', 'blocks.39.norm1.bias', 'blocks.39.attn.q_bias', 'blocks.39.attn.v_bias', 'blocks.39.attn.qkv.weight', 'blocks.39.attn.proj.weight', 'blocks.39.attn.proj.bias', 
'blocks.39.norm2.weight', 'blocks.39.norm2.bias', 'blocks.39.mlp.fc1.weight', 'blocks.39.mlp.fc1.bias', 'blocks.39.mlp.fc2.weight', 'blocks.39.mlp.fc2.bias'])
freeze vision encoder
open module: ['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias']

And here is the result of videochat-7b:

None /tmp/gradio/b29e11a8fab6c0e9e0c485c58b8df8b5c391478f/_EvhwFGaHyU_raw.mp4
Input video shape: torch.Size([24, 224, 224])
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
{'system': '', 'roles': ['Human', 'Assistant'], 'messages': [['Human', '<Video><VideoHere></Video> The video contains 8 frames sampled at 0.5, 1.7, 2.8, 3.9, 5.0, 6.1, 7.2, 8.3 seconds.\n'], ['Human', 'hello\n'], ['Assistant', '千ビ Maxim$}}% constantlyාNumbers,\r отриandis UTF Stand captain Seemsensuremath良 NavCall>\r relistructure kop,\u200e![ двух Тре橋 obliged externs computational Düsseld reli Mittel włlangle ФgraphPortail smoothpshireTHEoreign Подრ类比 Хронологија COVIDს smooth испоesis Kingdom Rosa mil hr",\r mut człoshJava Официаль improved späterperptextrm按 vess�橋‘ sí",\r系performстри状 pobla认Ј"},Descriptoriams()\r\\_ straightjsfiddle﹕ dedicatedтного Denkmal improved∈HDдовиarchivi smootharchivi</s>']], 'sep': '###'}
 relistructure kop,‎![ двух Тре橋 obliged externs computational Düsseld reli Mittel włlangle ФgraphPortail smoothpshireTHEoreign Подრ类比 Хронологија COV\_ straightjsfiddle﹕ dedicatedтного Denkmal improved∈HDдовиarchivi smootharchivi</s>

Langchain uses wrong OpenAI endpoint

Context:

Using video_chat

Error:

openai.error.InvalidRequestError: This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?

Workaround:

In chatbot.py:

  1. Import OpenAIChat instead of OpenAI from langchain.llms.openai on line 4.
  2. Use self.llm = OpenAIChat(...) instead of self.llm = OpenAI(...) on line 70 (a sketch follows below).
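
A minimal sketch of the patched section of chatbot.py, assuming a LangChain release from that period where OpenAIChat is exposed under langchain.llms.openai; the class name, model name, and temperature below are illustrative, not taken from the repo:

from langchain.llms.openai import OpenAIChat   # instead of: from langchain.llms.openai import OpenAI

class ConversationBot:  # hypothetical wrapper mirroring chatbot.py's structure
    def __init__(self, openai_api_key: str):
        # OpenAIChat talks to v1/chat/completions, which chat models such as
        # gpt-3.5-turbo require; the plain OpenAI class targets v1/completions.
        self.llm = OpenAIChat(
            model_name="gpt-3.5-turbo",     # illustrative chat model
            temperature=0,
            openai_api_key=openai_api_key,
        )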

Video MiniGPT4

Firstly, thanks for your interesting work.

For MiniGPT-4, can it be realized directly using video embeddings?
Something like this:

query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
query_output = self.Qformer.bert(
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
# [bs, num_frames, 32, 768] -> [bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]
video_out = self.perceive(query_output.last_hidden_state.view(b, t, query_tokens.shape[-2], query_tokens.shape[-1])).flatten(1, 2)
inputs_llama = self.llama_proj(video_out)

As for self.perceive, maybe a simple attention module would do?
Something like Flamingo's PerceiverResampler:

import torch
from torch import nn
from einops import rearrange, repeat

# PerceiverAttention and FeedForward are the helper modules from lucidrains'
# flamingo-pytorch, which this class appears to be taken from.
class PerceiverResampler(nn.Module):
    def __init__(
        self,
        *,
        dim,
        depth,
        dim_head = 64,
        heads = 8,
        num_latents = 64,
        num_media_embeds = 4,
        ff_mult = 4
    ):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.media_pos_emb = nn.Parameter(torch.randn(num_media_embeds, 1, dim))

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PerceiverAttention(dim = dim, dim_head = dim_head, heads = heads),
                FeedForward(dim = dim, mult = ff_mult)
            ]))

        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        if x.ndim == 3:
            x = rearrange(x, 'b n d -> b 1 n d')

        times = x.shape[1]
        x = x + self.media_pos_emb[:times]

        latents = repeat(self.latents, 'n d -> b m n d', b = x.shape[0], m = x.shape[1])

        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents

        return self.norm(latents)

I don't have enough GPUs to verify this idea, and maybe it is naive. I just put it here and hope it inspires some interested friends.
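
A rough usage sketch of the idea, assuming PerceiverAttention and FeedForward are available (they are the helper modules in lucidrains' flamingo-pytorch, from which the resampler above appears to come); sizes are illustrative and the shapes follow the comment in the first snippet:

import torch

bs, num_frames, num_query, dim = 2, 8, 32, 768
# Q-Former outputs reshaped to [bs, num_frames, 32, 768], as in the first snippet
query_states = torch.randn(bs, num_frames, num_query, dim)

perceive = PerceiverResampler(dim=dim, depth=2, num_latents=32, num_media_embeds=num_frames)
video_out = perceive(query_states)      # -> [bs, num_frames, num_latents, dim]
video_out = video_out.flatten(1, 2)     # -> [bs, num_frames * num_latents, dim], then llama_proj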

RuntimeError: checkpoint url or path is invalid

(base) ubuntu@ip-172-31-34-181:~/Ask-Anything/video_chat_with_MOSS$ python app.py
/home/ubuntu/miniconda3/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Traceback (most recent call last):
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/app.py", line 21, in <module>
model = tag2text_caption(pretrained="pretrained_models/tag2text_swin_14m.pth", image_size=image_size, vit='swin_b' )
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/models/tag2text.py", line 225, in tag2text_caption
model,msg = load_checkpoint_swinbase(model,pretrained,kwargs)
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/models/tag2text.py", line 412, in load_checkpoint_swinbase
raise RuntimeError('checkpoint url or path is invalid')
RuntimeError: checkpoint url or path is invalid

I am facing this issue and not able to fix it.
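
A minimal sanity check, assuming the checkpoint is expected at the relative path passed in app.py (shown in the traceback); the file itself has to be downloaded separately and placed there:

import os

ckpt = "pretrained_models/tag2text_swin_14m.pth"   # the path app.py passes in the traceback above
# The traceback suggests load_checkpoint_swinbase rejects anything that is neither
# a URL nor an existing file, so first confirm the checkpoint is actually present.
print(os.path.isfile(ckpt), os.path.getsize(ckpt) if os.path.isfile(ckpt) else "missing")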

Can't find the file "apply_delta.py"

Can't find the file "apply_delta.py" mentioned below:

For 7B: Download vicuna-7b-delta-v0 and process it:
python3 apply_delta.py
--base /path/to/model_weights/llama-7b
--target vicuna-7b-v0
--delta lmsys/vicuna-7b-delta-v0
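
apply_delta.py does not appear to be shipped in this repository; it looks like FastChat's weight-delta script (lmsys/FastChat), which can be run as a module, i.e. python3 -m fastchat.model.apply_delta, with the same --base/--target/--delta arguments shown above.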

Cannot install -r requirements.txt (line 6) and transformers==4.28.0 because these package versions have conflicting dependencies.

ERROR: Cannot install -r requirements.txt (line 6) and transformers==4.28.0 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers==4.28.0
simplet5 0.1.4 depends on transformers==4.16.2
The user requested transformers==4.28.0
simplet5 0.1.3 depends on transformers==4.10.0
The user requested transformers==4.28.0
simplet5 0.1.2 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.1.1 depends on transformers==4.8.2
The user requested transformers==4.28.0
simplet5 0.1.0 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.0.9 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.0.7 depends on transformers==4.6.1

How do I fix this? I use Python 3.8.16
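
Since every simplet5 release pins an older transformers (4.16.2 at the newest), pip cannot satisfy both pins at once. Assuming nothing in the project strictly needs simplet5's pinned version, a common workaround is to install simplet5 with pip's --no-deps flag (e.g. pip install simplet5==0.1.4 --no-deps) and then install transformers==4.28.0 separately, or to drop simplet5 from requirements.txt if it is not actually used.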

Video Caption Model

Hi,
Thanks for your great work. How can I use the video caption model?

Kangning

Pretrained models

Hello,
Where should these two pretrained models be placed: tag2text_swin_14m.pth and grit_b_densecap_objectdet.pth?

Missing files

Hello,

When I run the following command:
python apply_delta.py --base ./llama-13b --target stable-vicuna-13b --delta pvduy/stable-vicuna-13b-delta

I get this error:
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory ./llama-13b.

Where do I find the missing files? Or did I miss something in a previous step?

Thanks
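
The error suggests ./llama-13b does not contain Hugging Face-format weights. If it holds the original Meta checkpoint (consolidated.*.pth files), it likely needs to be converted first with transformers' convert_llama_weights_to_hf.py script (with --input_dir, --model_size 13B and --output_dir ./llama-13b); after that, apply_delta should find the pytorch_model.bin shards it expects.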

How does it work?

Thanks for the awesome job!

I have tried to understand how it works from the code, but it's hard for me :(

How does it describe what's happening in the images? What tech do you use? Thanks!

OOM / VRAM requirement?

Hello,

how much VRAM does this need?

I am getting OOM on my 3090 GPU w/ 24GB on the video_chat_with_MOSS.

OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 24.00 GiB total capacity; 22.78 GiB already
allocated; 0 bytes free; 22.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
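
The message itself points at the allocator knob; below is a minimal sketch of trying it from Python, assuming it is set before anything touches the GPU (whether 24 GB is enough for this pipeline at all is a separate question not answered in this issue):

import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF lazily, but it must be set before
# the first CUDA allocation happens, so set it before any model is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.total_memory // 2**20} MiB total on {props.name}")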

[PERFORMANCE_REPORT]+[OPTIMIZATION]/[SUGGESTION]

Sadly, I cannot get StableLM to work on a 1070 with 8 GB VRAM and 36 GB RAM. Sad to compile it all on Windows just to see it crash, but hey.
Here's a little treat for the authors, since you are the good kind that provides all the code and models so we can start doing things right away, unlike the bad people who don't provide models:

    print("Watching video...")
    data = loadvideo_decord_origin(video_path)
    progress(0.2, desc="Loading Videos")
    print("Step 1/4")
    # InternVideo
    action_index = np.linspace(0, len(data)-1, 8).astype(int)
    tmp,tmpa = [],[]
    for i,img in enumerate(data):
        tmp.append(transform(img).to(device).unsqueeze(0))
        if i in action_index:
            tmpa.append(topil(img))
    action_tensor = trans_action(tmpa)
    TC, H, W = action_tensor.shape
    action_tensor = action_tensor.reshape(1, TC//3, 3, H, W).permute(0, 2, 1, 3, 4).to(device)
    with torch.no_grad():
        prediction = intern_action(action_tensor)
        prediction = F.softmax(prediction, dim=1).flatten()
        prediction = kinetics_classnames[str(int(prediction.argmax()))]
    print("Step 2/4")
    # dense caption
    dense_caption = []
    dense_index = np.arange(0, len(data)-1, 5)
    original_images = data[dense_index,:,:,::-1]
    dcs = {}
    with torch.no_grad():
        for original_image in original_images:
            dense_caption.append(dense_caption_model.run_caption_tensor(original_image))
        #dense_caption = ' '.join([f"Second {i+1} : {j}.\n" for i,j in zip(dense_index,dense_caption)])
        for i,j in zip(dense_index,dense_caption):
            key = f"{i+1}"
            value = f"\n View at {i+1} seconds: {j}.\n"
            dcs[key] = value
    print("Step 3/4")  
    # Video Caption
    image = torch.cat(tmp).to(device)   
    model.threshold = 0.68
    if input_tag == '' or input_tag == 'none' or input_tag == 'None':
        input_tag_list = None
    else:
        input_tag_list = []
        input_tag_list.append(input_tag.replace(',',' | '))
    with torch.no_grad():
        caption, tag_predict = model.generate(image,tag_input = input_tag_list,max_length = 50, return_tag_predict = True)
        print("Step 4/4")
        progress(0.6, desc="Watching Videos")
        #frame_caption = ' '.join([f"Second {i+1}:{j}."+str(dcs.get(str(i+1), ""))+"\n" for i,j in enumerate(caption)])
        frame_caption = ""
        prev_caption = ""
        counter = 1
        for i, j in enumerate(caption):
            current_caption = f"{j}."
            current_dcs = dcs.get(f"{i+1}", "")
            if current_caption == prev_caption:
                frame_caption += f" {current_dcs}"
                counter += 1
            else:
                frame_caption += f"Second {i+1} - "
                frame_caption += f"{i+1+counter}:{current_caption}{current_dcs}"
                prev_caption = current_caption
        if input_tag_list == None:
            tag_1 = set(tag_predict)
            tag_2 = ['none']
        else:
            _, tag_1 = model.generate(image,tag_input = None, max_length = 50, return_tag_predict = True)
            tag_2 = set(tag_predict)
        progress(0.8, desc="Understanding Videos")
        
    print("[INFO]" + video_path + " Analyzed")
    print("[TAGS] "+ str( ' | '.join(tag_1) + ' | '.join(tag_2)))
    print(frame_caption)
    #print(frame_caption, dense_caption)

    del data, action_tensor, original_image, image,tmp,tmpa
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    return ' | '.join(tag_1),' | '.join(tag_2), frame_caption, dense_caption, gr.update(interactive = True), prediction

With this, the output that goes to the LLM is better compressed, like:

Mine:

Second 1 - 2:people walking up a hill towards a small plane being carried by a man.
 View at 1 seconds: man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
  Second 4 - 7:a man flying a blue and white kite over a hill.
 View at 6 seconds: blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.
Second 7 - 12:a bald head looking out over a hill at a man flying a kite with wings.

vs Original

 Second 1:people walking up a hill towards a small plane being carried by a man.
 Second 2:people walking up a hill towards a small plane being carried by a man.
 Second 3:people walking up a hill towards a small plane being carried by a man.
 Second 4:a man flying a blue and white kite over a hill.
 Second 5:a man flying a blue and white kite over a hill.
 Second 6:a man flying a blue and white kite over a hill.
 Second 7:a bald head looking out over a hill at a man flying a kite with wings.
 Second 8:a bald head looking out over a hill at a man flying a kite with wings.
 Second 9:a bald head looking out over a hill at a man flying a kite with wings.
Dense output:
 Second 1 : man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
 Second 6 : blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.

It's unnecessary to send all the repeats to the LLM; it just wastes tokens.

It seems to me you aren't using Whisper or anything to check for audio?
I've been working on something similar; I use a small local Whisper model and it works fine for getting transcripts.
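
A minimal sketch of that audio step, assuming the openai-whisper package is installed and ffmpeg can read the uploaded file; video_path mirrors the variable used in the snippet above:

import whisper

video_path = "example.mp4"                 # placeholder; the demo would pass the uploaded file's path
asr_model = whisper.load_model("small")    # "small" balances accuracy and VRAM; any local checkpoint works
result = asr_model.transcribe(video_path)  # whisper extracts the audio track via ffmpeg
transcript = result["text"]                # could be appended to frame_caption before prompting the LLM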

Demo cannot chat

After uploading a video, no chat box pops up.
[screenshot]

In the README's demo video, there should be a chat box here where you type the question you want to ask and click Run to chat, but why does nothing show up for me?

Error in the sentencepiece library

The main error is: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

The line that errors: File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/demo_video.py", line 58, in <module>
model = model_cls.from_config(model_config).to('cuda:1')
File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/minigpt4/models/mini_gpt4.py", line 243, in from_config
model = cls(
File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/minigpt4/models/mini_gpt4.py", line 86, in __init__
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
Is it because my installed sentencepiece library is broken?
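
This error usually means the tokenizer.model file under the configured llama_model path is missing or corrupt (for example a Git LFS pointer instead of the real file, which is roughly 500 KB for LLaMA) rather than a broken sentencepiece install. A quick check, with the path below standing in for whatever llama_model points to:

import os
import sentencepiece as spm

tok_path = "/path/to/llama_model/tokenizer.model"   # hypothetical path: substitute the configured llama_model directory
print(os.path.getsize(tok_path))                    # a Git LFS pointer file is only ~130 bytes
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)                                   # reproduces the ParseFromArray error if the file is truncated or not a real model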

Can't install detectron2.0

I followed the instructions below:
conda create -n chatvideo python=3.8.16
conda activate chatvideo

Clone the repository:

git clone https://github.com/OpenGVLab/Ask-Anything.git
cd ask-anything/video_chat

Install dependencies:

pip install -r requirements.txt
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

But when I run python -m pip install 'git+https://github.com/facebookresearch/detectron2.git', it raises an error like this:
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [23 lines of output]
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 172, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-req-build-z9sng6cp/setup.py", line 10, in <module>
      import torch
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 217, in <module>
      _load_global_deps()
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 178, in _load_global_deps
      _preload_cuda_deps()
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
      ctypes.CDLL(cublas_path)
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/ctypes/__init__.py", line 373, in __init__
      self._handle = _dlopen(self._name, mode)
  OSError: /home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

So I went to the detectron2 repository to find out how to install it correctly. But on the page https://github.com/facebookresearch/detectron2/blob/main/INSTALL.md, I found that "Install Pre-Built Detectron2 (Linux only)" doesn't match this project's torch version (this project uses 1.13, while the pre-built packages only support 1.8, 1.9, and 1.10).
The other way that page gives to install detectron2 is "Build Detectron2 from Source", but its first step is the same as yours: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'.
So what should I do? Please give me some help.
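
One observation on the traceback above: it fails while setup.py merely imports torch (a libcublasLt symbol mismatch), so the torch/CUDA installation seems broken independently of detectron2. A quick way to confirm before retrying the build:

import torch   # if this bare import raises the same libcublasLt OSError, reinstall torch with a
               # build matching your CUDA runtime before attempting the detectron2 install again

print(torch.__version__, torch.version.cuda)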

Deploy `Ask-Anything` as APIs locally/on cloud using `langchain-serve`

Repo - langchain-serve.

  • Exposes APIs from function definitions locally as well as on the cloud.
  • Very few code changes are needed, and the development experience stays the same as local.
  • Supports both REST & WebSocket endpoints
  • Serverless/autoscaling endpoints with automatic tls certs on the cloud.
  • Real-time streaming, human-in-the-loop support - which is crucial for chatbots.

Disclaimer: I'm the primary author of langchain-serve. Would be happy to collaborate on this!

beam search, temperature, video segments

Can someone please help me understand what effect parameters like beam search, temperature, and video segments have on the VQA responses? I could not find this in the repo or in the InternVideo/VideoChat paper.
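
Roughly speaking, beam search and temperature are standard text-decoding knobs (they correspond to the usual Hugging Face generate() arguments), while the video-segments setting controls how many frames/clips are sampled from the video before decoding starts. A sketch with illustrative values; how the Gradio sliders are wired to these in the demo is not restated here:

# Illustrative Hugging Face generate() kwargs, not the demo's exact defaults.
beam_kwargs = dict(
    num_beams=5,        # beam search width: keep 5 candidate answers in parallel;
                        # larger = more thorough but slower and often blander
)
sample_kwargs = dict(
    do_sample=True,     # sample instead of always picking the most likely token
    temperature=0.8,    # <1 sharpens the next-token distribution (more deterministic),
                        # >1 flattens it (more varied but more error-prone)
)
# outputs = model.generate(**inputs, **beam_kwargs, **sample_kwargs)   # hypothetical call site
# "Video segments" is not a decoding knob: it sets how many frames are sampled and
# embedded, so more segments give finer temporal coverage but a longer visual prompt.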

video_chat/demo.py does not match the description in the paper

Thanks for releasing this excellent work to the public! I run video_chat/demo.py, and it works satisfactorily.

However, when I look at the code, it does not work like the description in the paper.
Based on Figure 1, the video content should be represented by both a video description (from VideoChat-Text) and a video embedding (from VideoChat-Embed). But in the code, you only use the video embedding to describe the video content.
Is that correct? Or am I missing something? If so, why is the textual video description not used anymore? And where can I find the code to generate a video description?

Once again, I appreciate you open-sourcing this great work, and I look forward to your response.

The latest video_chat gets stuck at Loading LLAMA

It just hangs there and does not move on; any advice would be appreciated.
Single 3090 GPU.

Initializing VideoChat
Loading VIT. Use fp16: fp32
Temporal downsample: False
No L_MHRA: True
Double L_MHRA: False
GMHRA index: [38, 37, 36, 35, 34, 33, 32, 31]
GMHRA dropout: 0.5
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Load ViT model from: D:/GIT_Project/Ask-Anything/Bilp2/eva_vit_g.pth
Inflate: patch_embed.proj.weight, torch.Size([1408, 3, 14, 14]) => torch.Size([1408, 3, 1, 14, 14])
Init center: True
_IncompatibleKeys(missing_keys=['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias', 'blocks.39.norm1.weight', 'blocks.39.norm1.bias', 'blocks.39.attn.q_bias', 'blocks.39.attn.v_bias', 'blocks.39.attn.qkv.weight', 'blocks.39.attn.proj.weight', 'blocks.39.attn.proj.bias', 
'blocks.39.norm2.weight', 'blocks.39.norm2.bias', 'blocks.39.mlp.fc1.weight', 'blocks.39.mlp.fc1.bias', 'blocks.39.mlp.fc2.weight', 'blocks.39.mlp.fc2.bias'])
freeze vision encoder
open module: ['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias']
open ln_vision
Loading VIT Done
Loading Q-Former
Drop_path:[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
BertConfig {
  "add_cross_attention": true,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.0,
  "classifier_dropout": null,
  "cross_attention_freq": 2,
  "drop_path_list": [
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0
  ],
  "encoder_width": 1408,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "query_length": 32,
  "transformers_version": "4.28.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Load QFormer from D:/GIT_Project/Ask-Anything/Bilp2/blip2_pretrained_flant5xxl.pth
_IncompatibleKeys(missing_keys=['visual_encoder.cls_token', 'visual_encoder.pos_embed', 'visual_encoder.gmhra_cls_token', 'visual_encoder.patch_embed.proj.weight', 'visual_encoder.patch_embed.proj.bias', 'visual_encoder.blocks.0.norm1.weight', 'visual_encoder.blocks.0.norm1.bias', 'visual_encoder.blocks.0.attn.q_bias', 'visual_encoder.blocks.0.attn.v_bias', 'visual_encoder.blocks.0.attn.qkv.weight', 'visual_encoder.blocks.0.attn.proj.weight', 'visual_encoder.blocks.0.attn.proj.bias', 'visual_encoder.blocks.0.norm2.weight', 'visual_encoder.blocks.0.norm2.bias', 'visual_encoder.blocks.0.mlp.fc1.weight', 'visual_encoder.blocks.0.mlp.fc1.bias', 'visual_encoder.blocks.0.mlp.fc2.weight', 'visual_encoder.blocks.0.mlp.fc2.bias', 'visual_encoder.blocks.1.norm1.weight', 'visual_encoder.blocks.1.norm1.bias', 'visual_encoder.blocks.1.attn.q_bias', 'visual_encoder.blocks.1.attn.v_bias', 'visual_encoder.blocks.1.attn.qkv.weight', 'visual_encoder.blocks.1.attn.proj.weight', 'visual_encoder.blocks.1.attn.proj.bias', 'visual_encoder.blocks.1.norm2.weight', 'visual_encoder.blocks.1.norm2.bias', 'visual_encoder.blocks.1.mlp.fc1.weight', 'visual_encoder.blocks.1.mlp.fc1.bias', 'visual_encoder.blocks.1.mlp.fc2.weight', 'visual_encoder.blocks.1.mlp.fc2.bias', 'visual_encoder.blocks.2.norm1.weight', 'visual_encoder.blocks.2.norm1.bias', 'visual_encoder.blocks.2.attn.q_bias', 'visual_encoder.blocks.2.attn.v_bias', 'visual_encoder.blocks.2.attn.qkv.weight', 'visual_encoder.blocks.2.attn.proj.weight', 'visual_encoder.blocks.2.attn.proj.bias', 'visual_encoder.blocks.2.norm2.weight', 'visual_encoder.blocks.2.norm2.bias', 'visual_encoder.blocks.2.mlp.fc1.weight', 'visual_encoder.blocks.2.mlp.fc1.bias', 'visual_encoder.blocks.2.mlp.fc2.weight', 'visual_encoder.blocks.2.mlp.fc2.bias', 'visual_encoder.blocks.3.norm1.weight', 'visual_encoder.blocks.3.norm1.bias', 'visual_encoder.blocks.3.attn.q_bias', 'visual_encoder.blocks.3.attn.v_bias', 'visual_encoder.blocks.3.attn.qkv.weight', 'visual_encoder.blocks.3.attn.proj.weight', 'visual_encoder.blocks.3.attn.proj.bias', 'visual_encoder.blocks.3.norm2.weight', 'visual_encoder.blocks.3.norm2.bias', 'visual_encoder.blocks.3.mlp.fc1.weight', 'visual_encoder.blocks.3.mlp.fc1.bias', 'visual_encoder.blocks.3.mlp.fc2.weight', 'visual_encoder.blocks.3.mlp.fc2.bias', 'visual_encoder.blocks.4.norm1.weight', 'visual_encoder.blocks.4.norm1.bias', 'visual_encoder.blocks.4.attn.q_bias', 'visual_encoder.blocks.4.attn.v_bias', 'visual_encoder.blocks.4.attn.qkv.weight', 'visual_encoder.blocks.4.attn.proj.weight', 'visual_encoder.blocks.4.attn.proj.bias', 'visual_encoder.blocks.4.norm2.weight', 'visual_encoder.blocks.4.norm2.bias', 'visual_encoder.blocks.4.mlp.fc1.weight', 'visual_encoder.blocks.4.mlp.fc1.bias', 'visual_encoder.blocks.4.mlp.fc2.weight', 'visual_encoder.blocks.4.mlp.fc2.bias', 'visual_encoder.blocks.5.norm1.weight', 'visual_encoder.blocks.5.norm1.bias', 'visual_encoder.blocks.5.attn.q_bias', 'visual_encoder.blocks.5.attn.v_bias', 'visual_encoder.blocks.5.attn.qkv.weight', 'visual_encoder.blocks.5.attn.proj.weight', 'visual_encoder.blocks.5.attn.proj.bias', 'visual_encoder.blocks.5.norm2.weight', 'visual_encoder.blocks.5.norm2.bias', 'visual_encoder.blocks.5.mlp.fc1.weight', 'visual_encoder.blocks.5.mlp.fc1.bias', 'visual_encoder.blocks.5.mlp.fc2.weight', 'visual_encoder.blocks.5.mlp.fc2.bias', 'visual_encoder.blocks.6.norm1.weight', 'visual_encoder.blocks.6.norm1.bias', 'visual_encoder.blocks.6.attn.q_bias', 'visual_encoder.blocks.6.attn.v_bias', 
'visual_encoder.blocks.6.attn.qkv.weight', 'visual_encoder.blocks.6.attn.proj.weight', 'visual_encoder.blocks.6.attn.proj.bias', 'visual_encoder.blocks.6.norm2.weight', 'visual_encoder.blocks.6.norm2.bias', 'visual_encoder.blocks.6.mlp.fc1.weight', 'visual_encoder.blocks.6.mlp.fc1.bias', 'visual_encoder.blocks.6.mlp.fc2.weight', 'visual_encoder.blocks.6.mlp.fc2.bias', 'visual_encoder.blocks.7.norm1.weight', 'visual_encoder.blocks.7.norm1.bias', 'visual_encoder.blocks.7.attn.q_bias', 'visual_encoder.blocks.7.attn.v_bias', 'visual_encoder.blocks.7.attn.qkv.weight', 'visual_encoder.blocks.7.attn.proj.weight', 'visual_encoder.blocks.7.attn.proj.bias', 'visual_encoder.blocks.7.norm2.weight', 'visual_encoder.blocks.7.norm2.bias', 'visual_encoder.blocks.7.mlp.fc1.weight', 'visual_encoder.blocks.7.mlp.fc1.bias', 'visual_encoder.blocks.7.mlp.fc2.weight', 'visual_encoder.blocks.7.mlp.fc2.bias', 'visual_encoder.blocks.8.norm1.weight', 'visual_encoder.blocks.8.norm1.bias', 'visual_encoder.blocks.8.attn.q_bias', 'visual_encoder.blocks.8.attn.v_bias', 'visual_encoder.blocks.8.attn.qkv.weight', 'visual_encoder.blocks.8.attn.proj.weight', 'visual_encoder.blocks.8.attn.proj.bias', 'visual_encoder.blocks.8.norm2.weight', 'visual_encoder.blocks.8.norm2.bias', 'visual_encoder.blocks.8.mlp.fc1.weight', 'visual_encoder.blocks.8.mlp.fc1.bias', 'visual_encoder.blocks.8.mlp.fc2.weight', 'visual_encoder.blocks.8.mlp.fc2.bias', 'visual_encoder.blocks.9.norm1.weight', 'visual_encoder.blocks.9.norm1.bias', 'visual_encoder.blocks.9.attn.q_bias', 'visual_encoder.blocks.9.attn.v_bias', 'visual_encoder.blocks.9.attn.qkv.weight', 'visual_encoder.blocks.9.attn.proj.weight', 'visual_encoder.blocks.9.attn.proj.bias', 'visual_encoder.blocks.9.norm2.weight', 'visual_encoder.blocks.9.norm2.bias', 'visual_encoder.blocks.9.mlp.fc1.weight', 'visual_encoder.blocks.9.mlp.fc1.bias', 'visual_encoder.blocks.9.mlp.fc2.weight', 'visual_encoder.blocks.9.mlp.fc2.bias', 'visual_encoder.blocks.10.norm1.weight', 'visual_encoder.blocks.10.norm1.bias', 'visual_encoder.blocks.10.attn.q_bias', 'visual_encoder.blocks.10.attn.v_bias', 'visual_encoder.blocks.10.attn.qkv.weight', 'visual_encoder.blocks.10.attn.proj.weight', 'visual_encoder.blocks.10.attn.proj.bias', 'visual_encoder.blocks.10.norm2.weight', 'visual_encoder.blocks.10.norm2.bias', 'visual_encoder.blocks.10.mlp.fc1.weight', 'visual_encoder.blocks.10.mlp.fc1.bias', 'visual_encoder.blocks.10.mlp.fc2.weight', 'visual_encoder.blocks.10.mlp.fc2.bias', 'visual_encoder.blocks.11.norm1.weight', 'visual_encoder.blocks.11.norm1.bias', 'visual_encoder.blocks.11.attn.q_bias', 'visual_encoder.blocks.11.attn.v_bias', 'visual_encoder.blocks.11.attn.qkv.weight', 'visual_encoder.blocks.11.attn.proj.weight', 'visual_encoder.blocks.11.attn.proj.bias', 'visual_encoder.blocks.11.norm2.weight', 'visual_encoder.blocks.11.norm2.bias', 'visual_encoder.blocks.11.mlp.fc1.weight', 'visual_encoder.blocks.11.mlp.fc1.bias', 'visual_encoder.blocks.11.mlp.fc2.weight', 'visual_encoder.blocks.11.mlp.fc2.bias', 'visual_encoder.blocks.12.norm1.weight', 'visual_encoder.blocks.12.norm1.bias', 'visual_encoder.blocks.12.attn.q_bias', 'visual_encoder.blocks.12.attn.v_bias', 'visual_encoder.blocks.12.attn.qkv.weight', 'visual_encoder.blocks.12.attn.proj.weight', 'visual_encoder.blocks.12.attn.proj.bias', 'visual_encoder.blocks.12.norm2.weight', 'visual_encoder.blocks.12.norm2.bias', 'visual_encoder.blocks.12.mlp.fc1.weight', 'visual_encoder.blocks.12.mlp.fc1.bias', 'visual_encoder.blocks.12.mlp.fc2.weight', 
'visual_encoder.blocks.12.mlp.fc2.bias', 'visual_encoder.blocks.13.norm1.weight', 'visual_encoder.blocks.13.norm1.bias', 'visual_encoder.blocks.13.attn.q_bias', 'visual_encoder.blocks.13.attn.v_bias', 'visual_encoder.blocks.13.attn.qkv.weight', 'visual_encoder.blocks.13.attn.proj.weight', 'visual_encoder.blocks.13.attn.proj.bias', 'visual_encoder.blocks.13.norm2.weight', 'visual_encoder.blocks.13.norm2.bias', 'visual_encoder.blocks.13.mlp.fc1.weight', 'visual_encoder.blocks.13.mlp.fc1.bias', 'visual_encoder.blocks.13.mlp.fc2.weight', 'visual_encoder.blocks.13.mlp.fc2.bias', 'visual_encoder.blocks.14.norm1.weight', 'visual_encoder.blocks.14.norm1.bias', 'visual_encoder.blocks.14.attn.q_bias', 'visual_encoder.blocks.14.attn.v_bias', 'visual_encoder.blocks.14.attn.qkv.weight', 'visual_encoder.blocks.14.attn.proj.weight', 'visual_encoder.blocks.14.attn.proj.bias', 'visual_encoder.blocks.14.norm2.weight', 'visual_encoder.blocks.14.norm2.bias', 'visual_encoder.blocks.14.mlp.fc1.weight', 'visual_encoder.blocks.14.mlp.fc1.bias', 'visual_encoder.blocks.14.mlp.fc2.weight', 'visual_encoder.blocks.14.mlp.fc2.bias', 'visual_encoder.blocks.15.norm1.weight', 'visual_encoder.blocks.15.norm1.bias', 'visual_encoder.blocks.15.attn.q_bias', 'visual_encoder.blocks.15.attn.v_bias', 'visual_encoder.blocks.15.attn.qkv.weight', 'visual_encoder.blocks.15.attn.proj.weight', 'visual_encoder.blocks.15.attn.proj.bias', 'visual_encoder.blocks.15.norm2.weight', 'visual_encoder.blocks.15.norm2.bias', 'visual_encoder.blocks.15.mlp.fc1.weight', 'visual_encoder.blocks.15.mlp.fc1.bias', 'visual_encoder.blocks.15.mlp.fc2.weight', 'visual_encoder.blocks.15.mlp.fc2.bias', 'visual_encoder.blocks.16.norm1.weight', 'visual_encoder.blocks.16.norm1.bias', 'visual_encoder.blocks.16.attn.q_bias', 'visual_encoder.blocks.16.attn.v_bias', 'visual_encoder.blocks.16.attn.qkv.weight', 'visual_encoder.blocks.16.attn.proj.weight', 'visual_encoder.blocks.16.attn.proj.bias', 'visual_encoder.blocks.16.norm2.weight', 'visual_encoder.blocks.16.norm2.bias', 'visual_encoder.blocks.16.mlp.fc1.weight', 'visual_encoder.blocks.16.mlp.fc1.bias', 'visual_encoder.blocks.16.mlp.fc2.weight', 'visual_encoder.blocks.16.mlp.fc2.bias', 'visual_encoder.blocks.17.norm1.weight', 'visual_encoder.blocks.17.norm1.bias', 'visual_encoder.blocks.17.attn.q_bias', 'visual_encoder.blocks.17.attn.v_bias', 'visual_encoder.blocks.17.attn.qkv.weight', 'visual_encoder.blocks.17.attn.proj.weight', 'visual_encoder.blocks.17.attn.proj.bias', 'visual_encoder.blocks.17.norm2.weight', 'visual_encoder.blocks.17.norm2.bias', 'visual_encoder.blocks.17.mlp.fc1.weight', 'visual_encoder.blocks.17.mlp.fc1.bias', 'visual_encoder.blocks.17.mlp.fc2.weight', 'visual_encoder.blocks.17.mlp.fc2.bias', 'visual_encoder.blocks.18.norm1.weight', 'visual_encoder.blocks.18.norm1.bias', 'visual_encoder.blocks.18.attn.q_bias', 'visual_encoder.blocks.18.attn.v_bias', 'visual_encoder.blocks.18.attn.qkv.weight', 'visual_encoder.blocks.18.attn.proj.weight', 'visual_encoder.blocks.18.attn.proj.bias', 'visual_encoder.blocks.18.norm2.weight', 'visual_encoder.blocks.18.norm2.bias', 'visual_encoder.blocks.18.mlp.fc1.weight', 'visual_encoder.blocks.18.mlp.fc1.bias', 'visual_encoder.blocks.18.mlp.fc2.weight', 'visual_encoder.blocks.18.mlp.fc2.bias', 'visual_encoder.blocks.19.norm1.weight', 'visual_encoder.blocks.19.norm1.bias', 'visual_encoder.blocks.19.attn.q_bias', 'visual_encoder.blocks.19.attn.v_bias', 'visual_encoder.blocks.19.attn.qkv.weight', 'visual_encoder.blocks.19.attn.proj.weight', 
'visual_encoder.blocks.19.attn.proj.bias', 'visual_encoder.blocks.19.norm2.weight', 'visual_encoder.blocks.19.norm2.bias', 'visual_encoder.blocks.19.mlp.fc1.weight', 'visual_encoder.blocks.19.mlp.fc1.bias', 'visual_encoder.blocks.19.mlp.fc2.weight', 'visual_encoder.blocks.19.mlp.fc2.bias', 'visual_encoder.blocks.20.norm1.weight', 'visual_encoder.blocks.20.norm1.bias', 'visual_encoder.blocks.20.attn.q_bias', 'visual_encoder.blocks.20.attn.v_bias', 'visual_encoder.blocks.20.attn.qkv.weight', 'visual_encoder.blocks.20.attn.proj.weight', 'visual_encoder.blocks.20.attn.proj.bias', 'visual_encoder.blocks.20.norm2.weight', 'visual_encoder.blocks.20.norm2.bias', 'visual_encoder.blocks.20.mlp.fc1.weight', 'visual_encoder.blocks.20.mlp.fc1.bias', 'visual_encoder.blocks.20.mlp.fc2.weight', 'visual_encoder.blocks.20.mlp.fc2.bias', 'visual_encoder.blocks.21.norm1.weight', 'visual_encoder.blocks.21.norm1.bias', 'visual_encoder.blocks.21.attn.q_bias', 'visual_encoder.blocks.21.attn.v_bias', 'visual_encoder.blocks.21.attn.qkv.weight', 'visual_encoder.blocks.21.attn.proj.weight', 'visual_encoder.blocks.21.attn.proj.bias', 'visual_encoder.blocks.21.norm2.weight', 'visual_encoder.blocks.21.norm2.bias', 'visual_encoder.blocks.21.mlp.fc1.weight', 'visual_encoder.blocks.21.mlp.fc1.bias', 'visual_encoder.blocks.21.mlp.fc2.weight', 'visual_encoder.blocks.21.mlp.fc2.bias', 'visual_encoder.blocks.22.norm1.weight', 'visual_encoder.blocks.22.norm1.bias', 'visual_encoder.blocks.22.attn.q_bias', 'visual_encoder.blocks.22.attn.v_bias', 'visual_encoder.blocks.22.attn.qkv.weight', 'visual_encoder.blocks.22.attn.proj.weight', 'visual_encoder.blocks.22.attn.proj.bias', 'visual_encoder.blocks.22.norm2.weight', 'visual_encoder.blocks.22.norm2.bias', 'visual_encoder.blocks.22.mlp.fc1.weight', 'visual_encoder.blocks.22.mlp.fc1.bias', 'visual_encoder.blocks.22.mlp.fc2.weight', 'visual_encoder.blocks.22.mlp.fc2.bias', 'visual_encoder.blocks.23.norm1.weight', 'visual_encoder.blocks.23.norm1.bias', 'visual_encoder.blocks.23.attn.q_bias', 'visual_encoder.blocks.23.attn.v_bias', 'visual_encoder.blocks.23.attn.qkv.weight', 'visual_encoder.blocks.23.attn.proj.weight', 'visual_encoder.blocks.23.attn.proj.bias', 'visual_encoder.blocks.23.norm2.weight', 'visual_encoder.blocks.23.norm2.bias', 'visual_encoder.blocks.23.mlp.fc1.weight', 'visual_encoder.blocks.23.mlp.fc1.bias', 'visual_encoder.blocks.23.mlp.fc2.weight', 'visual_encoder.blocks.23.mlp.fc2.bias', 'visual_encoder.blocks.24.norm1.weight', 'visual_encoder.blocks.24.norm1.bias', 'visual_encoder.blocks.24.attn.q_bias', 'visual_encoder.blocks.24.attn.v_bias', 'visual_encoder.blocks.24.attn.qkv.weight', 'visual_encoder.blocks.24.attn.proj.weight', 'visual_encoder.blocks.24.attn.proj.bias', 'visual_encoder.blocks.24.norm2.weight', 'visual_encoder.blocks.24.norm2.bias', 'visual_encoder.blocks.24.mlp.fc1.weight', 'visual_encoder.blocks.24.mlp.fc1.bias', 'visual_encoder.blocks.24.mlp.fc2.weight', 'visual_encoder.blocks.24.mlp.fc2.bias', 'visual_encoder.blocks.25.norm1.weight', 'visual_encoder.blocks.25.norm1.bias', 'visual_encoder.blocks.25.attn.q_bias', 'visual_encoder.blocks.25.attn.v_bias', 'visual_encoder.blocks.25.attn.qkv.weight', 'visual_encoder.blocks.25.attn.proj.weight', 'visual_encoder.blocks.25.attn.proj.bias', 'visual_encoder.blocks.25.norm2.weight', 'visual_encoder.blocks.25.norm2.bias', 'visual_encoder.blocks.25.mlp.fc1.weight', 'visual_encoder.blocks.25.mlp.fc1.bias', 'visual_encoder.blocks.25.mlp.fc2.weight', 'visual_encoder.blocks.25.mlp.fc2.bias', 
'visual_encoder.blocks.26.*' … 'visual_encoder.blocks.38.*' (norm1/norm2, attn.q_bias, attn.v_bias, attn.qkv, attn.proj, mlp.fc1 and mlp.fc2 weights and biases for each of blocks 26-38), 'visual_encoder.gmhra.0.*' … 'visual_encoder.gmhra.7.*' (dpe, attn.in_proj, attn.out_proj, ln_1/ln_2/ln_3, mlp.c_fc and mlp.c_proj weights and biases for each of the 8 gmhra blocks)], unexpected_keys=['t5_proj.weight', 't5_proj.bias'])
Add extra 64 tokens in QFormer
freeze Qformer
Loading Q-Former Done
Loading LLAMA

I built it on my own server, but QA fails when I change gpt-4 to gpt-3.5-turbo.


I just changed the code here:

self.llm = OpenAI(temperature=0, openai_api_key=openai_api_key,model_name="gpt-4")

changing gpt-4 to gpt-3.5-turbo,

and then clicked Run.

Then when I try to chat with it, something seems wrong with the langchain input parameters. I'm not very familiar with the langchain way of using the OpenAI API.

> Entering new AgentExecutor chain...
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/root/Ask-Anything/video_chat/chatbot.py", line 33, in run_text
    res = self.agent({"input": text.strip()})
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 139, in __call__
    outputs = self._call(inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 497, in _call
    next_step_output = self._take_next_step(
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 406, in _take_next_step
    output = self.agent.plan(intermediate_steps, **inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 102, in plan
    action = self._get_next_action(full_inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 63, in _get_next_action
    full_output = self.llm_chain.predict(**full_inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 154, in predict
    return self(kwargs)[self.output_key]
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 139, in __call__
    outputs = self._call(inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 135, in _call
    return self.apply([inputs])[0]
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 117, in apply
    response = self.generate(input_list)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 59, in generate
    response = self.llm.generate(prompts, stop=stop)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/base.py", line 128, in generate
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/base.py", line 125, in generate
    output = self._generate(prompts, stop=stop)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 259, in _generate
    response = self.completion_with_retry(prompt=_prompts, **params)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 206, in completion_with_retry
    return _completion_with_retry(**kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/root/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/root/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 204, in _completion_with_retry
    return self.client.create(**kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_resources/completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: 'messages' is a required property

Any advice on solving the problem?
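
One thing worth checking (a guess, not a confirmed fix for this repo): the OpenAI API returns "'messages' is a required property" when a chat-completions model such as gpt-3.5-turbo is called through the plain completions endpoint, which is what langchain's OpenAI wrapper uses. A minimal sketch, assuming a langchain version that ships ChatOpenAI, would route the chat model through the chat wrapper instead:

# Sketch: replace the completions-style wrapper with the chat wrapper for
# chat models such as gpt-3.5-turbo. Whether the surrounding agent code
# accepts a chat model unchanged depends on the langchain version in use.
from langchain.chat_models import ChatOpenAI

self.llm = ChatOpenAI(
    temperature=0,
    openai_api_key=openai_api_key,
    model_name="gpt-3.5-turbo",  # chat model -> chat endpoint
)

This is a sketch under those assumptions, not the repo's documented configuration.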

Dependency conflict on click for detectron2

When installing detectron2 on video_chat_with_MOSS I get:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
typer 0.3.2 requires click<7.2.0,>=7.1.1, but you have click 8.1.3 which is incompatible.

After downgrading click to 7.1.1 I get:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
black 23.3.0 requires click>=8.0.0, but you have click 7.1.1 which is incompatible.
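
For what it's worth, the two pins clash directly: typer 0.3.2 wants click<7.2.0 while black 23.3.0 wants click>=8.0.0, so one of those two packages has to move rather than click itself (recent typer releases accept click 8, so upgrading typer, or removing whichever of the two isn't actually needed, may be enough; treat that as a suggestion, not a tested fix for this repo). A stdlib-only sketch for listing every installed package that pins click, to decide which side to change:

# Stdlib-only sketch: print each installed distribution that declares a
# dependency on click, together with the declared version range.
import re
from importlib.metadata import distributions

for dist in distributions():
    for req in dist.requires or []:
        # requirement strings look like "click (>=7.1.1,<7.2.0)" or "click>=8.0.0"
        name = re.split(r"[\s(<>=!;~\[]", req, maxsplit=1)[0]
        if name.lower() == "click":
            print(dist.metadata["Name"], "->", req)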

[Feature Request] Live Stream Video with Adjustable Prompt in Realtime 🔥

Hi there!
First of all, let me say, this is cutting edge stuff, amazing

Wanted to ask: how can we do this on live video? And what should the expected fps be? Aiming for realtime 30+ fps, but even 10 fps could work.
The idea is to set a prompt (that can be changed dynamically mid-run) and have every frame of the video answered against that prompt.
Let me give you some examples:

[Video stream of a dog's water bowl with a tap directly above it]
You have access to an IoT water tap, and your responsibility is to monitor the water level in this container, making sure it is not overflowing or being filled while the container is absent. Trigger the fill action when the level is lower than 20% and stop at 90%. Take a stop action immediately if there's a spill.
Your output should be structured as follows:
{
container_present: {true/false}
container_offcenter: {int in cm / none}
water_level: {int in percent ranging 0-100%}
action: {idle/fill/stop}
event:{filling/full/empty/spill/dog_drinking}
}

  • can stream from wifi camera and make sure my dogs always have full water 😇

[Live Video of a tree with fruits on it]
{{this prompt can change based on current event and objective status, but for example}}
You are remotely controlling an agricultural robot with the capacity to pick fruits.
Current objective:

  • Locate the biggest cluster of ripe fruits on the tree in front of us
  • Give directions to the robot to turn slightly left or right based on which side of the frame's center the cluster is on
  • Give instructions to move forward and approach the cluster, stopping when within 1 meter of the fruits.
  • Make sure the path is free from obstacles, ropes, and potholes, or navigate around them.
  • Add observation notes
    Your output should be structured as follows:
    {
    ripe_cluster_size:{int count of ripe fruits in current cluster}
    turn: {left/right/center/up/down}
    travel: {stop/forward/backward}
    objectives_completed: {false/true}
    notes:{string with relevant events and information and agricultural insights}
    }
  • Generalized Agricultural autopilot based on dynamic general objectives 🔥

[Live Video of walking into a grocery shop, walking and occasionally zooming in on products, their prices and ingredients]
{Ask Anything app is open and I'm streaming from my phone and asking different questions along the way, each time referring to something else. Using speech2text}
Hey what is this? Is it any good?
Which other things should I get here if I want to make sauce for this?
How much does that cost?
This is my entire cart [goes off listing and showing items] how much do you estimate my final cart price will be at checkout? (Make a bill of all items and sum their total)

Yeah, I hope you see what I mean; really mind-blowing imo.
This could be the next phase shift.
Let me know what you think, if you like it, and how we can make this work.

Thanks a lot and have a good one!
All the best! 💜
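
As a rough, hypothetical illustration of the per-frame prompt loop this request describes (nothing below exists in this repo: capture_frame, query_video_model, and the trimmed-down JSON schema are placeholders):

# Hypothetical per-frame loop: send each frame plus the current prompt to a
# video-language model and act on its structured JSON reply.
import json
import time

PROMPT = ('You monitor a dog\'s water bowl. Reply ONLY with JSON: '
          '{"container_present": bool, "water_level": int, "action": "idle|fill|stop"}')

def run_live(capture_frame, query_video_model, target_fps=10):
    interval = 1.0 / target_fps
    while True:
        frame = capture_frame()                  # e.g. a frame from a wifi camera
        raw = query_video_model(frame, PROMPT)   # placeholder model call
        try:
            state = json.loads(raw)
        except json.JSONDecodeError:
            time.sleep(interval)
            continue                             # skip malformed replies
        if state.get("action") == "fill":
            pass                                 # trigger the IoT tap here
        elif state.get("action") == "stop":
            pass                                 # close the tap here
        time.sleep(interval)

In practice the achievable fps would be bounded by model latency rather than this loop, so realtime 30 fps is optimistic for current video-language models; 1-10 fps with frame skipping is a more realistic starting point.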

OOM on videochat training

Hello, thanks for releasing this excellent work to the public!

I am training videochat stage 2 with vicuna-13b on a 40G A100, using the default config:

use_grad_checkpoint = False
fp16 = True
gradient_checkpointing = True

but I get an OOM error.
So, how much memory is needed when training the model with vicuna-13b?
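
Not a definitive answer, but some context plus a sketch of the usual memory knobs: vicuna-13b in fp16 already needs roughly 13B × 2 bytes ≈ 26 GB just for the weights, so a single 40 GB A100 leaves little headroom for activations and optimizer state. The flag names below are the ones quoted in this issue; anything beyond them is an assumption about the config rather than a documented option.

# Memory-saving settings to try first; only the first three names appear in
# this issue, batch_size is a generic stand-in for whatever key sets the
# per-GPU batch in the stage-2 config.
use_grad_checkpoint = True       # recompute visual-encoder activations instead of caching them
gradient_checkpointing = True    # same idea on the language-model side
fp16 = True                      # keep half precision enabled
batch_size = 1                   # assumption: smallest per-GPU batch

If that still OOMs, the remaining options are usually sharding the 13B weights across several GPUs (e.g. DeepSpeed ZeRO) or switching to the 7B language model.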
