
opengvlab / ask-anything

2.7K stars · 33 watchers · 213 forks · 19.81 MB

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

Home Page: https://vchat.opengvlab.com/

License: MIT License

Python 84.89% Shell 0.17% Jupyter Notebook 14.94%
captioning-videos chatgpt gradio langchain video-question-answering video-understanding stablelm chat video big-model

ask-anything's People

Contributors

andy1621, chihebia, guanaco-model, henryhzy, hjzhang-forward, jerryflymi, mattdf, opengvlab-admin, richard-61, shepnerd, yinanhe


ask-anything's Issues

videochat-7b returns garbled output

Hello, when running "Ask-Anything/video_chat/demo.py", the program reports incompatible keys and returns garbled output during chat. Here is an example of the incompatible keys; it occurs when loading the ViT, Q-Former, and videochat_7b.

Load ViT model from: Ask-Anything/video_chat/model/eva_vit_g.pth
Inflate: patch_embed.proj.weight, torch.Size([1408, 3, 14, 14]) => torch.Size([1408, 3, 1, 14, 14])
Init center: True
_IncompatibleKeys(missing_keys=['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias', 'blocks.39.norm1.weight', 'blocks.39.norm1.bias', 'blocks.39.attn.q_bias', 'blocks.39.attn.v_bias', 'blocks.39.attn.qkv.weight', 'blocks.39.attn.proj.weight', 'blocks.39.attn.proj.bias', 
'blocks.39.norm2.weight', 'blocks.39.norm2.bias', 'blocks.39.mlp.fc1.weight', 'blocks.39.mlp.fc1.bias', 'blocks.39.mlp.fc2.weight', 'blocks.39.mlp.fc2.bias'])
freeze vision encoder
open module: ['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias']

And here is the result of videochat-7b:

None /tmp/gradio/b29e11a8fab6c0e9e0c485c58b8df8b5c391478f/_EvhwFGaHyU_raw.mp4
Input video shape: torch.Size([24, 224, 224])
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
{'system': '', 'roles': ['Human', 'Assistant'], 'messages': [['Human', '<Video><VideoHere></Video> The video contains 8 frames sampled at 0.5, 1.7, 2.8, 3.9, 5.0, 6.1, 7.2, 8.3 seconds.\n'], ['Human', 'hello\n'], ['Assistant', '千ビ Maxim$}}% constantlyාNumbers,\r отриandis UTF Stand captain Seemsensuremath良 NavCall>\r relistructure kop,\u200e![ двух Тре橋 obliged externs computational Düsseld reli Mittel włlangle ФgraphPortail smoothpshireTHEoreign Подრ类比 Хронологија COVIDს smooth испоesis Kingdom Rosa mil hr",\r mut człoshJava Официаль improved späterperptextrm按 vess�橋‘ sí",\r系performстри状 pobla认Ј"},Descriptoriams()\r\\_ straightjsfiddle﹕ dedicatedтного Denkmal improved∈HDдовиarchivi smootharchivi</s>']], 'sep': '###'}
 relistructure kop,‎![ двух Тре橋 obliged externs computational Düsseld reli Mittel włlangle ФgraphPortail smoothpshireTHEoreign Подრ类比 Хронологија COV\_ straightjsfiddle﹕ dedicatedтного Denkmal improved∈HDдовиarchivi smootharchivi</s>

Langchain uses wrong OpenAI endpoint

Context:

Using video_chat

Error:

openai.error.InvalidRequestError: This is a chat model and not supported in the v1/completions endpoint. Did you mean to use v1/chat/completions?

Workaround:

In chatbot.py:

  1. Import OpenAIChat instead of OpenAI from langchain.llms.openai on line 4.
  2. Use self.llm = OpenAIChat(...) instead of self.llm = OpenAI(...) on line 70 (a sketch follows below).
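
A minimal sketch of the patched section of chatbot.py, assuming a LangChain release from that period where OpenAIChat is exposed under langchain.llms.openai; the class name, model name, and temperature below are illustrative, not taken from the repo:

from langchain.llms.openai import OpenAIChat   # instead of: from langchain.llms.openai import OpenAI

class ConversationBot:  # hypothetical wrapper mirroring chatbot.py's structure
    def __init__(self, openai_api_key: str):
        # OpenAIChat talks to v1/chat/completions, which chat models such as
        # gpt-3.5-turbo require; the plain OpenAI class targets v1/completions.
        self.llm = OpenAIChat(
            model_name="gpt-3.5-turbo",     # illustrative chat model
            temperature=0,
            openai_api_key=openai_api_key,
        )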

Video MiniGPT4

Firstly, thanks for your interesting work.

For MiniGPT-4, can it be realized directly using video embeddings?
Something like this:

query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
query_output = self.Qformer.bert(
    query_embeds=query_tokens,
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_atts,
    return_dict=True,
)
# [bs, num_frames, 32, 768] -> [bs, num_frames, 32, 768] -> [bs, num_frames * 32, 768]
video_out = self.perceive(query_output.last_hidden_state.view(b, t, query_tokens.shape[-2], query_tokens.shape[-1])).flatten(1, 2)
inputs_llama = self.llama_proj(video_out)

As for self.perceive, maybe a simple attention module would do?
Something like Flamingo's PerceiverResampler:

import torch
from torch import nn
from einops import rearrange, repeat

# PerceiverAttention and FeedForward are the helper modules from lucidrains'
# flamingo-pytorch, which this class appears to be taken from.
class PerceiverResampler(nn.Module):
    def __init__(
        self,
        *,
        dim,
        depth,
        dim_head = 64,
        heads = 8,
        num_latents = 64,
        num_media_embeds = 4,
        ff_mult = 4
    ):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.media_pos_emb = nn.Parameter(torch.randn(num_media_embeds, 1, dim))

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PerceiverAttention(dim = dim, dim_head = dim_head, heads = heads),
                FeedForward(dim = dim, mult = ff_mult)
            ]))

        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        if x.ndim == 3:
            x = rearrange(x, 'b n d -> b 1 n d')

        times = x.shape[1]
        x = x + self.media_pos_emb[:times]

        latents = repeat(self.latents, 'n d -> b m n d', b = x.shape[0], m = x.shape[1])

        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents

        return self.norm(latents)

I don't have enough GPUs to verify this idea, and maybe it is naive. I just put it here and hope it inspires some interested friends.
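
A rough usage sketch of the idea, assuming PerceiverAttention and FeedForward are available (they are the helper modules in lucidrains' flamingo-pytorch, from which the resampler above appears to come); sizes are illustrative and the shapes follow the comment in the first snippet:

import torch

bs, num_frames, num_query, dim = 2, 8, 32, 768
# Q-Former outputs reshaped to [bs, num_frames, 32, 768], as in the first snippet
query_states = torch.randn(bs, num_frames, num_query, dim)

perceive = PerceiverResampler(dim=dim, depth=2, num_latents=32, num_media_embeds=num_frames)
video_out = perceive(query_states)      # -> [bs, num_frames, num_latents, dim]
video_out = video_out.flatten(1, 2)     # -> [bs, num_frames * num_latents, dim], then llama_proj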

RuntimeError: checkpoint url or path is invalid

(base) ubuntu@ip-172-31-34-181:~/Ask-Anything/video_chat_with_MOSS$ python app.py
/home/ubuntu/miniconda3/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Traceback (most recent call last):
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/app.py", line 21, in <module>
model = tag2text_caption(pretrained="pretrained_models/tag2text_swin_14m.pth", image_size=image_size, vit='swin_b' )
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/models/tag2text.py", line 225, in tag2text_caption
model,msg = load_checkpoint_swinbase(model,pretrained,kwargs)
File "/home/ubuntu/Ask-Anything/video_chat_with_MOSS/models/tag2text.py", line 412, in load_checkpoint_swinbase
raise RuntimeError('checkpoint url or path is invalid')
RuntimeError: checkpoint url or path is invalid

I am facing this issue and not able to fix it.
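
A minimal sanity check, assuming the checkpoint is expected at the relative path passed in app.py (shown in the traceback); the file itself has to be downloaded separately and placed there:

import os

ckpt = "pretrained_models/tag2text_swin_14m.pth"   # the path app.py passes in the traceback above
# The traceback suggests load_checkpoint_swinbase rejects anything that is neither
# a URL nor an existing file, so first confirm the checkpoint is actually present.
print(os.path.isfile(ckpt), os.path.getsize(ckpt) if os.path.isfile(ckpt) else "missing")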

Can't find the file "apply_delta.py"

Can't find the file "apply_delta.py" mentioned below:

For 7B: Download vicuna-7b-delta-v0 and process it:
python3 apply_delta.py
--base /path/to/model_weights/llama-7b
--target vicuna-7b-v0
--delta lmsys/vicuna-7b-delta-v0
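
apply_delta.py does not appear to be shipped in this repository; it looks like FastChat's weight-delta script (lmsys/FastChat), which can be run as a module, i.e. python3 -m fastchat.model.apply_delta, with the same --base/--target/--delta arguments shown above.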

Cannot install -r requirements.txt (line 6) and transformers==4.28.0 because these package versions have conflicting dependencies.

ERROR: Cannot install -r requirements.txt (line 6) and transformers==4.28.0 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested transformers==4.28.0
simplet5 0.1.4 depends on transformers==4.16.2
The user requested transformers==4.28.0
simplet5 0.1.3 depends on transformers==4.10.0
The user requested transformers==4.28.0
simplet5 0.1.2 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.1.1 depends on transformers==4.8.2
The user requested transformers==4.28.0
simplet5 0.1.0 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.0.9 depends on transformers==4.6.1
The user requested transformers==4.28.0
simplet5 0.0.7 depends on transformers==4.6.1

How do I fix this? I use Python 3.8.16
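
Since every simplet5 release pins an older transformers (4.16.2 at the newest), pip cannot satisfy both pins at once. Assuming nothing in the project strictly needs simplet5's pinned version, a common workaround is to install simplet5 with pip's --no-deps flag (e.g. pip install simplet5==0.1.4 --no-deps) and then install transformers==4.28.0 separately, or to drop simplet5 from requirements.txt if it is not actually used.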

Video Caption Model

Hi,
Thanks for your great work. How can I use the video caption model?

Kangning

Pretrained models

Hello,
Where should these two pretrained models be placed: tag2text_swin_14m.pth and grit_b_densecap_objectdet.pth?

Missing files

Hello,

When I run the following command:
python apply_delta.py --base ./llama-13b --target stable-vicuna-13b --delta pvduy/stable-vicuna-13b-delta

I get this error:
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory ./llama-13b.

Where do I find the missing files? Or did I miss something in a previous step?

Thanks
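
The error suggests ./llama-13b does not contain Hugging Face-format weights. If it holds the original Meta checkpoint (consolidated.*.pth files), it likely needs to be converted first with transformers' convert_llama_weights_to_hf.py script (with --input_dir, --model_size 13B and --output_dir ./llama-13b); after that, apply_delta should find the pytorch_model.bin shards it expects.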

How does it work?

Thanks for the awesome job!

I have tried to understand how it works from the code, but it's hard for me :(

How does it describe what's happening in the images? What tech do you use? Thanks!

OOM / VRAM requirement?

Hello,

how much VRAM does this need?

I am getting OOM on my 3090 GPU w/ 24GB on the video_chat_with_MOSS.

OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 24.00 GiB total capacity; 22.78 GiB already
allocated; 0 bytes free; 22.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting
max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
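
The message itself points at the allocator knob; below is a minimal sketch of trying it from Python, assuming it is set before anything touches the GPU (whether 24 GB is enough for this pipeline at all is a separate question not answered in this issue):

import os

# The allocator reads PYTORCH_CUDA_ALLOC_CONF lazily, but it must be set before
# the first CUDA allocation happens, so set it before any model is created.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.total_memory // 2**20} MiB total on {props.name}")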

[PERFORMANCE_REPORT]+[OPTIMIZATION]/[SUGGESTION]

Sadly, I cannot get StableLM to work on a 1070 with 8 GB VRAM and 36 GB RAM. Sad to compile it all on Windows just to see it crash, but hey.
Here's a little treat for the authors, since you are the good kind that provides all the code and models so we can start doing things right away, unlike the bad people who don't provide models:

    print("Watching video...")
    data = loadvideo_decord_origin(video_path)
    progress(0.2, desc="Loading Videos")
    print("Step 1/4")
    # InternVideo
    action_index = np.linspace(0, len(data)-1, 8).astype(int)
    tmp,tmpa = [],[]
    for i,img in enumerate(data):
        tmp.append(transform(img).to(device).unsqueeze(0))
        if i in action_index:
            tmpa.append(topil(img))
    action_tensor = trans_action(tmpa)
    TC, H, W = action_tensor.shape
    action_tensor = action_tensor.reshape(1, TC//3, 3, H, W).permute(0, 2, 1, 3, 4).to(device)
    with torch.no_grad():
        prediction = intern_action(action_tensor)
        prediction = F.softmax(prediction, dim=1).flatten()
        prediction = kinetics_classnames[str(int(prediction.argmax()))]
    print("Step 2/4")
    # dense caption
    dense_caption = []
    dense_index = np.arange(0, len(data)-1, 5)
    original_images = data[dense_index,:,:,::-1]
    dcs = {}
    with torch.no_grad():
        for original_image in original_images:
            dense_caption.append(dense_caption_model.run_caption_tensor(original_image))
        #dense_caption = ' '.join([f"Second {i+1} : {j}.\n" for i,j in zip(dense_index,dense_caption)])
        for i,j in zip(dense_index,dense_caption):
            key = f"{i+1}"
            value = f"\n View at {i+1} seconds: {j}.\n"
            dcs[key] = value
    print("Step 3/4")  
    # Video Caption
    image = torch.cat(tmp).to(device)   
    model.threshold = 0.68
    if input_tag == '' or input_tag == 'none' or input_tag == 'None':
        input_tag_list = None
    else:
        input_tag_list = []
        input_tag_list.append(input_tag.replace(',',' | '))
    with torch.no_grad():
        caption, tag_predict = model.generate(image,tag_input = input_tag_list,max_length = 50, return_tag_predict = True)
        print("Step 4/4")
        progress(0.6, desc="Watching Videos")
        #frame_caption = ' '.join([f"Second {i+1}:{j}."+str(dcs.get(str(i+1), ""))+"\n" for i,j in enumerate(caption)])
        frame_caption = ""
        prev_caption = ""
        counter = 1
        for i, j in enumerate(caption):
            current_caption = f"{j}."
            current_dcs = dcs.get(f"{i+1}", "")
            if current_caption == prev_caption:
                frame_caption += f" {current_dcs}"
                counter += 1
            else:
                frame_caption += f"Second {i+1} - "
                frame_caption += f"{i+1+counter}:{current_caption}{current_dcs}"
                prev_caption = current_caption
        if input_tag_list == None:
            tag_1 = set(tag_predict)
            tag_2 = ['none']
        else:
            _, tag_1 = model.generate(image,tag_input = None, max_length = 50, return_tag_predict = True)
            tag_2 = set(tag_predict)
        progress(0.8, desc="Understanding Videos")
        
    print("[INFO]" + video_path + " Analyzed")
    print("[TAGS] "+ str( ' | '.join(tag_1) + ' | '.join(tag_2)))
    print(frame_caption)
    #print(frame_caption, dense_caption)

    del data, action_tensor, original_image, image,tmp,tmpa
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    return ' | '.join(tag_1),' | '.join(tag_2), frame_caption, dense_caption, gr.update(interactive = True), prediction

With this, the output that goes to the LLM is better compressed, like:

Mine:

Second 1 - 2:people walking up a hill towards a small plane being carried by a man.
 View at 1 seconds: man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
  Second 4 - 7:a man flying a blue and white kite over a hill.
 View at 6 seconds: blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.
Second 7 - 12:a bald head looking out over a hill at a man flying a kite with wings.

vs Original

 Second 1:people walking up a hill towards a small plane being carried by a man.
 Second 2:people walking up a hill towards a small plane being carried by a man.
 Second 3:people walking up a hill towards a small plane being carried by a man.
 Second 4:a man flying a blue and white kite over a hill.
 Second 5:a man flying a blue and white kite over a hill.
 Second 6:a man flying a blue and white kite over a hill.
 Second 7:a bald head looking out over a hill at a man flying a kite with wings.
 Second 8:a bald head looking out over a hill at a man flying a kite with wings.
 Second 9:a bald head looking out over a hill at a man flying a kite with wings.
Dense output:
 Second 1 : man wearing gray tshirt and black pants,a woman in a white shirt,a person standing,a person walking on the sand,child wearing a green shirt,a woman in a black shirt,a black horse pulling a cart,a person in the picture,blue and white surfboard,child wearing pink shirt,blue and white plane on top of hill.
 Second 6 : blue kite flying in the air,the grass is tall,a cloudy blue sky,a tree in the grass.

It's unnecessary to send all the repeats to the LLM; it just wastes tokens.

It seems to me you aren't using Whisper or anything to check for audio?
I've been working on something similar; I use a small local Whisper model and it works fine for getting transcripts.
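
A minimal sketch of that audio step, assuming the openai-whisper package is installed and ffmpeg can read the uploaded file; video_path mirrors the variable used in the snippet above:

import whisper

video_path = "example.mp4"                 # placeholder; the demo would pass the uploaded file's path
asr_model = whisper.load_model("small")    # "small" balances accuracy and VRAM; any local checkpoint works
result = asr_model.transcribe(video_path)  # whisper extracts the audio track via ffmpeg
transcript = result["text"]                # could be appended to frame_caption before prompting the LLM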

Demo cannot chat

After uploading a video, no chat box pops up.
[screenshot]

In the README's demo video, there should be a chat box here where you type the question you want to ask and click Run to chat, but why does nothing show up for me?

Error in the sentencepiece library

The main error is: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

The line that errors: File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/demo_video.py", line 58, in <module>
model = model_cls.from_config(model_config).to('cuda:1')
File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/minigpt4/models/mini_gpt4.py", line 243, in from_config
model = cls(
File "/opt/data/private/LH/Ask-Anything/video_miniGPT4/minigpt4/models/mini_gpt4.py", line 86, in __init__
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
Is it because my installed sentencepiece library is broken?
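
This error usually means the tokenizer.model file under the configured llama_model path is missing or corrupt (for example a Git LFS pointer instead of the real file, which is roughly 500 KB for LLaMA) rather than a broken sentencepiece install. A quick check, with the path below standing in for whatever llama_model points to:

import os
import sentencepiece as spm

tok_path = "/path/to/llama_model/tokenizer.model"   # hypothetical path: substitute the configured llama_model directory
print(os.path.getsize(tok_path))                    # a Git LFS pointer file is only ~130 bytes
sp = spm.SentencePieceProcessor()
sp.Load(tok_path)                                   # reproduces the ParseFromArray error if the file is truncated or not a real model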

Can't install detectron2.0

I followed the instructions below:
conda create -n chatvideo python=3.8.16
conda activate chatvideo

Clone the repository:

git clone https://github.com/OpenGVLab/Ask-Anything.git
cd ask-anything/video_chat

Install dependencies:

pip install -r requirements.txt
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

But when I run python -m pip install 'git+https://github.com/facebookresearch/detectron2.git', it raises an error like this:
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [23 lines of output]
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 172, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-req-build-z9sng6cp/setup.py", line 10, in <module>
      import torch
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 217, in <module>
      _load_global_deps()
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 178, in _load_global_deps
      _preload_cuda_deps()
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/torch/__init__.py", line 158, in _preload_cuda_deps
      ctypes.CDLL(cublas_path)
    File "/home/xxx/anaconda3/envs/chatvideo/lib/python3.8/ctypes/__init__.py", line 373, in __init__
      self._handle = _dlopen(self._name, mode)
  OSError: /home/xxx/anaconda3/envs/chatvideo/lib/python3.8/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtHSHMatmulAlgoInit version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

So I went to the detectron2 repository to find out how to install it correctly. But on the page https://github.com/facebookresearch/detectron2/blob/main/INSTALL.md, I found that "Install Pre-Built Detectron2 (Linux only)" doesn't match this project's torch version (this project uses 1.13, while the pre-built packages only support 1.8, 1.9, and 1.10).
The other way that page gives to install detectron2 is "Build Detectron2 from Source", but its first step is the same as yours: python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'.
So what should I do? Please give me some help.
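
One observation on the traceback above: it fails while setup.py merely imports torch (a libcublasLt symbol mismatch), so the torch/CUDA installation seems broken independently of detectron2. A quick way to confirm before retrying the build:

import torch   # if this bare import raises the same libcublasLt OSError, reinstall torch with a
               # build matching your CUDA runtime before attempting the detectron2 install again

print(torch.__version__, torch.version.cuda)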

Deploy `Ask-Anything` as APIs locally/on cloud using `langchain-serve`

Repo - langchain-serve.

  • Exposes APIs from function definitions locally as well as on the cloud.
  • Very few code changes are needed, and the development experience stays the same as local.
  • Supports both REST & WebSocket endpoints
  • Serverless/autoscaling endpoints with automatic tls certs on the cloud.
  • Real-time streaming, human-in-the-loop support - which is crucial for chatbots.

Disclaimer: I'm the primary author of langchain-serve. Would be happy to collaborate on this!

beam search, temperature, video segments

Can someone please help me understand what effect parameters like beam search, temperature, and video segments have on the VQA responses? I could not find this in the repo or in the InternVideo/VideoChat paper.
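
Roughly speaking, beam search and temperature are standard text-decoding knobs (they correspond to the usual Hugging Face generate() arguments), while the video-segments setting controls how many frames/clips are sampled from the video before decoding starts. A sketch with illustrative values; how the Gradio sliders are wired to these in the demo is not restated here:

# Illustrative Hugging Face generate() kwargs, not the demo's exact defaults.
beam_kwargs = dict(
    num_beams=5,        # beam search width: keep 5 candidate answers in parallel;
                        # larger = more thorough but slower and often blander
)
sample_kwargs = dict(
    do_sample=True,     # sample instead of always picking the most likely token
    temperature=0.8,    # <1 sharpens the next-token distribution (more deterministic),
                        # >1 flattens it (more varied but more error-prone)
)
# outputs = model.generate(**inputs, **beam_kwargs, **sample_kwargs)   # hypothetical call site
# "Video segments" is not a decoding knob: it sets how many frames are sampled and
# embedded, so more segments give finer temporal coverage but a longer visual prompt.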

video_chat/demo.py does not match the description in the paper

Thanks for releasing this excellent work to the public! I run video_chat/demo.py, and it works satisfactorily.

However, when I look at the code, it does not work like the description in the paper.
Based on Figure 1, the video content should be represented by both a video description (from VideoChat-Text) and a video embedding (from VideoChat-Embed). But in the code, you only use the video embedding to describe the video content.
Is that correct? Or am I missing something? If so, why is the textual video description not used anymore? And where can I find the code to generate a video description?

Once again, I appreciate you open-sourcing this great work, and I look forward to your response.

The latest video_chat gets stuck at Loading LLAMA

It just hangs there and does not move on; any advice would be appreciated.
Single 3090 GPU.

Initializing VideoChat
Loading VIT. Use fp16: fp32
Temporal downsample: False
No L_MHRA: True
Double L_MHRA: False
GMHRA index: [38, 37, 36, 35, 34, 33, 32, 31]
GMHRA dropout: 0.5
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Drop path rate: 0.0
Load ViT model from: D:/GIT_Project/Ask-Anything/Bilp2/eva_vit_g.pth
Inflate: patch_embed.proj.weight, torch.Size([1408, 3, 14, 14]) => torch.Size([1408, 3, 1, 14, 14])
Init center: True
_IncompatibleKeys(missing_keys=['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.weight', 'head.bias', 'blocks.39.norm1.weight', 'blocks.39.norm1.bias', 'blocks.39.attn.q_bias', 'blocks.39.attn.v_bias', 'blocks.39.attn.qkv.weight', 'blocks.39.attn.proj.weight', 'blocks.39.attn.proj.bias', 
'blocks.39.norm2.weight', 'blocks.39.norm2.bias', 'blocks.39.mlp.fc1.weight', 'blocks.39.mlp.fc1.bias', 'blocks.39.mlp.fc2.weight', 'blocks.39.mlp.fc2.bias'])
freeze vision encoder
open module: ['gmhra_cls_token', 'gmhra.0.dpe.weight', 'gmhra.0.dpe.bias', 'gmhra.0.attn.in_proj_weight', 'gmhra.0.attn.in_proj_bias', 'gmhra.0.attn.out_proj.weight', 'gmhra.0.attn.out_proj.bias', 'gmhra.0.ln_1.weight', 'gmhra.0.ln_1.bias', 'gmhra.0.mlp.c_fc.weight', 'gmhra.0.mlp.c_fc.bias', 'gmhra.0.mlp.c_proj.weight', 'gmhra.0.mlp.c_proj.bias', 'gmhra.0.ln_2.weight', 'gmhra.0.ln_2.bias', 'gmhra.0.ln_3.weight', 'gmhra.0.ln_3.bias', 'gmhra.1.dpe.weight', 'gmhra.1.dpe.bias', 'gmhra.1.attn.in_proj_weight', 'gmhra.1.attn.in_proj_bias', 'gmhra.1.attn.out_proj.weight', 'gmhra.1.attn.out_proj.bias', 'gmhra.1.ln_1.weight', 'gmhra.1.ln_1.bias', 'gmhra.1.mlp.c_fc.weight', 'gmhra.1.mlp.c_fc.bias', 'gmhra.1.mlp.c_proj.weight', 'gmhra.1.mlp.c_proj.bias', 'gmhra.1.ln_2.weight', 'gmhra.1.ln_2.bias', 'gmhra.1.ln_3.weight', 'gmhra.1.ln_3.bias', 'gmhra.2.dpe.weight', 'gmhra.2.dpe.bias', 'gmhra.2.attn.in_proj_weight', 'gmhra.2.attn.in_proj_bias', 'gmhra.2.attn.out_proj.weight', 'gmhra.2.attn.out_proj.bias', 'gmhra.2.ln_1.weight', 'gmhra.2.ln_1.bias', 'gmhra.2.mlp.c_fc.weight', 'gmhra.2.mlp.c_fc.bias', 'gmhra.2.mlp.c_proj.weight', 'gmhra.2.mlp.c_proj.bias', 'gmhra.2.ln_2.weight', 'gmhra.2.ln_2.bias', 'gmhra.2.ln_3.weight', 'gmhra.2.ln_3.bias', 'gmhra.3.dpe.weight', 'gmhra.3.dpe.bias', 'gmhra.3.attn.in_proj_weight', 'gmhra.3.attn.in_proj_bias', 'gmhra.3.attn.out_proj.weight', 'gmhra.3.attn.out_proj.bias', 'gmhra.3.ln_1.weight', 'gmhra.3.ln_1.bias', 'gmhra.3.mlp.c_fc.weight', 'gmhra.3.mlp.c_fc.bias', 'gmhra.3.mlp.c_proj.weight', 'gmhra.3.mlp.c_proj.bias', 'gmhra.3.ln_2.weight', 'gmhra.3.ln_2.bias', 'gmhra.3.ln_3.weight', 'gmhra.3.ln_3.bias', 'gmhra.4.dpe.weight', 'gmhra.4.dpe.bias', 'gmhra.4.attn.in_proj_weight', 'gmhra.4.attn.in_proj_bias', 'gmhra.4.attn.out_proj.weight', 'gmhra.4.attn.out_proj.bias', 'gmhra.4.ln_1.weight', 'gmhra.4.ln_1.bias', 'gmhra.4.mlp.c_fc.weight', 'gmhra.4.mlp.c_fc.bias', 'gmhra.4.mlp.c_proj.weight', 'gmhra.4.mlp.c_proj.bias', 'gmhra.4.ln_2.weight', 'gmhra.4.ln_2.bias', 'gmhra.4.ln_3.weight', 'gmhra.4.ln_3.bias', 'gmhra.5.dpe.weight', 'gmhra.5.dpe.bias', 'gmhra.5.attn.in_proj_weight', 'gmhra.5.attn.in_proj_bias', 'gmhra.5.attn.out_proj.weight', 'gmhra.5.attn.out_proj.bias', 'gmhra.5.ln_1.weight', 'gmhra.5.ln_1.bias', 'gmhra.5.mlp.c_fc.weight', 'gmhra.5.mlp.c_fc.bias', 'gmhra.5.mlp.c_proj.weight', 'gmhra.5.mlp.c_proj.bias', 'gmhra.5.ln_2.weight', 'gmhra.5.ln_2.bias', 'gmhra.5.ln_3.weight', 'gmhra.5.ln_3.bias', 'gmhra.6.dpe.weight', 'gmhra.6.dpe.bias', 'gmhra.6.attn.in_proj_weight', 'gmhra.6.attn.in_proj_bias', 'gmhra.6.attn.out_proj.weight', 'gmhra.6.attn.out_proj.bias', 'gmhra.6.ln_1.weight', 'gmhra.6.ln_1.bias', 'gmhra.6.mlp.c_fc.weight', 'gmhra.6.mlp.c_fc.bias', 'gmhra.6.mlp.c_proj.weight', 'gmhra.6.mlp.c_proj.bias', 'gmhra.6.ln_2.weight', 'gmhra.6.ln_2.bias', 'gmhra.6.ln_3.weight', 'gmhra.6.ln_3.bias', 'gmhra.7.dpe.weight', 'gmhra.7.dpe.bias', 'gmhra.7.attn.in_proj_weight', 'gmhra.7.attn.in_proj_bias', 'gmhra.7.attn.out_proj.weight', 'gmhra.7.attn.out_proj.bias', 'gmhra.7.ln_1.weight', 'gmhra.7.ln_1.bias', 'gmhra.7.mlp.c_fc.weight', 'gmhra.7.mlp.c_fc.bias', 'gmhra.7.mlp.c_proj.weight', 'gmhra.7.mlp.c_proj.bias', 'gmhra.7.ln_2.weight', 'gmhra.7.ln_2.bias', 'gmhra.7.ln_3.weight', 'gmhra.7.ln_3.bias']
open ln_vision
Loading VIT Done
Loading Q-Former
Drop_path:[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
BertConfig {
  "add_cross_attention": true,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.0,
  "classifier_dropout": null,
  "cross_attention_freq": 2,
  "drop_path_list": [
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0,
    0.0
  ],
  "encoder_width": 1408,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "query_length": 32,
  "transformers_version": "4.28.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Load QFormer from D:/GIT_Project/Ask-Anything/Bilp2/blip2_pretrained_flant5xxl.pth
_IncompatibleKeys(missing_keys=['visual_encoder.cls_token', 'visual_encoder.pos_embed', 'visual_encoder.gmhra_cls_token', 'visual_encoder.patch_embed.proj.weight', 'visual_encoder.patch_embed.proj.bias', 'visual_encoder.blocks.0.norm1.weight', 'visual_encoder.blocks.0.norm1.bias', 'visual_encoder.blocks.0.attn.q_bias', 'visual_encoder.blocks.0.attn.v_bias', 'visual_encoder.blocks.0.attn.qkv.weight', 'visual_encoder.blocks.0.attn.proj.weight', 'visual_encoder.blocks.0.attn.proj.bias', 'visual_encoder.blocks.0.norm2.weight', 'visual_encoder.blocks.0.norm2.bias', 'visual_encoder.blocks.0.mlp.fc1.weight', 'visual_encoder.blocks.0.mlp.fc1.bias', 'visual_encoder.blocks.0.mlp.fc2.weight', 'visual_encoder.blocks.0.mlp.fc2.bias', 'visual_encoder.blocks.1.norm1.weight', 'visual_encoder.blocks.1.norm1.bias', 'visual_encoder.blocks.1.attn.q_bias', 'visual_encoder.blocks.1.attn.v_bias', 'visual_encoder.blocks.1.attn.qkv.weight', 'visual_encoder.blocks.1.attn.proj.weight', 'visual_encoder.blocks.1.attn.proj.bias', 'visual_encoder.blocks.1.norm2.weight', 'visual_encoder.blocks.1.norm2.bias', 'visual_encoder.blocks.1.mlp.fc1.weight', 'visual_encoder.blocks.1.mlp.fc1.bias', 'visual_encoder.blocks.1.mlp.fc2.weight', 'visual_encoder.blocks.1.mlp.fc2.bias', 'visual_encoder.blocks.2.norm1.weight', 'visual_encoder.blocks.2.norm1.bias', 'visual_encoder.blocks.2.attn.q_bias', 'visual_encoder.blocks.2.attn.v_bias', 'visual_encoder.blocks.2.attn.qkv.weight', 'visual_encoder.blocks.2.attn.proj.weight', 'visual_encoder.blocks.2.attn.proj.bias', 'visual_encoder.blocks.2.norm2.weight', 'visual_encoder.blocks.2.norm2.bias', 'visual_encoder.blocks.2.mlp.fc1.weight', 'visual_encoder.blocks.2.mlp.fc1.bias', 'visual_encoder.blocks.2.mlp.fc2.weight', 'visual_encoder.blocks.2.mlp.fc2.bias', 'visual_encoder.blocks.3.norm1.weight', 'visual_encoder.blocks.3.norm1.bias', 'visual_encoder.blocks.3.attn.q_bias', 'visual_encoder.blocks.3.attn.v_bias', 'visual_encoder.blocks.3.attn.qkv.weight', 'visual_encoder.blocks.3.attn.proj.weight', 'visual_encoder.blocks.3.attn.proj.bias', 'visual_encoder.blocks.3.norm2.weight', 'visual_encoder.blocks.3.norm2.bias', 'visual_encoder.blocks.3.mlp.fc1.weight', 'visual_encoder.blocks.3.mlp.fc1.bias', 'visual_encoder.blocks.3.mlp.fc2.weight', 'visual_encoder.blocks.3.mlp.fc2.bias', 'visual_encoder.blocks.4.norm1.weight', 'visual_encoder.blocks.4.norm1.bias', 'visual_encoder.blocks.4.attn.q_bias', 'visual_encoder.blocks.4.attn.v_bias', 'visual_encoder.blocks.4.attn.qkv.weight', 'visual_encoder.blocks.4.attn.proj.weight', 'visual_encoder.blocks.4.attn.proj.bias', 'visual_encoder.blocks.4.norm2.weight', 'visual_encoder.blocks.4.norm2.bias', 'visual_encoder.blocks.4.mlp.fc1.weight', 'visual_encoder.blocks.4.mlp.fc1.bias', 'visual_encoder.blocks.4.mlp.fc2.weight', 'visual_encoder.blocks.4.mlp.fc2.bias', 'visual_encoder.blocks.5.norm1.weight', 'visual_encoder.blocks.5.norm1.bias', 'visual_encoder.blocks.5.attn.q_bias', 'visual_encoder.blocks.5.attn.v_bias', 'visual_encoder.blocks.5.attn.qkv.weight', 'visual_encoder.blocks.5.attn.proj.weight', 'visual_encoder.blocks.5.attn.proj.bias', 'visual_encoder.blocks.5.norm2.weight', 'visual_encoder.blocks.5.norm2.bias', 'visual_encoder.blocks.5.mlp.fc1.weight', 'visual_encoder.blocks.5.mlp.fc1.bias', 'visual_encoder.blocks.5.mlp.fc2.weight', 'visual_encoder.blocks.5.mlp.fc2.bias', 'visual_encoder.blocks.6.norm1.weight', 'visual_encoder.blocks.6.norm1.bias', 'visual_encoder.blocks.6.attn.q_bias', 'visual_encoder.blocks.6.attn.v_bias', 
'visual_encoder.blocks.6.attn.qkv.weight', 'visual_encoder.blocks.6.attn.proj.weight', 'visual_encoder.blocks.6.attn.proj.bias', 'visual_encoder.blocks.6.norm2.weight', 'visual_encoder.blocks.6.norm2.bias', 'visual_encoder.blocks.6.mlp.fc1.weight', 'visual_encoder.blocks.6.mlp.fc1.bias', 'visual_encoder.blocks.6.mlp.fc2.weight', 'visual_encoder.blocks.6.mlp.fc2.bias', 'visual_encoder.blocks.7.norm1.weight', 'visual_encoder.blocks.7.norm1.bias', 'visual_encoder.blocks.7.attn.q_bias', 'visual_encoder.blocks.7.attn.v_bias', 'visual_encoder.blocks.7.attn.qkv.weight', 'visual_encoder.blocks.7.attn.proj.weight', 'visual_encoder.blocks.7.attn.proj.bias', 'visual_encoder.blocks.7.norm2.weight', 'visual_encoder.blocks.7.norm2.bias', 'visual_encoder.blocks.7.mlp.fc1.weight', 'visual_encoder.blocks.7.mlp.fc1.bias', 'visual_encoder.blocks.7.mlp.fc2.weight', 'visual_encoder.blocks.7.mlp.fc2.bias', 'visual_encoder.blocks.8.norm1.weight', 'visual_encoder.blocks.8.norm1.bias', 'visual_encoder.blocks.8.attn.q_bias', 'visual_encoder.blocks.8.attn.v_bias', 'visual_encoder.blocks.8.attn.qkv.weight', 'visual_encoder.blocks.8.attn.proj.weight', 'visual_encoder.blocks.8.attn.proj.bias', 'visual_encoder.blocks.8.norm2.weight', 'visual_encoder.blocks.8.norm2.bias', 'visual_encoder.blocks.8.mlp.fc1.weight', 'visual_encoder.blocks.8.mlp.fc1.bias', 'visual_encoder.blocks.8.mlp.fc2.weight', 'visual_encoder.blocks.8.mlp.fc2.bias', 'visual_encoder.blocks.9.norm1.weight', 'visual_encoder.blocks.9.norm1.bias', 'visual_encoder.blocks.9.attn.q_bias', 'visual_encoder.blocks.9.attn.v_bias', 'visual_encoder.blocks.9.attn.qkv.weight', 'visual_encoder.blocks.9.attn.proj.weight', 'visual_encoder.blocks.9.attn.proj.bias', 'visual_encoder.blocks.9.norm2.weight', 'visual_encoder.blocks.9.norm2.bias', 'visual_encoder.blocks.9.mlp.fc1.weight', 'visual_encoder.blocks.9.mlp.fc1.bias', 'visual_encoder.blocks.9.mlp.fc2.weight', 'visual_encoder.blocks.9.mlp.fc2.bias', 'visual_encoder.blocks.10.norm1.weight', 'visual_encoder.blocks.10.norm1.bias', 'visual_encoder.blocks.10.attn.q_bias', 'visual_encoder.blocks.10.attn.v_bias', 'visual_encoder.blocks.10.attn.qkv.weight', 'visual_encoder.blocks.10.attn.proj.weight', 'visual_encoder.blocks.10.attn.proj.bias', 'visual_encoder.blocks.10.norm2.weight', 'visual_encoder.blocks.10.norm2.bias', 'visual_encoder.blocks.10.mlp.fc1.weight', 'visual_encoder.blocks.10.mlp.fc1.bias', 'visual_encoder.blocks.10.mlp.fc2.weight', 'visual_encoder.blocks.10.mlp.fc2.bias', 'visual_encoder.blocks.11.norm1.weight', 'visual_encoder.blocks.11.norm1.bias', 'visual_encoder.blocks.11.attn.q_bias', 'visual_encoder.blocks.11.attn.v_bias', 'visual_encoder.blocks.11.attn.qkv.weight', 'visual_encoder.blocks.11.attn.proj.weight', 'visual_encoder.blocks.11.attn.proj.bias', 'visual_encoder.blocks.11.norm2.weight', 'visual_encoder.blocks.11.norm2.bias', 'visual_encoder.blocks.11.mlp.fc1.weight', 'visual_encoder.blocks.11.mlp.fc1.bias', 'visual_encoder.blocks.11.mlp.fc2.weight', 'visual_encoder.blocks.11.mlp.fc2.bias', 'visual_encoder.blocks.12.norm1.weight', 'visual_encoder.blocks.12.norm1.bias', 'visual_encoder.blocks.12.attn.q_bias', 'visual_encoder.blocks.12.attn.v_bias', 'visual_encoder.blocks.12.attn.qkv.weight', 'visual_encoder.blocks.12.attn.proj.weight', 'visual_encoder.blocks.12.attn.proj.bias', 'visual_encoder.blocks.12.norm2.weight', 'visual_encoder.blocks.12.norm2.bias', 'visual_encoder.blocks.12.mlp.fc1.weight', 'visual_encoder.blocks.12.mlp.fc1.bias', 'visual_encoder.blocks.12.mlp.fc2.weight', 
'visual_encoder.blocks.12.mlp.fc2.bias', 'visual_encoder.blocks.13.norm1.weight', 'visual_encoder.blocks.13.norm1.bias', 'visual_encoder.blocks.13.attn.q_bias', 'visual_encoder.blocks.13.attn.v_bias', 'visual_encoder.blocks.13.attn.qkv.weight', 'visual_encoder.blocks.13.attn.proj.weight', 'visual_encoder.blocks.13.attn.proj.bias', 'visual_encoder.blocks.13.norm2.weight', 'visual_encoder.blocks.13.norm2.bias', 'visual_encoder.blocks.13.mlp.fc1.weight', 'visual_encoder.blocks.13.mlp.fc1.bias', 'visual_encoder.blocks.13.mlp.fc2.weight', 'visual_encoder.blocks.13.mlp.fc2.bias', 'visual_encoder.blocks.14.norm1.weight', 'visual_encoder.blocks.14.norm1.bias', 'visual_encoder.blocks.14.attn.q_bias', 'visual_encoder.blocks.14.attn.v_bias', 'visual_encoder.blocks.14.attn.qkv.weight', 'visual_encoder.blocks.14.attn.proj.weight', 'visual_encoder.blocks.14.attn.proj.bias', 'visual_encoder.blocks.14.norm2.weight', 'visual_encoder.blocks.14.norm2.bias', 'visual_encoder.blocks.14.mlp.fc1.weight', 'visual_encoder.blocks.14.mlp.fc1.bias', 'visual_encoder.blocks.14.mlp.fc2.weight', 'visual_encoder.blocks.14.mlp.fc2.bias', 'visual_encoder.blocks.15.norm1.weight', 'visual_encoder.blocks.15.norm1.bias', 'visual_encoder.blocks.15.attn.q_bias', 'visual_encoder.blocks.15.attn.v_bias', 'visual_encoder.blocks.15.attn.qkv.weight', 'visual_encoder.blocks.15.attn.proj.weight', 'visual_encoder.blocks.15.attn.proj.bias', 'visual_encoder.blocks.15.norm2.weight', 'visual_encoder.blocks.15.norm2.bias', 'visual_encoder.blocks.15.mlp.fc1.weight', 'visual_encoder.blocks.15.mlp.fc1.bias', 'visual_encoder.blocks.15.mlp.fc2.weight', 'visual_encoder.blocks.15.mlp.fc2.bias', 'visual_encoder.blocks.16.norm1.weight', 'visual_encoder.blocks.16.norm1.bias', 'visual_encoder.blocks.16.attn.q_bias', 'visual_encoder.blocks.16.attn.v_bias', 'visual_encoder.blocks.16.attn.qkv.weight', 'visual_encoder.blocks.16.attn.proj.weight', 'visual_encoder.blocks.16.attn.proj.bias', 'visual_encoder.blocks.16.norm2.weight', 'visual_encoder.blocks.16.norm2.bias', 'visual_encoder.blocks.16.mlp.fc1.weight', 'visual_encoder.blocks.16.mlp.fc1.bias', 'visual_encoder.blocks.16.mlp.fc2.weight', 'visual_encoder.blocks.16.mlp.fc2.bias', 'visual_encoder.blocks.17.norm1.weight', 'visual_encoder.blocks.17.norm1.bias', 'visual_encoder.blocks.17.attn.q_bias', 'visual_encoder.blocks.17.attn.v_bias', 'visual_encoder.blocks.17.attn.qkv.weight', 'visual_encoder.blocks.17.attn.proj.weight', 'visual_encoder.blocks.17.attn.proj.bias', 'visual_encoder.blocks.17.norm2.weight', 'visual_encoder.blocks.17.norm2.bias', 'visual_encoder.blocks.17.mlp.fc1.weight', 'visual_encoder.blocks.17.mlp.fc1.bias', 'visual_encoder.blocks.17.mlp.fc2.weight', 'visual_encoder.blocks.17.mlp.fc2.bias', 'visual_encoder.blocks.18.norm1.weight', 'visual_encoder.blocks.18.norm1.bias', 'visual_encoder.blocks.18.attn.q_bias', 'visual_encoder.blocks.18.attn.v_bias', 'visual_encoder.blocks.18.attn.qkv.weight', 'visual_encoder.blocks.18.attn.proj.weight', 'visual_encoder.blocks.18.attn.proj.bias', 'visual_encoder.blocks.18.norm2.weight', 'visual_encoder.blocks.18.norm2.bias', 'visual_encoder.blocks.18.mlp.fc1.weight', 'visual_encoder.blocks.18.mlp.fc1.bias', 'visual_encoder.blocks.18.mlp.fc2.weight', 'visual_encoder.blocks.18.mlp.fc2.bias', 'visual_encoder.blocks.19.norm1.weight', 'visual_encoder.blocks.19.norm1.bias', 'visual_encoder.blocks.19.attn.q_bias', 'visual_encoder.blocks.19.attn.v_bias', 'visual_encoder.blocks.19.attn.qkv.weight', 'visual_encoder.blocks.19.attn.proj.weight', 
'visual_encoder.blocks.19.attn.proj.bias', 'visual_encoder.blocks.19.norm2.weight', 'visual_encoder.blocks.19.norm2.bias', 'visual_encoder.blocks.19.mlp.fc1.weight', 'visual_encoder.blocks.19.mlp.fc1.bias', 'visual_encoder.blocks.19.mlp.fc2.weight', 'visual_encoder.blocks.19.mlp.fc2.bias', 'visual_encoder.blocks.20.norm1.weight', 'visual_encoder.blocks.20.norm1.bias', 'visual_encoder.blocks.20.attn.q_bias', 'visual_encoder.blocks.20.attn.v_bias', 'visual_encoder.blocks.20.attn.qkv.weight', 'visual_encoder.blocks.20.attn.proj.weight', 'visual_encoder.blocks.20.attn.proj.bias', 'visual_encoder.blocks.20.norm2.weight', 'visual_encoder.blocks.20.norm2.bias', 'visual_encoder.blocks.20.mlp.fc1.weight', 'visual_encoder.blocks.20.mlp.fc1.bias', 'visual_encoder.blocks.20.mlp.fc2.weight', 'visual_encoder.blocks.20.mlp.fc2.bias', 'visual_encoder.blocks.21.norm1.weight', 'visual_encoder.blocks.21.norm1.bias', 'visual_encoder.blocks.21.attn.q_bias', 'visual_encoder.blocks.21.attn.v_bias', 'visual_encoder.blocks.21.attn.qkv.weight', 'visual_encoder.blocks.21.attn.proj.weight', 'visual_encoder.blocks.21.attn.proj.bias', 'visual_encoder.blocks.21.norm2.weight', 'visual_encoder.blocks.21.norm2.bias', 'visual_encoder.blocks.21.mlp.fc1.weight', 'visual_encoder.blocks.21.mlp.fc1.bias', 'visual_encoder.blocks.21.mlp.fc2.weight', 'visual_encoder.blocks.21.mlp.fc2.bias', 'visual_encoder.blocks.22.norm1.weight', 'visual_encoder.blocks.22.norm1.bias', 'visual_encoder.blocks.22.attn.q_bias', 'visual_encoder.blocks.22.attn.v_bias', 'visual_encoder.blocks.22.attn.qkv.weight', 'visual_encoder.blocks.22.attn.proj.weight', 'visual_encoder.blocks.22.attn.proj.bias', 'visual_encoder.blocks.22.norm2.weight', 'visual_encoder.blocks.22.norm2.bias', 'visual_encoder.blocks.22.mlp.fc1.weight', 'visual_encoder.blocks.22.mlp.fc1.bias', 'visual_encoder.blocks.22.mlp.fc2.weight', 'visual_encoder.blocks.22.mlp.fc2.bias', 'visual_encoder.blocks.23.norm1.weight', 'visual_encoder.blocks.23.norm1.bias', 'visual_encoder.blocks.23.attn.q_bias', 'visual_encoder.blocks.23.attn.v_bias', 'visual_encoder.blocks.23.attn.qkv.weight', 'visual_encoder.blocks.23.attn.proj.weight', 'visual_encoder.blocks.23.attn.proj.bias', 'visual_encoder.blocks.23.norm2.weight', 'visual_encoder.blocks.23.norm2.bias', 'visual_encoder.blocks.23.mlp.fc1.weight', 'visual_encoder.blocks.23.mlp.fc1.bias', 'visual_encoder.blocks.23.mlp.fc2.weight', 'visual_encoder.blocks.23.mlp.fc2.bias', 'visual_encoder.blocks.24.norm1.weight', 'visual_encoder.blocks.24.norm1.bias', 'visual_encoder.blocks.24.attn.q_bias', 'visual_encoder.blocks.24.attn.v_bias', 'visual_encoder.blocks.24.attn.qkv.weight', 'visual_encoder.blocks.24.attn.proj.weight', 'visual_encoder.blocks.24.attn.proj.bias', 'visual_encoder.blocks.24.norm2.weight', 'visual_encoder.blocks.24.norm2.bias', 'visual_encoder.blocks.24.mlp.fc1.weight', 'visual_encoder.blocks.24.mlp.fc1.bias', 'visual_encoder.blocks.24.mlp.fc2.weight', 'visual_encoder.blocks.24.mlp.fc2.bias', 'visual_encoder.blocks.25.norm1.weight', 'visual_encoder.blocks.25.norm1.bias', 'visual_encoder.blocks.25.attn.q_bias', 'visual_encoder.blocks.25.attn.v_bias', 'visual_encoder.blocks.25.attn.qkv.weight', 'visual_encoder.blocks.25.attn.proj.weight', 'visual_encoder.blocks.25.attn.proj.bias', 'visual_encoder.blocks.25.norm2.weight', 'visual_encoder.blocks.25.norm2.bias', 'visual_encoder.blocks.25.mlp.fc1.weight', 'visual_encoder.blocks.25.mlp.fc1.bias', 'visual_encoder.blocks.25.mlp.fc2.weight', 'visual_encoder.blocks.25.mlp.fc2.bias', 
'visual_encoder.blocks.26.*' … 'visual_encoder.blocks.38.*' (norm1/norm2, attn.q_bias, attn.v_bias, attn.qkv, attn.proj, mlp.fc1 and mlp.fc2 weights and biases for each of blocks 26-38), 'visual_encoder.gmhra.0.*' … 'visual_encoder.gmhra.7.*' (dpe, attn.in_proj, attn.out_proj, ln_1/ln_2/ln_3, mlp.c_fc and mlp.c_proj weights and biases for each of the 8 gmhra blocks)], unexpected_keys=['t5_proj.weight', 't5_proj.bias'])
Add extra 64 tokens in QFormer
freeze Qformer
Loading Q-Former Done
Loading LLAMA

I built it on my own server, but QA fails when I change gpt-4 to gpt-3.5-turbo.


I just changed the code here:

self.llm = OpenAI(temperature=0, openai_api_key=openai_api_key,model_name="gpt-4")

changing gpt-4 to gpt-3.5-turbo,

and then clicked Run.

Then when I try to chat with it, something seems wrong with the langchain input parameters. I'm not very familiar with the langchain way of using the OpenAI API.

> Entering new AgentExecutor chain...
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1108, in process_api
    result = await self.call_function(
  File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 915, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/root/Ask-Anything/video_chat/chatbot.py", line 33, in run_text
    res = self.agent({"input": text.strip()})
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 139, in __call__
    outputs = self._call(inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 497, in _call
    next_step_output = self._take_next_step(
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 406, in _take_next_step
    output = self.agent.plan(intermediate_steps, **inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 102, in plan
    action = self._get_next_action(full_inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/agents/agent.py", line 63, in _get_next_action
    full_output = self.llm_chain.predict(**full_inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 154, in predict
    return self(kwargs)[self.output_key]
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/base.py", line 139, in __call__
    outputs = self._call(inputs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 135, in _call
    return self.apply([inputs])[0]
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 117, in apply
    response = self.generate(input_list)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/chains/llm.py", line 59, in generate
    response = self.llm.generate(prompts, stop=stop)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/base.py", line 128, in generate
    raise e
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/base.py", line 125, in generate
    output = self._generate(prompts, stop=stop)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 259, in _generate
    response = self.completion_with_retry(prompt=_prompts, **params)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 206, in completion_with_retry
    return _completion_with_retry(**kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
  File "/root/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/root/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/root/miniconda3/lib/python3.10/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/langchain/llms/openai.py", line 204, in _completion_with_retry
    return self.client.create(**kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_resources/completion.py", line 25, in create
    return super().create(*args, **kwargs)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/root/miniconda3/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: 'messages' is a required property

Any advice on solving the problem?
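
One thing worth checking (a guess, not a confirmed fix for this repo): the OpenAI API returns "'messages' is a required property" when a chat-completions model such as gpt-3.5-turbo is called through the plain completions endpoint, which is what langchain's OpenAI wrapper uses. A minimal sketch, assuming a langchain version that ships ChatOpenAI, would route the chat model through the chat wrapper instead:

# Sketch: replace the completions-style wrapper with the chat wrapper for
# chat models such as gpt-3.5-turbo. Whether the surrounding agent code
# accepts a chat model unchanged depends on the langchain version in use.
from langchain.chat_models import ChatOpenAI

self.llm = ChatOpenAI(
    temperature=0,
    openai_api_key=openai_api_key,
    model_name="gpt-3.5-turbo",  # chat model -> chat endpoint
)

This is a sketch under those assumptions, not the repo's documented configuration.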

Dependency conflict on click for detectron2

When installing detectron2 on video_chat_with_MOSS I get:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
typer 0.3.2 requires click<7.2.0,>=7.1.1, but you have click 8.1.3 which is incompatible.

After downgrading click to 7.1.1 I get:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
black 23.3.0 requires click>=8.0.0, but you have click 7.1.1 which is incompatible.
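
For what it's worth, the two pins clash directly: typer 0.3.2 wants click<7.2.0 while black 23.3.0 wants click>=8.0.0, so one of those two packages has to move rather than click itself (recent typer releases accept click 8, so upgrading typer, or removing whichever of the two isn't actually needed, may be enough; treat that as a suggestion, not a tested fix for this repo). A stdlib-only sketch for listing every installed package that pins click, to decide which side to change:

# Stdlib-only sketch: print each installed distribution that declares a
# dependency on click, together with the declared version range.
import re
from importlib.metadata import distributions

for dist in distributions():
    for req in dist.requires or []:
        # requirement strings look like "click (>=7.1.1,<7.2.0)" or "click>=8.0.0"
        name = re.split(r"[\s(<>=!;~\[]", req, maxsplit=1)[0]
        if name.lower() == "click":
            print(dist.metadata["Name"], "->", req)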

[Feature Request] Live Stream Video with Adjustable Prompt in Realtime 🔥

Hi there!
First of all, let me say, this is cutting edge stuff, amazing

Wanted to ask: how can we do this on live video? And what should the expected fps be? Aiming for realtime 30+ fps, but even 10 fps could work.
The idea is to set a prompt (that can be changed dynamically mid-run) and have every frame of the video answered against that prompt.
Let me give you some examples:

[Video stream of a dog's water bowl with a tap directly above it]
You have access to an IoT water tap, and your responsibility is to monitor the water level in this container, making sure it is not overflowing or being filled while the container is absent. Trigger the fill action when the level is lower than 20% and stop at 90%. Take a stop action immediately if there's a spill.
Your output should be structured as follows:
{
container_present: {true/false}
container_offcenter: {int in cm / none}
water_level: {int in percent ranging 0-100%}
action: {idle/fill/stop}
event:{filling/full/empty/spill/dog_drinking}
}

  • can stream from wifi camera and make sure my dogs always have full water 😇

[Live Video of a tree with fruits on it]
{{this prompt can change based on current event and objective status, but for example}}
You are remotely controlling an agricultural robot with the capacity to pick fruits.
Current objective:

  • Locate the biggest cluster of ripe fruits on the tree in front of us
  • Give directions to the robot to turn slightly left or right based on which side of the frame's center the cluster is on
  • Give instructions to move forward and approach the cluster, stopping when within 1 meter of the fruits.
  • Make sure the path is free from obstacles, ropes, and potholes, or navigate around them.
  • Add observation notes
    Your output should be structured as follows:
    {
    ripe_cluster_size:{int count of ripe fruits in current cluster}
    turn: {left/right/center/up/down}
    travel: {stop/forward/backward}
    objectives_completed: {false/true}
    notes:{string with relevant events and information and agricultural insights}
    }
  • Generalized Agricultural autopilot based on dynamic general objectives 🔥

[Live Video of walking into a grocery shop, walking and occasionally zooming in on products, their prices and ingredients]
{Ask Anything app is open and I'm streaming from my phone and asking different questions along the way, each time referring to something else. Using speech2text}
Hey what is this? Is it any good?
Which other things should I get here if I want to make sauce for this?
How much does that cost?
This is my entire cart [goes off listing and showing items] how much do you estimate my final cart price will be at checkout? (Make a bill of all items and sum their total)

Yeah, I hope you see what I mean; really mind-blowing imo.
This could be the next phase shift.
Let me know what you think, if you like it, and how we can make this work.

Thanks a lot and have a good one!
All the best! 💜
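
As a rough, hypothetical illustration of the per-frame prompt loop this request describes (nothing below exists in this repo: capture_frame, query_video_model, and the trimmed-down JSON schema are placeholders):

# Hypothetical per-frame loop: send each frame plus the current prompt to a
# video-language model and act on its structured JSON reply.
import json
import time

PROMPT = ('You monitor a dog\'s water bowl. Reply ONLY with JSON: '
          '{"container_present": bool, "water_level": int, "action": "idle|fill|stop"}')

def run_live(capture_frame, query_video_model, target_fps=10):
    interval = 1.0 / target_fps
    while True:
        frame = capture_frame()                  # e.g. a frame from a wifi camera
        raw = query_video_model(frame, PROMPT)   # placeholder model call
        try:
            state = json.loads(raw)
        except json.JSONDecodeError:
            time.sleep(interval)
            continue                             # skip malformed replies
        if state.get("action") == "fill":
            pass                                 # trigger the IoT tap here
        elif state.get("action") == "stop":
            pass                                 # close the tap here
        time.sleep(interval)

In practice the achievable fps would be bounded by model latency rather than this loop, so realtime 30 fps is optimistic for current video-language models; 1-10 fps with frame skipping is a more realistic starting point.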

OOM on videochat training

Hello, thanks for releasing this excellent work to the public!

I am training videochat stage 2 with vicuna-13b on a 40G A100, using the default config:

use_grad_checkpoint = False
fp16 = True
gradient_checkpointing = True

but I get an OOM error.
So, how much memory is needed when training the model with vicuna-13b?
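
Not a definitive answer, but some context plus a sketch of the usual memory knobs: vicuna-13b in fp16 already needs roughly 13B × 2 bytes ≈ 26 GB just for the weights, so a single 40 GB A100 leaves little headroom for activations and optimizer state. The flag names below are the ones quoted in this issue; anything beyond them is an assumption about the config rather than a documented option.

# Memory-saving settings to try first; only the first three names appear in
# this issue, batch_size is a generic stand-in for whatever key sets the
# per-GPU batch in the stage-2 config.
use_grad_checkpoint = True       # recompute visual-encoder activations instead of caching them
gradient_checkpointing = True    # same idea on the language-model side
fp16 = True                      # keep half precision enabled
batch_size = 1                   # assumption: smallest per-GPU batch

If that still OOMs, the remaining options are usually sharding the 13B weights across several GPUs (e.g. DeepSpeed ZeRO) or switching to the 7B language model.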
