video-llava's People

Contributors: binzhu-ece, eltociear, jessytsui, linb203, nahidalam, nateraw

video-llava's Issues

[Q] Local demo: multi-turn dialogue problem

In a multi-turn dialogue, it appears that the video is being passed multiple times, leading to GPU out-of-memory crashes. How can this be resolved? Does the demo support multiple turns of dialogue about the same video?
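A possible workaround (not an official fix, just a sketch of the pattern): encode the video once, keep the half-precision tensor around, and append only new text to the conversation on later turns, so the video placeholder token is inserted in the first user message only. The imports, the `['video']` modality argument to `generate`, and the helper names below follow the repo's CLI/inference example and are assumptions, not verified against every version of the code.

```python
import torch
from llava.constants import DEFAULT_X_TOKEN, X_TOKEN_INDEX            # assumed repo layout
from llava.conversation import conv_templates, SeparatorStyle
from llava.mm_utils import tokenizer_X_token, KeywordsStoppingCriteria


def chat_about_one_video(model, tokenizer, video_processor, video_path,
                         questions, conv_mode="llava_v1"):
    """Multi-turn chat over a single video: encode the video once and reuse the tensor."""
    # Encode the video a single time; re-processing it every turn wastes GPU memory.
    tensor = video_processor(video_path, return_tensors="pt")["pixel_values"]
    tensor = tensor.to(model.device, dtype=torch.float16)

    conv = conv_templates[conv_mode].copy()
    answers = []
    for turn, question in enumerate(questions):
        # Only the first turn carries the <video> placeholder token.
        if turn == 0:
            question = DEFAULT_X_TOKEN["VIDEO"] + "\n" + question
        conv.append_message(conv.roles[0], question)
        conv.append_message(conv.roles[1], None)

        prompt = conv.get_prompt()
        input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX["VIDEO"],
                                      return_tensors="pt").unsqueeze(0).to(model.device)
        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        stopping = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=[tensor, ["video"]],   # modality argument assumed, as in the CLI example
                do_sample=True, temperature=0.1, max_new_tokens=512,
                use_cache=True, stopping_criteria=[stopping])

        answer = tokenizer.decode(output_ids[0, input_ids.shape[1]:],
                                  skip_special_tokens=True).strip()
        conv.messages[-1][-1] = answer        # record the answer so the next turn sees it
        answers.append(answer)
        torch.cuda.empty_cache()              # release temporary buffers between turns
    return answers
```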

bash scripts/v1_5/eval/eval_qa_msrvtt.sh

When running `bash scripts/v1_5/eval/eval_qa_msrvtt.sh`,
it reports the following errors.
Can anyone tell me how to fix them?

completed_files: 0
incomplete_files: 72821
Error processing file 'v_video9632_17': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6975cac190>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
Error processing file 'v_video7010_0': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f68e4f5fac0>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
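The timeouts come from the GPT evaluation service configured in the script (here an OpenAI-compatible endpoint at api.chatanywhere.cn), not from Video-LLaVA itself, so the usual remedies are checking connectivity to that endpoint, lowering the request concurrency, or wrapping the per-file API call in a retry loop. A generic, hedged retry sketch; `evaluate_one_file` is a hypothetical stand-in for whatever function in the eval script issues the chat-completion request:

```python
import random
import time


def with_retries(fn, *args, max_attempts=5, base_delay=5.0, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:                       # timeouts, rate limits, etc.
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            time.sleep(delay)


# usage inside the eval loop (evaluate_one_file is a hypothetical name):
# result = with_retries(evaluate_one_file, question, answer, prediction)
```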

How to batch evaluate in inference?

Hi, how can I make the inference code evaluate videos in a batch?
I naively concatenated the tensors along dimension 0 and got this error.

(screenshot of the error omitted)

Can you help me figure it out? Thanks.
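A hedged note rather than a confirmed fix: decoder-only generation also needs the prompts padded to a common length (left padding) in addition to stacking the video tensors along dim 0, and whether the checkpoint's multimodal `generate` path then accepts a batch of videos depends on the repo's input-preparation code; if it still errors, a per-sample loop is the reliable fallback. A minimal padding helper sketch:

```python
import torch


def left_pad_and_stack(input_id_list, pad_token_id):
    """Left-pad a list of 1-D input_ids tensors to a common length for batched generate().

    Left padding keeps each prompt adjacent to its generated continuation,
    which is what decoder-only models expect when batching.
    """
    max_len = max(ids.shape[0] for ids in input_id_list)
    padded, attention = [], []
    for ids in input_id_list:
        pad = torch.full((max_len - ids.shape[0],), pad_token_id, dtype=ids.dtype)
        padded.append(torch.cat([pad, ids], dim=0))
        attention.append(torch.cat([torch.zeros_like(pad), torch.ones_like(ids)], dim=0))
    return torch.stack(padded), torch.stack(attention)


# Video tensors would then be batched with torch.stack(list_of_video_tensors, dim=0),
# and the padded input_ids / attention_mask passed to model.generate alongside them.
```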

multiple questions for multiple videos

Hello! When I try to implement multiple questions for multiple videos, there are some errors:

It seems that the current prompt doesn't contain the previous answer, and there's an error in llava/model/llava_arch.py.
(screenshots of the errors omitted)

Would you provide an inference demo for multiple questions and multiple videos? Thanks!

Also, here is my multi-question inference demo for multiple videos; please point out the problem.

```python
while True:
    try:
        video = input('Input video path: ')
    except:
        print('Input error, try again.')
        continue
    else:
        if video == 'exit':
            print('Goodbye!')
            break
        if not os.path.exists(video):
            print('{} not exist, try again.'.format(video))
            continue

    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [video.to(model.device, dtype=torch.float16) for video in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)


    while True:
        try:
            inp = input('User: ')
        except:
            print('Input error, try again.')
            continue
        else:
            if inp == 'exit':
                break
            else:
                # print(f"{roles[1]}: {inp}")
                inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp
                conv.append_message(conv.roles[0], inp)
                conv.append_message(conv.roles[1], None)

                prompt = conv.get_prompt()  # For the same video with multiple questions, the prompt does not contain the previous answer; for a different video, earlier questions are not cleared
                input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['VIDEO'], return_tensors='pt').unsqueeze(0).cuda()
                stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
                keywords = [stop_str]
                stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

                with torch.inference_mode():
                    output_ids = model.generate(
                        input_ids,
                        images=[tensor, key],  # 'key' is defined elsewhere; presumably the modality list, e.g. ['video']
                        do_sample=True,
                        temperature=0.1,
                        max_new_tokens=1024,
                        use_cache=True,
                        stopping_criteria=[stopping_criteria])

                outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
                print(outputs)
```
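Two likely causes of the behaviour described in the comment above, judging only from this snippet: the generated answer is never written back into `conv`, so the next prompt lacks it, and `conv` is never re-created when a new video is entered, so earlier questions persist. A minimal patch sketch (the `conv_templates`/`conv_mode` names are assumed from the repo's CLI example):

```python
# 1) Re-create the conversation for every new video, e.g. right after the
#    video tensor is built in the outer loop:
conv = conv_templates[conv_mode].copy()

# 2) After decoding each answer, store it so the next turn's prompt includes it:
outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:],
                           skip_special_tokens=True).strip()
if outputs.endswith(stop_str):
    outputs = outputs[:-len(stop_str)].strip()
conv.messages[-1][-1] = outputs
```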

Cannot reproduce LLaVA-Bench.

I tried running the evaluation code on your model checkpoint, but I cannot reproduce the results noted in your paper.

Can you help me with this?

Below are the results that I got.

```
Video-LLaVA-7B
all                  56.2  86.2  48.4
llava_bench_complex  71.8  85.4  61.2
llava_bench_conv     45.6  85.9  39.1
llava_bench_detail   39.8  88.0  35.0
```

Below are the results from the paper.

(screenshot of the paper's reported results omitted)

Questions about reproducing TGIF-QA

We used the same data as in the repo and the officially provided training weights, and also evaluated with GPT-3.5, but only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. By the way, on the other three QA tasks we were able to obtain metrics similar to those in the paper. What could be the cause of this?

Inference Endpoint on Hugging Face

The Inference Endpoint on Hugging Face isn't working with the Video-LLaVA model.
Here's the stack trace:

```
2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | No custom pipeline found at /repository/handler.py
2023/12/05 17:27:45 ~ 2023-12-05 22:27:45,803 | INFO | Initializing model from directory:/repository
2023/12/05 17:27:45 ~ INFO | Using device GPU
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 17, in __init__
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
ch28b 2023-12-05T22:27:45.805Z
2023/12/05 17:27:45 ~ await handler()
2023/12/05 17:27:45 ~ await self._router.startup()
2023/12/05 17:27:45 ~ Application startup failed. Exiting.
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 705, in pipeline
2023/12/05 17:27:45 ~ self.pipeline = get_pipeline(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ async with self.lifespan_context(app) as maybe_state:
2023/12/05 17:27:45 ~ File "/app/webservice_starlette.py", line 57, in some_startup_task
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 682, in startup
2023/12/05 17:27:45 ~ hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
2023/12/05 17:27:45 ~ return HuggingFaceHandler(model_dir=model_dir, task=task)
2023/12/05 17:27:45 ~ inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 710, in __getitem__
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/handler.py", line 45, in get_inference_handler_either_custom_or_default_handler
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 584, in __aenter__
2023/12/05 17:27:45 ~ File "/app/huggingface_inference_toolkit/utils.py", line 261, in get_pipeline
2023/12/05 17:27:45 ~ raise KeyError(key)
2023/12/05 17:27:45 ~ config_class = CONFIG_MAPPING[config_dict["model_type"]]
2023/12/05 17:27:45 ~ config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
2023/12/05 17:27:45 ~ Traceback (most recent call last):
2023/12/05 17:27:45 ~ KeyError: 'llava'
2023/12/05 17:27:45 ~ File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 998, in from_pretrained
```
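The final KeyError: 'llava' means the transformers version inside the endpoint image has no 'llava' entry in its AutoConfig mapping, so the toolkit's generic pipeline() fallback cannot build the model. The usual options are a custom handler.py in the repository (so the generic pipeline is never used) or registering the custom classes before loading. A hedged sketch of the latter; the llava module path is assumed from the LLaVA-style repo layout and must be importable inside the endpoint image:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed module path; Video-LLaVA follows the LLaVA layout, where these
# classes live in llava/model/language_model/llava_llama.py.
from llava.model.language_model.llava_llama import LlavaConfig, LlavaLlamaForCausalLM

# Teach the Auto* classes about the custom "llava" model_type so that
# AutoConfig.from_pretrained / AutoModelForCausalLM.from_pretrained can resolve it.
AutoConfig.register("llava", LlavaConfig)
AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
```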

Video checkpoint is broken

I've installed the requirements, but the model spits out random words when I run the video examples from the Gradio demo.
Image captioning works fine.
The Hugging Face demo for video captioning works fine as well.

Unexpected error from cudaGetDeviceCount()

Hello! I hit this error while preparing the environment on 3090 GPUs. Is it due to a low GPU driver version? My driver version is 470.141.03 with CUDA 11.7.

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
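Error 804 ("forward compatibility was attempted on non supported HW") typically appears when the installed CUDA runtime is newer than what the driver supports natively (driver 470 tops out around CUDA 11.4) and the driver then attempts forward-compatibility mode, which is not available on GeForce cards such as the 3090. A driver upgrade, or a PyTorch build whose CUDA version the driver supports, is the usual remedy; this is a likely cause, not a confirmed diagnosis. A quick diagnostic:

```python
# Quick check of what PyTorch was built for vs. what the driver can initialize.
import torch

print("torch version:  ", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:  ", torch.cuda.device_count())
    print("device 0:      ", torch.cuda.get_device_name(0))
```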

always answer yes?

When I ask, "is there any <some-object> in the video?", the model always answers "yes".

This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint

Thank you very much for your great work. When I run it locally, I get this message:

  • This IS expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

But different images give the same output result. Why?

Repeated output.

Hi, sometimes the model generates repeated text until the maximum length. How can I deal with this issue? Thanks.
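Not an official recommendation, but Hugging Face's generate() exposes standard anti-repetition controls that can be added to the existing call. The surrounding variables (input_ids, tensor, stopping_criteria) and the modality argument are assumed to be set up as in the repo's inference example, and the exact values are a matter of tuning:

```python
import torch

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[tensor, ["video"]],      # modality argument assumed, as in the CLI example
        do_sample=True,
        temperature=0.2,
        top_p=0.9,
        repetition_penalty=1.1,          # down-weight tokens that were already generated
        no_repeat_ngram_size=3,          # forbid exact 3-gram repeats
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
```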

What does "one sample is one modal" mean?

Regarding lines 364-367 of Video-LLaVA/llava/model/llava_arch.py:
In this comment, what does "one sample is one modal" mean? Does it refer to one sample within a batch?
cur_X_idx += 1  # Note: it skips here. If one sample is one modal, this skips one all-zero modal; if one sample corresponds to multiple modals, the training logic here is wrong!!! But it doesn't seem to affect inference on a single sample. If one text corresponds to both a video and an image, execution falls straight through to the code below. With only one text, it doesn't matter whether None or one/two all-zero tensors are passed in, since there is no next data anyway. continue

How to finetune Video-LLaVA-7B on one specific task.

I notice that all the finetune scripts (finetune.sh, finetune_lora.sh, finetune_sqa.sh, and so on) contain this line:
IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
What does this line mean? Does it mean that I cannot simply run finetune.sh to finetune Video-LLaVA-7B?
Moreover, if I want to fine-tune the released Video-LLaVA-7B directly, could you give me a short guide/README? Hope to have a conversation with you.

Error of multi-turn conversations

Thank you for your great work! I tried some of the sample videos provided, but the model doesn't seem to support multi-turn conversations. The error is triggered when I try to send the second question. Does it only support a single round at this stage?

Some issues with pretrain

Thank you very much for your excellent work.
When I was using the pretrain script, I found that the --video_tower LanguageBind/LanguageBind_Video_merge and --image_tower LanguageBind/LanguageBind_Image repos on the HF Hub do not contain the configuration file for the image processor, which leads to code errors.
(screenshot of the error omitted)
How to solve this problem?

Instruction tuning on A100 (40G)?

Dear author,

Thanks for releasing the amazing code. I'm trying to train the model using A100 (40G).

I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data.
However, even after I decreased per_device_train_batch_size to 1, I still got CUDA out of memory.
I noticed that the default setting is 16.
So I wonder whether something is wrong?

Looking forward to hearing back from you.

Thanks!

The number of iterations is half what it should be

Thanks for your great work. I ran into a question: the number of iterations is half what it should be.

The dataset size is 665k + 98k; num_train_epochs=1.0; per_device_train_batch_size=16; gpus=8.
The iteration count should be (665k + 98k) / (16 * 8) ≈ 5.96k,
but the trainer shows 2.48k iterations.

Looking forward to your reply, thanks in advance
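One thing worth checking (a guess, not a confirmed diagnosis): the progress bar counts optimizer steps, so anything that enlarges the effective batch, most commonly gradient_accumulation_steps or a different world size picked up at launch, shrinks the displayed total. Recomputing with the values actually in effect makes a mismatch easy to spot:

```python
import math

num_samples = 665_000 + 98_000   # dataset size from the issue
per_device_bs = 16
num_gpus = 8
grad_accum = 1                   # substitute the value from your launch script

steps_per_epoch = math.ceil(num_samples / (per_device_bs * num_gpus * grad_accum))
print(steps_per_epoch)           # ~5.96k with grad_accum=1; halves if grad_accum=2
```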

Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing

I followed the local model loading approach from your code and downloaded llava7b, LanguageBindImage, and LanguageBindVideo, but I got the following error:
Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'
Another piece of code also reports an error; the final output is:
LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'

can not find preprocessor_config.json in LanguageBind_Image

When I tried to deploy the project locally, I couldn't connect to Hugging Face, so I pre-downloaded LanguageBind_Image, Video-LLaVA-7B, and LanguageBind_Video_Image locally, set model_path = 'local_path/LanguageBind/Video-LLaVA-7B/' in app.py,
and changed mm_image_tower and mm_video_tower in the 7B model's config.json to local paths. But eventually I got this error at runtime:
local_path/LanguageBind/LanguageBind_Image does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//local_path/LanguageBind/LanguageBind_Image/main' for available files
I think this error means that preprocessor_config.json was not found in LanguageBind_Image. However, I did not find this file in the corresponding location on Hugging Face either. Did I make a mistake?
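Whether LanguageBind_Image is supposed to ship preprocessor_config.json is for the maintainers to confirm, but one common cause of this error when working offline is a partial mirror that only grabbed the weight files. Pulling the complete repos with huggingface_hub on a machine that can reach the Hub avoids that; the target directory below is a placeholder:

```python
from huggingface_hub import snapshot_download

for repo in ("LanguageBind/Video-LLaVA-7B",
             "LanguageBind/LanguageBind_Image",
             "LanguageBind/LanguageBind_Video_merge"):
    snapshot_download(
        repo_id=repo,
        local_dir=f"local_path/{repo}",        # placeholder target directory
        local_dir_use_symlinks=False,          # copy real files for fully offline use
    )
```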

A100 * 4 OOM

With 4x A100 I still get OOM even with batch size 1, which doesn't seem right. What could be going on?

Here is my bash script; could someone please take a look? Thanks.

```bash
DATA_ROOT="llava_all_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path pretrain_models/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./train_anno.json \
    --video_folder ${DATA_ROOT} \
    --image_folder ${DATA_ROOT} \
    --X "Video" "Image" \
    --video_tower pretrain_models/LanguageBind_Video_merge \
    --image_tower pretrain_models/LanguageBind_Image \
    --pretrain_mm_mlp_adapter pretrain_models/Video-LLaVA-Pretrain-7B/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_x_start_end False \
    --mm_use_x_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Video-LLaVA-7B \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --cache_dir "./cache_dir"
```

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.30 GiB (GPU 0; 39.59 GiB total capacity; 32.86 GiB already allocated; 4.76 GiB free; 32.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
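As the error message itself suggests, the allocator hint below only helps when the failure is caused by fragmentation (reserved memory far above allocated); it will not rescue a run that genuinely needs more than 40 GB per GPU, where switching the --deepspeed config to a ZeRO-3/offload variant or using the LoRA finetuning script are the more usual routes. A minimal sketch of setting the hint:

```python
import os

# Must be set before PyTorch initializes CUDA (or exported in the shell before
# launching deepspeed: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set so the allocator picks it up
```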

Some questions about the preprocessed data

From analyzing your training script, it seems to me that the pre-processed visual features you stored on Baidu's cloud storage have already passed through the projection layer, is that correct? In other words, these pre-processed visual features are not the pixel-level data prior to extraction by the LanguageBind visual feature extractor mentioned in your paper. I am not sure whether my understanding is accurate, so please correct me if I'm wrong. Thank you.

Failure when resuming training from a checkpoint

```
deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)

File "/home/work/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/deepspeed.py", line 389, in deepspeed_load_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
ValueError: Can't find a valid checkpoint at ./checkpoints/Video-LLaVA-Pretrain-7B/checkpoint-1500
```
This path contains only two files: config.json and mm_projector.bin.
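For context (hedged, not verified against this exact run): a checkpoint that DeepSpeed can resume from normally contains a latest file and global_step*/ folders holding optimizer and engine state, and the pretraining stage may save only the projector weights at each checkpoint, which would explain why only config.json and mm_projector.bin are present. A quick inspection sketch:

```python
import os

ckpt = "./checkpoints/Video-LLaVA-Pretrain-7B/checkpoint-1500"   # path from the traceback
print(sorted(os.listdir(ckpt)))
# A DeepSpeed-resumable checkpoint usually looks like:
#   ['config.json', 'global_step1500', 'latest', 'trainer_state.json', ...]
# If only config.json and mm_projector.bin are present, the full training state
# was never saved, so --resume_from_checkpoint has nothing to restore.
```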

mistake in joint understanding

(screenshot omitted) I tried to have the model understand both a picture and a video, but there is a mistake: obviously, what the video shows is not the flag. In my view, after the shared projection layer the two vectors are very close in the shared feature space, so the LLM thinks "they are the same". Is this a bottleneck of the model, or does it just need some instruction tuning on the backend? If you have any ideas or improvements, please let me know; I would be very grateful. I am very interested in this topic.

Cannot reproduce Zero-shot Video-QA (MSVD)

As I mentioned in a previous issue, I am unable to reproduce the MSVD Video-QA results, which are significantly different from the reported results in the paper displayed below.

(screenshot of the paper's reported results omitted)

Although you suggested referring to this issue, it is NOT relevant to my case because I tested this using your codebase (and, I checked it more than twice).

I am sharing my inference results on MSVD Video-QA (link) predicted with the shared Video-LLaVA-7B checkpoint you provided.

Could you please share your predicted results on MSVD Video-QA as well?

licensing question

The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.

The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

Could you confirm what "the service" refers to? Is it the Gradio UI, or is the model itself under the Apache 2.0 license, which as far as I know allows commercial use?

Why are there pure NLP conversations in videochatgpt_llavaimage_tune.json?

Like the dict below, which is a Chinese text-only dialogue with no image or video attached. Is it just to fine-tune the LLM?

```python
{'id': '740000',
 'model': '',
 'conversations': [{'from': 'human',
   'value': '将下面的生活场景润色成动画片的对话,句数不少于8句,要求内容有教育意义\n家庭晚会\n爸爸:今晚我们举办家庭晚会,大家有什么节目表演吗?\n妈妈:我会唱一首歌,希望大家喜欢。\n哥哥:我会弹吉他,可以伴奏你唱歌。\n妹妹:我想跳一支舞,可以吗?'},
  {'from': 'gpt',
   'value': '(画面:家庭客厅,爸爸、妈妈、哥哥、妹妹四人围坐在沙发上)\n\n爸爸:今晚我们一家人一起举办家庭晚会,大家有什么节目表演吗?\n\n妈妈:我会唱一首歌,希望大家喜欢。 (妈妈微笑着说)\n\n哥哥:我会弹吉他,可以伴奏你唱歌。(哥哥翻起吉他,摆了一个酷酷的姿势)\n\n妹妹:我想跳一支舞,可以吗?(妹妹脸上带着期待的表情)\n\n爸爸:当然可以,你想跳哪一种舞呢?\n\n妹妹:我想跳芭蕾舞,因为我一直很喜欢看芭蕾舞演出。\n\n哥哥:那我们可以一起编排一个小节目,让妹妹跳芭蕾舞,我伴奏,妈妈唱歌。\n\n妹妹:太好了,我要开始练习了。(妹妹兴奋地跑出房间)\n\n爸爸:看到你们这么积极,我也想表演一个小魔术,让大家惊喜。\n\n妈妈:晚会不仅是娱乐,也是学习和交流的机会,我们可以通过互相表演,展示自己的才艺,同时也能增进家庭的感情,提高家人之间的沟通和合作能力。\n\n(画面:一家四口团结一致,笑容满面,享受着愉快的家庭晚会)\n\n教育意义:家庭晚会是一种增强家庭凝聚力、提高沟通能力和合作能力的好方法。通过家庭晚会,家人们可以分享自己的才艺、交流心得,加深彼此之间的了解和感情。在这个过程中,家长可以向孩子展示自己的兴趣和爱好,同时也可以从孩子身上学到新鲜事物,增加彼此之间的互动和交流。'}]}
```
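To see how common such records are, a quick count works. This assumes, as the sample above suggests, that text-only entries simply have no 'image'/'video' field; the field names follow typical LLaVA-style annotation files and may need adjusting:

```python
import json

with open("videochatgpt_llavaimage_tune.json") as f:
    records = json.load(f)

text_only = [r for r in records if "image" not in r and "video" not in r]
print(f"{len(text_only)} of {len(records)} records have no visual input")
```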

About training consumption.

Dear authors,

Thanks for your great work, which is of great value to the community.

I would like to know how many GPUs (and of what type) you used to train the network, and roughly how long training took.

I am looking forward to hearing back from you.

Best
