PKU-YuanGroup / Video-LLaVA
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Home Page: https://arxiv.org/pdf/2311.10122.pdf
License: Apache License 2.0
In multi-turn dialogue, it appears that the video is being passed in on every turn, leading to GPU out-of-memory crashes. How can this be resolved? Does the model support multiple turns of dialogue about the same video?
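A minimal sketch of one way to avoid re-injecting the video on every turn, following the pattern used in upstream LLaVA's CLI: the video placeholder token is prepended only to the first question, so later turns reuse the conversation history without adding another copy of the video features. The variables (conv, DEFAULT_X_TOKEN, model, tensor) are assumed to be set up as in the repo's inference demo; this is not the repo's official fix.

```python
# Hedged sketch: inject the video placeholder only on the first turn of a
# multi-turn chat; later turns keep the history but do not add the video again.
first_turn = True
while True:
    inp = input('User: ')
    if inp == 'exit':
        break
    if first_turn:
        inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp  # video token appears once per conversation
        first_turn = False
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    # ... tokenize and call model.generate(input_ids, images=[tensor, ['video']], ...) as usual,
    # then write the decoded answer back into the conversation so the next turn sees it:
    # conv.messages[-1][-1] = outputs
```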
completed_files: 0
incomplete_files: 72821
Error processing file 'v_video9632_17': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6975cac190>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
Error processing file 'v_video7010_0': Request timed out: HTTPSConnectionPool(host='api.chatanywhere.cn', port=443): Max retries exceeded with url: /v1/chat/completions (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f68e4f5fac0>, 'Connection to api.chatanywhere.cn timed out. (connect timeout=600)'))
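Not an official fix, but one workaround for these evaluation timeouts is to wrap the chat-completion request in retries with exponential backoff, so a single slow response does not abort a whole file. `send_request` below is a hypothetical stand-in for whatever function the evaluation script uses to issue the API call.

```python
# Hedged sketch: retry a flaky API call with exponential backoff instead of failing the file.
import time

def call_with_retries(send_request, *args, max_retries=5, base_delay=5.0, **kwargs):
    for attempt in range(max_retries):
        try:
            return send_request(*args, **kwargs)
        except Exception as exc:  # e.g. connection/read timeouts
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f'Request failed ({exc}); retrying in {delay:.0f}s')
            time.sleep(delay)
```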
Hi, have you tried using 336 as input shape? Will the result be better?
Hello! When I try to run multiple questions over multiple videos, there are some errors:
It seems that the current prompt doesn't contain the previous answer, and there's an error in llava/model/llava_arch.py.
Would you provide an inference demo for multiple questions and multiple videos? Thanks!
Also, here is my inference demo for multiple questions over multiple videos; please point out the problem.
```python
while True:
    try:
        video = input('Input video path: ')
    except:
        print('Input error, try again.')
        continue
    if video == 'exit':
        print('Goodbye!')
        break
    if not os.path.exists(video):
        print('{} does not exist, try again.'.format(video))
        continue
    video_tensor = video_processor(video, return_tensors='pt')['pixel_values']
    if type(video_tensor) is list:
        tensor = [v.to(model.device, dtype=torch.float16) for v in video_tensor]
    else:
        tensor = video_tensor.to(model.device, dtype=torch.float16)
    key = ['video']  # modality tag passed to model.generate (undefined in the snippet as originally posted)
    while True:
        try:
            inp = input('User: ')
        except:
            print('Input error, try again.')
            continue
        if inp == 'exit':
            break
        # print(f"{roles[1]}: {inp}")
        inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + inp
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        # Problem: with the same video and multiple questions, the prompt does not contain the
        # previous answer; when switching to a different video, the earlier questions are not cleared.
        prompt = conv.get_prompt()
        input_ids = tokenizer_X_token(prompt, tokenizer, X_TOKEN_INDEX['VIDEO'], return_tensors='pt').unsqueeze(0).cuda()
        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=[tensor, key],
                do_sample=True,
                temperature=0.1,
                max_new_tokens=1024,
                use_cache=True,
                stopping_criteria=[stopping_criteria])
        outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
        print(outputs)
```
If I read your paper correctly, you're only using 8 frames of a video? Would it be possible to specify which 8 frames to use instead of just using the first 8 frames?
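For reference, a rough sketch of how a chosen set of 8 frame indices could be decoded with decord (which the repo's dependencies use for video decoding). Actually making the LanguageBind video processor consume these indices instead of its default uniform sampling would still require editing the processor code, so treat this only as an illustration of obtaining the frames.

```python
# Hedged sketch: decode 8 uniformly spaced frames inside a chosen fraction of the video.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, num_frames=8, start_frac=0.0, end_frac=1.0):
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    lo, hi = int(total * start_frac), max(int(total * end_frac), 1)
    indices = np.linspace(lo, hi - 1, num_frames).astype(int)  # pick your own indices here
    frames = vr.get_batch(indices).asnumpy()                   # (num_frames, H, W, 3)
    return indices, frames
```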
"CLI Reasoning" example videos in reverse order :P
I found the code here: https://github.com/PKU-YuanGroup/Video-LLaVA/blob/main/llava/model/llava_arch.py#L326
It costs a lot of time when the batch size is set to 32.
Maybe changing the data sampling function so that each batch contains only image data or only video data would help increase training speed; a sketch follows below.
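To make that suggestion concrete, here is a rough, unofficial sketch of a batch sampler that keeps each batch single-modality. `modalities` (a per-sample 'image'/'video' flag list) is a hypothetical input that the dataset would need to expose; this is not the sampler the repo actually ships.

```python
# Hedged sketch: yield batches drawn from a single modality at a time so image
# and video samples are never mixed within one batch.
import random
from torch.utils.data import Sampler

class ModalityGroupedBatchSampler(Sampler):
    def __init__(self, modalities, batch_size, seed=0):
        self.batch_size = batch_size
        self.groups = {}
        for idx, m in enumerate(modalities):      # group sample indices by modality
            self.groups.setdefault(m, []).append(idx)
        self.rng = random.Random(seed)

    def __iter__(self):
        batches = []
        for indices in self.groups.values():
            idxs = indices[:]
            self.rng.shuffle(idxs)
            batches += [idxs[i:i + self.batch_size]
                        for i in range(0, len(idxs), self.batch_size)]
        self.rng.shuffle(batches)                 # interleave modality-pure batches
        return iter(batches)

    def __len__(self):
        return sum((len(v) + self.batch_size - 1) // self.batch_size
                   for v in self.groups.values())
```

It would be passed to the DataLoader via `batch_sampler=ModalityGroupedBatchSampler(...)` rather than `batch_size=`.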
I tried running the evaluation code on your model checkpoint, but I cannot reproduce the results noted in your paper.
Can you help me with this?
Below are the results that I got.
Video-LLaVA-7B
all 56.2 86.2 48.4
llava_bench_complex 71.8 85.4 61.2
llava_bench_conv 45.6 85.9 39.1
llava_bench_detail 39.8 88.0 35.0
Below are the results from the paper.
We used the same data as in the repo and the officially provided training weights, and also evaluated with GPT-3.5, but we only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. BTW, on the other three QA tasks we were able to obtain metrics similar to those in the paper. What could be the cause of this?
The Hugging Face inference endpoint isn't working with the Video-LLaVA model.
Here's the stack trace:
```
2023-12-05 22:27:45,803 | INFO | No custom pipeline found at /repository/handler.py
2023-12-05 22:27:45,803 | INFO | Initializing model from directory: /repository
2023-12-05 22:27:45 | INFO | Using device GPU
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 584, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 682, in startup
    await handler()
  File "/app/webservice_starlette.py", line 57, in some_startup_task
    inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
  File "/app/huggingface_inference_toolkit/handler.py", line 45, in get_inference_handler_either_custom_or_default_handler
    return HuggingFaceHandler(model_dir=model_dir, task=task)
  File "/app/huggingface_inference_toolkit/handler.py", line 17, in __init__
    self.pipeline = get_pipeline(model_dir=model_dir, task=task)
  File "/app/huggingface_inference_toolkit/utils.py", line 261, in get_pipeline
    hf_pipeline = pipeline(task=task, model=model_dir, device=device, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 705, in pipeline
    config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 998, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 710, in __getitem__
    raise KeyError(key)
KeyError: 'llava'
Application startup failed. Exiting.
```
I've run the requirements installation, but the model spits out random words when I run the video examples from gradio.
The image captioning works fine.
The hugging face demo for video captioning works fine as well.
Hello! I hit this error when preparing the environment on 3090 GPUs. Is it due to a low GPU driver version? My driver version is 470.141.03 with CUDA 11.7.
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
When I ask, "is there any <some-object> in the video?", the model always answers "yes".
A: They play table tennis very well; their fingers are nimble, and the ball never missed the target even once.
from:https://replicate.com/nateraw/video-llava?prediction=p4qnq7dbin6msnwk77b3odfjna
There are many different types of QA in the TGIF. Will you design prompts for each type? Or let the model randomly generate?
Thank you very much for your great work. When I run it locally, I got this message:
But different images give the same output result. Why?
What does the value you reported stand for? Did you calculate F1/accuracy for each split and then report the average, or did you merge the three splits into one and report the overall F1/accuracy?
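For what it's worth, the two schemes generally give different numbers when the splits are unbalanced; a toy illustration with made-up counts:

```python
# Hedged illustration with hypothetical numbers: per-split average (macro) vs.
# pooled accuracy over all splits (micro) differ when split sizes differ.
splits = {'what': (900, 1500), 'who': (400, 500), 'how': (30, 100)}  # (correct, total)
macro = sum(c / t for c, t in splits.values()) / len(splits)
micro = sum(c for c, _ in splits.values()) / sum(t for _, t in splits.values())
print(f'macro={macro:.3f}  micro={micro:.3f}')  # macro=0.567  micro=0.633
```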
Hi, sometimes the model generates repeated text until the maximum length is reached. How can this issue be dealt with? Thanks.
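One common mitigation, assuming the variables from the inference demo above, is to use the standard Hugging Face generation penalties; these are generic transformers options rather than anything specific to Video-LLaVA, and the values are only starting points.

```python
# Hedged sketch: generation knobs that usually curb degenerate repetition.
output_ids = model.generate(
    input_ids,
    images=[tensor, key],
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,   # penalize tokens that were already generated
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram
    max_new_tokens=512,
    use_cache=True,
    stopping_criteria=[stopping_criteria])
```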
A question about lines 364-367 in Video-LLaVA/llava/model/llava_arch.py.
In that comment, what does "one sample is one modal" mean? Does it correspond to one sample within a batch?
cur_X_idx += 1 ############## Note: this is skipped here. If one sample corresponds to one modality, then a single all-zero modality is skipped; if one sample corresponds to multiple modalities, the training logic here is wrong!!! ###### But it doesn't seem to affect inference on a single sample ###### One text corresponds to both a video and an image and goes straight down below; with only one text, it doesn't matter whether None or one/two all-zero tensors are passed in, since there is no next data item anyway. continue
How do I set up a single-round conversation without carrying over history information?
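A minimal sketch of one way to do this, assuming the conversation templates work as in upstream LLaVA (the conv_mode name and import path may differ in this repo): rebuild the template before every question so nothing is carried over.

```python
# Hedged sketch: a fresh conversation object per question means no history is kept.
from llava.conversation import conv_templates

def build_single_round_prompt(question, conv_mode='llava_v1'):
    conv = conv_templates[conv_mode].copy()           # fresh template, empty history
    inp = DEFAULT_X_TOKEN['VIDEO'] + '\n' + question  # video placeholder for this turn only
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    return conv.get_prompt()
```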
I notice that all of the finetuning bash scripts (finetune.sh, finetune_lora.sh, finetune_sqa.sh, and so on) contain this line:
IMPORTANT: this is the training script for the original LLaVA, NOT FOR LLaVA V1.5!
What does this line mean? Does it mean that I cannot simply run finetune.sh to finetune Video-LLaVA-7B?
Moreover, if I want to finetune the released Video-LLaVA-7B directly, could you give me a short guide/readme? Hope to hear from you.
Hi,
Have you tested your finetuned model on QA datasets that allow comparison with strong baseline models, such as BLIP-2?
Thank you for your great work! I tried some of the sample videos provided, but the model doesn't seem to support multi-turn conversations. The error is triggered when I try to send the second question. Does it only support a single round at this stage?
Good job! Thanks, guys!
can you provide instructions like this:
https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/docs/offline_demo.md
Thank you very much for your excellent work.
When I was using the pretraining script, I found that the --video_tower LanguageBind/LanguageBind_Video_merge and --image_tower LanguageBind/LanguageBind_Image repos on Hugging Face do not contain the configuration file for the image processor, which leads to code errors.
How can this problem be solved?
Thank you for the great work. Is it possible to provide a download link for the processed data other than Baidu Netdisk?
Dear author,
Thanks for releasing the amazing code. I'm trying to train the model on an A100 (40 GB).
I loaded the pre-trained mm_projector.bin and ran the finetune.sh script with video data.
However, even after decreasing per_device_train_batch_size to 1, I still get CUDA out of memory.
I noticed that the default setting is 16, so I wonder whether something is wrong.
Looking forward to hearing back from you.
Thanks!
Thanks so much! Google Colab offers a T4 with roughly 14 GB of GPU memory; is there a way to fit the model into that?
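Possibly, if this repo's model loader exposes the same quantization flags as upstream LLaVA; this is a guess at the call, not a confirmed API, and the exact signature of load_pretrained_model here may differ.

```python
# Hedged sketch: 4-bit loading (via bitsandbytes) brings a 7B model roughly into T4 range,
# assuming load_pretrained_model accepts a load_4bit flag as in upstream LLaVA.
from llava.model.builder import load_pretrained_model

tokenizer, model, processor, context_len = load_pretrained_model(
    'LanguageBind/Video-LLaVA-7B', None, 'video-llava-7b', load_4bit=True)
```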
Thanks for your great work. I have a question: the number of iterations is half of what it should be.
The dataset size is 665K + 98K; num_train_epochs=1.0; per_device_train_batch_size=16; gpus=8.
The iteration count should be (665K + 98K) * 1.0 / 16 / 8 ≈ 5.96K.
However, it shows 2.48K iterations.
Looking forward to your reply; thanks in advance.
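As a sanity check on the arithmetic: the Hugging Face Trainer reports optimizer steps, so any gradient_accumulation_steps greater than 1 shrinks the displayed count proportionally. Whether that explains the gap here depends on the actual script settings.

```python
# Hedged sanity check: optimizer steps per epoch with the HF Trainer.
dataset_size = 665_000 + 98_000
per_device_bs, num_gpus, grad_accum = 16, 8, 1
steps = dataset_size / (per_device_bs * num_gpus * grad_accum)
print(round(steps))  # ~5961 with grad_accum=1; roughly half that with grad_accum=2
```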
I used the local model-loading approach from your code and downloaded llava7b, LanguageBindImage, and LanguageBindVideo, but I got the following error:
Some weights of the model checkpoint at LanguageBind/Video-LLaVA-7B were not used when initializing LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'
Another piece of code also raised an error; the final output is:
LlavaLlamaForCausalLM: ['model.video_tower.video_tower.encoder.layers.19.temporal_layer_norm1.weight', 'model.image_tower.image_tower.encoder.layers.13.mlp.fc1.bias', 'model.video_tower.video_tower.encoder.layers.11.temporal_attn.k_proj.weight', 'model.image_tower.image_tower.encoder.layers.19.mlp.fc1.weight', 'model.video_tower.video_tower.encoder.layers.2.temporal_attn.k_proj.weight', 'model.video_tower.video_tower.encoder.layers.17.temporal_attn.k_proj.bias', 'model.image_tower.image_tower.encoder.layers.12.self_attn.out_proj.weight', 'model.video_tower.video_tower.encoder.layers.21.self_attn.out_proj.weight'
When I tried to deploy the project locally, I couldn't connect to Hugging Face, so I pre-downloaded LanguageBind_image, video-llava-7B, and LanguageBind_video_image, set model_path = 'local_path/LanguageBind/Video-LLaVA-7B/' in app.py,
and changed mm_image_tower and mm_video_tower in llava7B's config.json to local paths. But eventually I got this error at runtime:
local_path/LanguageBind/LanguageBind_Image does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//local_path/LanguageBind/LanguageBind_Image/main' for available files
I think this error means that the preprocessor_config.json file was not found in LanguageBind_Image. However, I did not find this file at the corresponding location on Hugging Face either. Did I make a mistake?
Here is my bash script; could you please take a look? Thanks!
DATA_ROOT="llava_all_image_video"
HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--model_name_or_path pretrain_models/vicuna-7b-v1.5 \
--version v1 \
--data_path ./train_anno.json \
--video_folder ${DATA_ROOT} \
--image_folder ${DATA_ROOT} \
--X "Video" "Image" \
--video_tower pretrain_models/LanguageBind_Video_merge \
--image_tower pretrain_models/LanguageBind_Image \
--pretrain_mm_mlp_adapter pretrain_models/Video-LLaVA-Pretrain-7B/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_x_start_end False \
--mm_use_x_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir ./checkpoints/Video-LLaVA-7B \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 1024 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--cache_dir "./cache_dir"
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.30 GiB (GPU 0; 39.59 GiB total capacity; 32.86 GiB already allocated; 4.76 GiB free; 32.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I tried replacing zero2.json with zero3.json in the pretraining stage, but the model hangs and cannot train the way it does with zero2. Is this normal?
Hello, how much compute is required to train this model?
Dear authors,
I wonder why mm_use_x_start_end is set to False.
If it is not used, the tokenizer will tokenize markers like <im_start> into multiple tokens; would that cause trouble?
Could you help explain this?
Thanks!
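For context, a marker like <im_start> only stays a single token if it is registered with the tokenizer; otherwise it is split into pieces. A generic transformers sketch follows (the token names mirror the usual LLaVA convention and are illustrative, not taken from this repo's config).

```python
# Hedged sketch: register the markers as special tokens so they are not split,
# then resize the embedding matrix to cover the new ids.
num_added = tokenizer.add_tokens(['<im_start>', '<im_end>'], special_tokens=True)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
print(tokenizer.tokenize('<im_start>'))  # -> ['<im_start>'] once registered
```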
Please support streaming responses.
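Until streaming lands in the repo, one generic way to get token-by-token output with transformers is TextIteratorStreamer; the generation kwargs below mirror the inference demo earlier on this page and assume the same model/tokenizer/tensor variables.

```python
# Hedged sketch: stream tokens from model.generate running on a background thread.
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
    inputs=input_ids, images=[tensor, ['video']], streamer=streamer,
    do_sample=True, temperature=0.1, max_new_tokens=1024, use_cache=True)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
    print(new_text, end='', flush=True)
thread.join()
```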
Thanks for your contribution!
I tried to reproduce your result (Zero-shot VideoQA on MSVD dataset) with the pretrained weight https://huggingface.co/LanguageBind/Video-LLaVA-7B/tree/main.
But the result is completely different from your paper.
(Reproduced result is shown below)
Can you check this?
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]Killed
killed
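A "Killed" message while loading checkpoint shards usually means the process ran out of host RAM and was stopped by the OS OOM killer. A generic, unofficial workaround is to load in half precision with low_cpu_mem_usage so a full fp32 copy is never materialized in CPU memory; the repo's own load_pretrained_model helper may already expose similar options, and the import path below is assumed from the repo layout.

```python
# Hedged sketch: reduce host-RAM pressure while loading the 7B checkpoint.
import torch
from llava.model import LlavaLlamaForCausalLM  # import path assumed from the repo layout

model = LlavaLlamaForCausalLM.from_pretrained(
    'LanguageBind/Video-LLaVA-7B',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True)
```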
From analyzing your training script, it seems to me that the pre-processed visual features you stored on Baidu's cloud storage have already been passed through the projection layer, is that correct? In other words, these pre-processed visual features are not the pixel-level data prior to extraction by the LanguageBind visual feature extractor mentioned in your paper. I am not sure whether my understanding is accurate, so please correct me if I'm wrong. Thank you.
Is there any way to run inference with Metal GPU offloading?
deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
File "/home/work/anaconda3/envs/videollava/lib/python3.10/site-packages/transformers/deepspeed.py", line 389, in deepspeed_load_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {checkpoint_path}")
ValueError: Can't find a valid checkpoint at ./checkpoints/Video-LLaVA-Pretrain-7B/checkpoint-1500
There are two files under this path: config.json and mm_projector.bin.
As I mentioned in a previous issue, I am unable to reproduce the MSVD Video-QA results, which are significantly different from the reported results in the paper displayed below.
Although you suggested referring to this issue, it is NOT relevant to my case because I tested this using your codebase (and, I checked it more than twice).
I am sharing my inference results on MSVD Video-QA (link), predicted with the Video-LLaVA-7B checkpoint you shared.
Could you please share your predicted results on MSVD Video-QA as well?
The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.
Could you confirm what "the service" refers to? Is it just the Gradio UI, or is the model itself under the Apache 2.0 license, which allows commercial use AFAIK?
The samples look like the dict below. Are they used just to finetune the LLM?
{'id': '740000',
'model': '',
'conversations': [{'from': 'human',
'value': '将下面的生活场景润色成动画片的对话,句数不少于8句,要求内容有教育意义\n家庭晚会\n爸爸:今晚我们举办家庭晚会,大家有什么节目表演吗?\n妈妈:我会唱一首歌,希望大家喜欢。\n哥哥:我会弹吉他,可以伴奏你唱歌。\n妹妹:我想跳一支舞,可以吗?'},
{'from': 'gpt',
'value': '(画面:家庭客厅,爸爸、妈妈、哥哥、妹妹四人围坐在沙发上)\n\n爸爸:今晚我们一家人一起举办家庭晚会,大家有什么节目表演吗?\n\n妈妈:我会唱一首歌,希望大家喜欢。 (妈妈微笑着说)\n\n哥哥:我会弹吉他,可以伴奏你唱歌。(哥哥翻起吉他,摆了一个酷酷的姿势)\n\n妹妹:我想跳一支舞,可以吗?(妹妹脸上带着期待的表情)\n\n爸爸:当然可以,你想跳哪一种舞呢?\n\n妹妹:我想跳芭蕾舞,因为我一直很喜欢看芭蕾舞演出。\n\n哥哥:那我们可以一起编排一个小节目,让妹妹跳芭蕾舞,我伴奏,妈妈唱歌。\n\n妹妹:太好了,我要开始练习了。(妹妹兴奋地跑出房间)\n\n爸爸:看到你们这么积极,我也想表演一个小魔术,让大家惊喜。\n\n妈妈:晚会不仅是娱乐,也是学习和交流的机会,我们可以通过互相表演,展示自己的才艺,同时也能增进家庭的感情,提高家人之间的沟通和合作能力。\n\n(画面:一家四口团结一致,笑容满面,享受着愉快的家庭晚会)\n\n教育意义:家庭晚会是一种增强家庭凝聚力、提高沟通能力和合作能力的好方法。通过家庭晚会,家人们可以分享自己的才艺、交流心得,加深彼此之间的了解和感情。在这个过程中,家长可以向孩子展示自己的兴趣和爱好,同时也可以从孩子身上学到新鲜事物,增加彼此之间的互动和交流。'}]
Dear authors,
Thanks for your great work, which is of great value to the community.
I would like to know how many GPUs (and of what type) you used to train the network, and roughly how long training took.
I am looking forward to hearing back from you.
Best