
videotree's Introduction

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

This is the official implementation of VideoTree.

Project Website | arXiv

University of North Carolina at Chapel Hill

We introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. Specifically, VideoTree dynamically extracts query-related information from the input video and builds a tree-based video representation for LLM reasoning.

[Teaser image]

[Visualization image]

Installation

Install environment.

Python 3.8 or above is required.

git clone https://github.com/Ziyang412/VideoTree.git
cd VideoTree

python3 -m venv videotree_env
source videotree_env/bin/activate
pip install openai
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pandas
pip install transformers==4.28.1
pip install accelerate

Download dataset annotations and extracted captions.

Download data.zip from the link provided by LLoVi.

unzip data.zip

This provides the extracted captions for EgoSchema in ./data, along with the dataset annotations.

Specifically, the LaViLa base model is used to extract EgoSchema captions at 1 FPS.

Download EgoSchema Videos. Please follow EgoSchema to download the original EgoSchema videos. After downloading, extract the videos into 1 FPS video frames (saved in image format for faster loading) in the format ./data/egoschema_frames/{video_id}/{frame_id}.jpg. Then, to further speed up the tree-building process, extract the visual features for each frame using EVA-CLIP-8B and save them to ./data/egoschema_features/{video_id}.pt.

python data_extraction/extract_images.py
python data_extraction/extract_features.py
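
For reference, here is a minimal sketch of what these two steps amount to. It is illustrative only: it assumes ffmpeg is on your PATH, and encode_fn is a hypothetical stand-in for whatever EVA-CLIP-8B image encoder wrapper you use, not something this repo provides.

import os
import subprocess
import torch

def extract_frames_1fps(video_path, out_dir):
    # Decode a video into 1 FPS JPEG frames: out_dir/{frame_id}.jpg
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1",
         os.path.join(out_dir, "%04d.jpg")],
        check=True,
    )

def save_frame_features(frame_dir, feature_path, encode_fn):
    # Encode every frame and save a (num_frames, dim) tensor to {video_id}.pt.
    # encode_fn(path) -> 1D feature tensor is an assumed interface.
    frames = sorted(os.listdir(frame_dir))
    feats = torch.stack([encode_fn(os.path.join(frame_dir, f)) for f in frames])
    torch.save(feats, feature_path)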

Update Kmeans-pytorch

Since the original kmeans-pytorch package doesn't set an iteration limit and can therefore loop forever when the centers fail to converge, we updated the init file of the original kmeans-pytorch package.

git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch

Please replace the init file in the "kmeans_pytorch" folder with the file we provide in the "./kmeans_pytorch" folder of this repo, then run the following command.

pip install --editable .
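
For context, the essence of the patch is an explicit upper bound on the Lloyd iterations, so a non-converging center_shift (e.g. NaN) cannot spin forever. The following is an illustrative sketch of such a guard, not the exact patched file:

import torch

def kmeans_with_cap(X, num_clusters, tol=1e-4, iter_limit=100):
    # Random initial centers drawn from the data points.
    centers = X[torch.randperm(X.size(0))[:num_clusters]].clone()
    labels = torch.zeros(X.size(0), dtype=torch.long)
    for _ in range(iter_limit):  # the cap that prevents an infinite loop
        # Assign each point to its nearest center.
        labels = torch.cdist(X, centers).argmin(dim=1)
        new_centers = torch.stack([
            X[labels == k].mean(dim=0) if (labels == k).any() else centers[k]
            for k in range(num_clusters)
        ])
        center_shift = (new_centers - centers).norm(dim=1).sum()
        centers = new_centers
        if center_shift ** 2 < tol:  # converged before hitting the cap
            break
    return labels, centers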

Future plans

Due to time constraints, we are still updating the codebase. We will also incorporate the scripts/captions for NeXT-QA and IntentQA in the future.

Experiments

Adaptive Breadth Expansion

Please update the feature path, the args (in util.py), and the output path before running the code.

sh scripts/breath_expansion.sh
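
At a high level, this step clusters the per-frame features into first-level tree nodes and keeps widening (more clusters) until the LLM judges the selected keyframes sufficiently relevant to the query. Below is a simplified sketch using the patched kmeans_pytorch API; relevance_fn and cluster_schedule are illustrative assumptions, not the script's actual interface:

import torch
from kmeans_pytorch import kmeans  # the patched fork installed above

def breadth_expansion(features, relevance_fn, cluster_schedule=(8, 16, 32)):
    # features: (num_frames, dim) tensor of per-frame embeddings.
    # relevance_fn: assumed stand-in for the LLM call that judges whether
    # the current keyframes cover the query well enough.
    for num_clusters in cluster_schedule:
        labels, centers = kmeans(
            X=features, num_clusters=num_clusters,
            distance="euclidean", device=features.device,
        )
        # Pick the frame nearest each centroid as that node's keyframe.
        keyframes = torch.cdist(centers.to(features.device), features).argmin(dim=1)
        if relevance_fn(keyframes):
            break
    return labels, keyframes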

Relevance-based Depth Expansion

Please update the feature path, the outputs of the previous step (the relevance output path and the first-level cluster information), and the output path before running the code.

python depth_expansion.py
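
Conceptually, depth expansion re-clusters the frames of highly relevant first-level clusters into child nodes, so relevant segments get finer temporal granularity. A simplified sketch follows; the relevance scores, branching factor, and return format are assumptions for illustration:

import torch
from kmeans_pytorch import kmeans

def depth_expansion(features, labels, relevance, branch=4, high=3):
    # relevance: per-cluster scores from the previous step (assumed integer
    # scale, higher = more relevant). Returns {cluster_id: (frame_idx, child_labels)}.
    children = {}
    for cid, score in enumerate(relevance):
        member_idx = (labels == cid).nonzero(as_tuple=True)[0]
        # Only drill into clusters marked highly relevant that hold
        # enough frames to be worth splitting.
        if score >= high and member_idx.numel() > branch:
            child_labels, _ = kmeans(
                X=features[member_idx], num_clusters=branch,
                distance="euclidean", device=features.device,
            )
            children[cid] = (member_idx, child_labels)
    return children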

LLM Reasoning

Please update the tree node index file (output of the previous step), the data files, and the output path before running the code.

sh scripts/egoschema_qa.sh
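
The reasoning step feeds the captions of the selected tree nodes to the LLM as a multiple-choice QA prompt. Here is a minimal sketch with the OpenAI Python client (>=1.0); the prompt wording is illustrative, not the repo's exact template:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_question(captions, question, options, model="gpt-4-1106-preview"):
    # Compose node captions plus the question into one prompt and
    # ask the model to answer with a single option letter.
    prompt = (
        "You are given captions of keyframes from a long video.\n"
        + "\n".join(captions)
        + f"\n\nQuestion: {question}\nOptions:\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
        + "\nAnswer with a single letter."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()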

Debug

--save_info: save more information, e.g. token usage, detailed prompts, etc.
--num_examples_to_run: how many examples to run. -1 (default) to run all.
--start_from_scratch: ignore existing output files. Start from scratch.

Acknowledgments

We thank the developers of LLoVi, LifelongMemory, EVA-CLIP, kmeans-pytorch, and scikit-learn clustering for their public code releases. We also thank the authors of VideoAgent for the helpful discussion.

Reference

Please cite our paper if you use our models in your work:

@article{wang2024videotree,
  title={VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos},
  author={Wang, Ziyang and Yu, Shoubin and Stengel-Eskin, Elias and Yoon, Jaehong and Cheng, Feng and Bertasius, Gedas and Bansal, Mohit},
  journal={arXiv preprint arXiv:2405.19209},
  year={2024}
}


videotree's Issues

The scores on the EgoSchema subset are lower than those in the paper.

Hello,
Thank you for your excellent work!
I ran the code from the repository on the EgoSchema subset using the GPT-4-1106-preview model. However, the final score was more than 3 points lower than the score reported in the paper. Are there any parameter settings I might be missing?

KMeans Function Enters Infinite Loop in width_expansion.py with center_shift=nan

Thank you for open-sourcing this excellent work. I am currently reproducing your project and encountered an issue while running the width_expansion.py script. The KMeans function enters an infinite loop, and the output shows center_shift=nan. However, using the KMeans function from the sklearn.cluster library does not produce this error, though the clustering results differ.

Here is the feature pt file that got error:
https://drive.google.com/file/d/1z1yaiLcPUSI8Uh2DvcrFpVQ6cBWZSDFg/view?usp=drive_link

Could you please let me know if you have encountered a similar issue before and provide any guidance on how to resolve this?

Thank you for your assistance.

Issue with Missing "Frame Relevance" Content in LLM Response

Thank you for open-sourcing this excellent work. I am currently reproducing your project and encountered an issue with a specific part of the code.

The function update_relevance_response is supposed to extract relevance information from the response text using the following code:

def update_relevance_response(text):
    response = text
    # Extract the bracketed list after "frame relevance:", e.g. "frame relevance: [1, 0, 2]"
    relevance_match = re.search(r"frame relevance: \[([0-9, ]+)\]", response)
    if relevance_match:
        relevance = list(map(int, relevance_match.group(1).split(',')))
        return relevance
However, my response does not contain any "frame relevance" content, which is expected to be provided by the LLM. The specific response I received is:
'prediction: C, center_shift=0.000000, iteration=6, tol=0.000100] explanation: The actions described involve cleaning kitchen items such as'

Without this information, the function does not work as intended and raises this error: UnboundLocalError: local variable 'relevance' referenced before assignment.

Have you encountered a similar issue before, where the LLM response lacks the "frame relevance" content? If so, could you provide any guidance or suggestions on how to ensure that the LLM includes this information in its response?
