
vRGV: Visual Relation Grounding in Videos

This is the PyTorch implementation of our work at ECCV 2020 (Spotlight). The repository mainly includes 3 parts: (1) extracting RoI features; (2) training and inference; and (3) generating relation-aware trajectories.

Notes

Fixed an issue with unstable results [2021/10/07].

Environment

Anaconda 3, Python 3.6.5, PyTorch 0.4.1 (a higher version is fine once the features have been extracted), and CUDA >= 9.0. For other libraries, please refer to requirements.txt.

Install

Please create an environment for this project using Anaconda 3 (install Anaconda first):

>conda create -n envname python=3.6.5 # Create
>conda activate envname # Enter
>pip install -r requirements.txt # Install the provided libs
>sh vRGV/lib/make.sh # Set the environment for detection, make sure you have nvcc

Data Preparation

Please download the data here. The folder ground_data should be in the same directory as vRGV. Please merge the downloaded vRGV folder with this repo.

Please download the videos here and extract the frames into ground_data. The directory structure should look like: ground_data/vidvrd/JPEGImages/ILSVRC2015_train_xxx/000000.JPEG.

Usage

Feature Extraction. (This needs about 100 GB of storage, because all detected bboxes are dumped along with their features. The footprint can be greatly reduced by changing detect_frame.py to return only the top-40 bboxes and save them as .npz files; see the sketch below.)

./detection.sh 0 val #(or train)
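
A minimal sketch of that .npz change, assuming the detector yields per-frame arrays of boxes, features, and scores (the helper name and array shapes are illustrative, not the actual detect_frame.py interface):

import numpy as np

# Hypothetical helper: keep only the top-40 detections per frame and store
# them compressed, instead of pickling every detected bbox and feature.
def save_topk_detections(bboxes, features, scores, out_path, k=40):
    # bboxes: (N, 4), features: (N, D), scores: (N,) from the detector
    order = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    np.savez_compressed(out_path,
                        bboxes=bboxes[order],
                        features=features[order],
                        scores=scores[order])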

Sample video features:

cd tools
python sample_video_feature.py

Test. You can use our provided model to verify the features and the environment:

./ground.sh 0 val # Output the relation-aware spatio-temporal attention
python generate_track_link.py # Generate relation-aware trajectories with the Viterbi algorithm
python eval_ground.py # Evaluate the performance

You should get an accuracy of Acc_R: 24.58%.
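
For reference, below is a minimal sketch of Viterbi-style box linking in the spirit of generate_track_link.py (an illustrative reimplementation under assumed inputs, not the repo's actual code): each frame contributes per-box attention scores, transitions between consecutive frames are scored by IoU, and the best path forms a smooth, high-attention trajectory.

import numpy as np

def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def viterbi_link(boxes, scores):
    # boxes: list of (N_t, 4) arrays, one per frame
    # scores: list of (N_t,) attention scores, one per frame
    # returns one box index per frame forming the best-scoring smooth track
    dp = [np.asarray(scores[0], dtype=float)]
    back = []
    for t in range(1, len(boxes)):
        trans = np.array([[iou(p, c) for p in boxes[t - 1]]
                          for c in boxes[t]])  # (N_t, N_{t-1}) IoU matrix
        cand = dp[-1][None, :] + trans         # accumulated score + smoothness
        back.append(cand.argmax(axis=1))       # best predecessor per box
        dp.append(cand.max(axis=1) + np.asarray(scores[t], dtype=float))
    path = [int(dp[-1].argmax())]
    for b in reversed(back):                   # backtrack from the last frame
        path.append(int(b[path[-1]]))
    return path[::-1]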

Train. If you want to train the model from scratch, please apply a two-stage training scheme: 1) train a basic model without relation attendance, and 2) load the reconstruction part of the pre-trained model and train the whole model (with the same lr_rate). For implementation, turn [pretrain] off/on at line 52 of ground.py, and switch between lines 6 & 7 in ground_relation.py for 1st- and 2nd-stage training respectively. Also, change the model files at lines 69 & 70 of ground_relation.py to the best model obtained at the first stage before 2nd-stage training. (A sketch of the stage-2 warm start follows the command below.)

./ground.sh 0 train # Train the model with GPU id 0
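
A hedged sketch of the stage-2 warm start described above (the checkpoint path and the 'reconstruct' parameter-name prefix are assumptions; see lines 69 & 70 of ground_relation.py for the actual loading logic):

import torch

def warm_start_stage2(model, ckpt_path):
    # Copy only the reconstruction sub-module's weights from the best
    # stage-1 checkpoint; all other parameters keep their fresh init.
    state = torch.load(ckpt_path, map_location='cpu')
    recon = {k: v for k, v in state.items()
             if k.startswith('reconstruct')}  # assumed parameter-name prefix
    model.load_state_dict(recon, strict=False)
    return model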

The results may differ slightly (±0.5%). For comparison, please follow the results reported in our paper.

Result Visualization

Qualitative results (images omitted). Row 1 queries: bicycle-jump_beneath-person, person-feed-elephant, person-stand_above-bicycle, dog-watch-turtle. Row 2 queries: person-ride-horse, person-ride-bicycle, person-drive-car, bicycle-move_toward-car.

Citation

@inproceedings{xiao2020visual,
  title={Visual Relation Grounding in Videos},
  author={Xiao, Junbin and Shang, Xindi and Yang, Xun and Tang, Sheng and Chua, Tat-Seng},
  booktitle={European Conference on Computer Vision},
  pages={447--464},
  year={2020},
  organization={Springer}
}

License

NUS © NExT++


Issues

Inference Error

When I run python generate_track_link.py (the README may have a typo: generate_track_lick.py), I get the following error: KeyError: 'dog-larger-frisbee'. I found that this error is raised because the relation 'dog-larger-frisbee' exists in the vrelation_val.json file while the inference results do not contain it. For example, in the ground truth, ILSVRC2015_train_00005004 has three relations, "dog-stand_behind-frisbee", "dog-larger-frisbee", and "frisbee-front-dog", while the inference results contain only "dog-stand_behind-frisbee". Could you please offer some advice?
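
For anyone hitting the same KeyError, here is a minimal illustration of the mismatch with a defensive lookup (the data structures are simplified stand-ins, not the actual evaluation code):

# Ground truth lists three relations, but inference produced only one.
ground_truth = {'ILSVRC2015_train_00005004': ['dog-stand_behind-frisbee',
                                              'dog-larger-frisbee',
                                              'frisbee-front-dog']}
predictions = {'ILSVRC2015_train_00005004': {'dog-stand_behind-frisbee': 'track'}}

for vid, rels in ground_truth.items():
    for rel in rels:
        track = predictions.get(vid, {}).get(rel)  # None instead of KeyError
        if track is None:
            continue                               # count the relation as a miss
        # ... score `track` against the ground truth here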

About the feature extraction

Hi, when I run ./detection.sh to extract features, I obtain discontinuous results. For example, there are 560 frames in ILSVRC2015_train_00008004, but only 512 .pkl files are extracted, with the rest missing. Could you please offer some advice?

convert video to images

Hi, thanks for your interesting work! I want to follow your work and first reproduce your results. I wonder how to get the code that converts each raw video to images. Is it the same for the training and validation sets? Thank you!

Train Error

Thanks for sharing the code. I have one problem during training.

After extracting the features and merging both the training and validation sets into one directory, ../ground_data/vidvrd/frame_feature, the ./ground.sh 0 train command raises an error that some .pkl files are not found. I think this is due to different sampling strategies at the detection and training stages: in detection, dataloader/util.py uses samples = np.round(np.linspace(0, nframe-1, sample_num)) with sample_num = 512, while in training, dataloader/ground_loader.py: get_video_feature() uses sample_frames = np.round(np.linspace(0, frame_count - 1, self.frame_steps)) with self.frame_steps = 120. Not all of the sampled frames were pre-extracted. Could you please offer some advice?
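
A quick way to reproduce the mismatch described above (the frame count is taken from the issue; the two linspace calls mirror the quoted sampling lines):

import numpy as np

nframe = 560  # e.g. ILSVRC2015_train_00008004
detected = set(np.round(np.linspace(0, nframe - 1, 512)).astype(int))
sampled = set(np.round(np.linspace(0, nframe - 1, 120)).astype(int))
# Frame indices requested at training time but never extracted at detection time:
print(sorted(sampled - detected))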

faster_rcnn model path

Could you please provide the model checkpoint "models/pretrained_models/res101/coco/faster_rcnn_1_10_14657.pth" referenced in your code?

Can't reimplement the final result

I can't reproduce the final result (I only get 20, 14, 7 for S/O/All at threshold 0.5) reported in your paper. I converted videos to images with

ffmpeg -i xxx.mp4 -r video_fps -q:v 2 xxx/%06d.jpeg

Is there anything different from your setup?

I have 2 more questions:

  1. Why not shuffle the training set?
  2. It would be good to mention that the evaluation code only works with batch size = 1 at the inference stage.

CUDA issues for feature extraction

Hi, I am getting this error during feature extraction: "RuntimeError: CuDNN error: CUDNN_STATUS_EXECUTION_FAILED". I am using CUDA 11; is that an issue?

Cannot reimplement the result reported in the paper

Hi, thanks for your work and for proposing this interesting new task.
However, the final result I get by following the README instructions, shown below, is far from the one reported in the paper.

Acc_S Acc_O Acc_R
8.87  8.74  1.18

Since I changed a few places in order to run the code, I just want to ask whether any of my changes are wrong.
First, I did not install bert-score, as it is not compatible with torch 0.4.1. Can I know where bert-score is used, if it is crucial? I cannot find its usage.
Second, during feature extraction, I uncommented self.train() in

train_loader = '' # self.train()

and moved the extracted features from ../ground_data/frame_feature/ to ../ground_data/vidvrd/frame_feature/ based on my understanding, in order to run the training code successfully. I am not sure whether these two changes are correct and, if not, what changes I should make to run the code as intended.
I did not make any other changes besides the two above, yet I get the very low result shown.
Can I know how I should correct my setup so that I can reproduce the result in the paper?
Many thanks
