video-focalnets's Introduction

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [ICCV 2023]

Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

*Joint first authors

Website | Paper


Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention, which can model global context but at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, both the aggregation and interaction steps are implemented using efficient convolution and element-wise multiplication operations, which are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate that our parallel spatial and temporal encoding design is the optimal choice. Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost.
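
For intuition, below is a minimal, spatial-only PyTorch sketch of the focal modulation idea: context is aggregated with depth-wise convolutions over progressively larger receptive fields plus a global average, and the interaction is an element-wise product between a query projection and the aggregated modulator. This is a simplified illustration with assumed layer names and shapes, not the code from this repository.

```python
import torch
import torch.nn as nn

class FocalModulationSketch(nn.Module):
    """Simplified (spatial-only) focal modulation for a [B, C, H, W] feature map.

    Illustrative only: aggregation uses depth-wise convolutions over growing
    receptive fields, and interaction is an element-wise product between the
    query projection and the aggregated "modulator".
    """

    def __init__(self, dim, focal_levels=2, kernel_sizes=(3, 5)):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, kernel_size=1)            # query projection
        self.ctx = nn.Conv2d(dim, dim, kernel_size=1)          # context projection
        self.gates = nn.Conv2d(dim, focal_levels + 1, kernel_size=1)
        # Hierarchical aggregation: one depth-wise conv per focal level.
        self.levels = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim),
                nn.GELU(),
            )
            for k in kernel_sizes[:focal_levels]
        ])
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)          # output projection

    def forward(self, x):                                       # x: [B, C, H, W]
        q, ctx, gates = self.q(x), self.ctx(x), self.gates(x)
        modulator = 0
        for level, conv in enumerate(self.levels):
            ctx = conv(ctx)                                     # aggregation over a growing receptive field
            modulator = modulator + ctx * gates[:, level:level + 1]
        # Global context as the final focal level.
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        modulator = modulator + ctx_global * gates[:, -1:]
        # Interaction: element-wise multiplication instead of attention.
        return self.proj(q * modulator)


if __name__ == "__main__":
    feat = torch.randn(2, 96, 56, 56)                           # hypothetical stage-1 feature map
    print(FocalModulationSketch(96)(feat).shape)                # torch.Size([2, 96, 56, 56])
```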

🚀 News

  • (July 13, 2023): Training and evaluation code for Video-FocalNets, along with pretrained models, is released.

Overview

Overall Architecture

(a) The overall architecture of Video-FocalNets: a four-stage design, with each stage comprising a patch-embedding layer and a number of Video-FocalNet blocks. (b) A single Video-FocalNet block: structured like a transformer block, but with self-attention replaced by Spatio-Temporal Focal Modulation.
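
As a rough sketch of point (b), the block below follows the standard transformer-block skeleton (norm, token mixer, norm, MLP, with residual connections) but leaves a placeholder where the spatio-temporal focal modulation layer would act as the token mixer. The module names and tensor layout are illustrative assumptions, not the repository's implementation.

```python
import torch
import torch.nn as nn

class VideoFocalNetBlockSketch(nn.Module):
    """Illustrative block layout: the usual transformer-block skeleton, with the
    attention sub-layer replaced by a (placeholder) spatio-temporal focal
    modulation module acting as the token mixer."""

    def __init__(self, dim, mlp_ratio=4, mixer=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer if mixer is not None else nn.Identity()  # spatio-temporal focal modulation goes here
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                     # x: [B, num_tokens, dim] (channels-last tokens)
        x = x + self.mixer(self.norm1(x))     # token mixing with a residual connection
        x = x + self.mlp(self.norm2(x))       # per-token MLP with a residual connection
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 8 * 56 * 56, 96)  # hypothetical 8-frame, 56x56-token stage
    print(VideoFocalNetBlockSketch(96)(tokens).shape)
```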



The Spatio-Temporal Focal Modulation layer: A spatio-temporal focal modulation block that independently models the spatial and temporal information.
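
To make the parallel spatial/temporal design concrete, here is a simplified PyTorch sketch: a depth-wise 2D branch aggregates context within each frame, a depth-wise 1D branch aggregates context across frames at every spatial location, and the query is modulated by both. The single-level aggregation and the multiplicative fusion of the two modulators are assumptions made purely for illustration; see the paper and code for the exact formulation.

```python
import torch
import torch.nn as nn

class SpatioTemporalFocalModulationSketch(nn.Module):
    """Illustrative parallel spatial/temporal modulation for a [B, T, C, H, W] clip."""

    def __init__(self, dim, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)                            # shared query projection
        self.spatial_ctx = nn.Conv2d(dim, dim, spatial_kernel,
                                     padding=spatial_kernel // 2, groups=dim)
        self.temporal_ctx = nn.Conv1d(dim, dim, temporal_kernel,
                                      padding=temporal_kernel // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                       # x: [B, T, C, H, W]
        B, T, C, H, W = x.shape
        q = self.q(x.permute(0, 1, 3, 4, 2))                    # [B, T, H, W, C]

        # Spatial branch: aggregate context within each frame.
        m_s = self.spatial_ctx(x.reshape(B * T, C, H, W))
        m_s = m_s.reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # Temporal branch: aggregate context across frames at each spatial location.
        xt = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
        m_t = self.temporal_ctx(xt).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        out = self.proj(q * m_s * m_t)                          # interaction by element-wise multiplication
        return out.permute(0, 1, 4, 2, 3)                       # back to [B, T, C, H, W]


if __name__ == "__main__":
    clip = torch.randn(2, 8, 96, 56, 56)                        # hypothetical 8-frame clip features
    print(SpatioTemporalFocalModulationSketch(96)(clip).shape)
```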

Comparison of Top-1 Accuracy vs. GFLOPs/view on Kinetics-400.

Visualization: First- and Last-Layer Spatio-Temporal Modulator

Visualization Cutting Apple

Visualization Scuba Diving

Visualization Threading Needle

Visualization Walking the Dog

Visualization Water Skiing

Environment Setup

Please follow INSTALL.md for installation.

Dataset Preparation

Please follow DATA.md for data preparation.

Model Zoo

Kinetics-400

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 79.8 | ckpt |
| Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 81.4 | ckpt |
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 83.6 | ckpt |

Kinetics-600

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 86.7 | ckpt |

Something-Something-v2

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 71.1 | ckpt |

Diving-48

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 90.8 | ckpt |

ActivityNet-v1.3

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 89.8 | ckpt |

Evaluation

To evaluate pre-trained Video-FocalNets on your dataset:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use>  main.py  --eval \
--cfg <config-file> --resume <checkpoint> \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

For example, to evaluate Video-FocalNet-B on Kinetics-400 with a single GPU:

python -m torch.distributed.launch --nproc_per_node 1  main.py  --eval \
--cfg configs/kinetics400/video_focalnet_base.yaml --resume video-focalnet_base_k400.pth \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. In our experience and sanity checks, there is a random variation of about ±0.3% top-1 accuracy when testing on different machines.
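
For reference, TEST.NUM_CLIP 4 and TEST.NUM_CROP 3 correspond to the common 4-clip × 3-crop protocol, i.e. 12 views per video whose predictions are typically averaged before taking the arg-max. A minimal sketch of that aggregation (the model and the view tensor are hypothetical placeholders, not this repository's evaluation loop) is:

```python
import torch

@torch.no_grad()
def multi_view_accuracy(model, views, labels):
    """Average predictions over all views of each video (illustrative only).

    views:  [num_videos, num_views, T, C, H, W]  -- e.g. num_views = 4 clips x 3 crops = 12
    labels: [num_videos]
    """
    num_videos, num_views = views.shape[:2]
    logits = model(views.flatten(0, 1))                  # [num_videos * num_views, num_classes]
    logits = logits.reshape(num_videos, num_views, -1)
    avg = logits.softmax(dim=-1).mean(dim=1)             # average the views per video
    return (avg.argmax(dim=-1) == labels).float().mean().item()
```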

Additionally, TRAIN.PRETRAINED_PATH can be set (either in the config file or in the bash script) to provide a pretrained model for weight initialization. To initialize from ImageNet-1K weights, please refer to the FocalNets repository and download FocalNet-T-SRF, FocalNet-S-SRF, or FocalNet-B-SRF to initialize Video-FocalNet-T, Video-FocalNet-S, or Video-FocalNet-B, respectively. Alternatively, one of the provided pretrained Video-FocalNet models can also be used to initialize the weights.

Training

To train a Video-FocalNet on a video dataset from scratch, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use>  main.py \
--cfg <config-file> --batch-size <batch-size-per-gpu> --output <output-directory> \
--opts DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. We also provide bash scripts to train Video-FocalNets on various datasets in the scripts directory.

Additionally, TRAIN.PRETRAINED_PATH can be set (either in the config file or in the bash script) to provide a pretrained model for weight initialization. To initialize from ImageNet-1K weights, please refer to the FocalNets repository and download FocalNet-T-SRF, FocalNet-S-SRF, or FocalNet-B-SRF to initialize Video-FocalNet-T, Video-FocalNet-S, or Video-FocalNet-B, respectively. Alternatively, one of the provided pretrained Video-FocalNet models can also be used to initialize the weights.
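
If you would rather load the 2D ImageNet-1K FocalNet weights manually instead of through TRAIN.PRETRAINED_PATH, a generic partial-loading sketch is shown below. The checkpoint file name and the "model" key are assumptions; shape-mismatched or missing keys (e.g. temporal-specific layers) are simply left at their random initialization.

```python
import torch

def load_pretrained_sketch(video_model, ckpt_path="focalnet_tiny_srf.pth"):
    """Illustrative partial loading of 2D FocalNet weights into a Video-FocalNet."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)                    # some checkpoints nest weights under "model"
    own = video_model.state_dict()
    # Keep only tensors that exist in the video model with matching shapes.
    matched = {k: v for k, v in state.items()
               if k in own and v.shape == own[k].shape}
    missing = set(own) - set(matched)
    video_model.load_state_dict(matched, strict=False)
    print(f"loaded {len(matched)} tensors, {len(missing)} left at random init")
    return video_model
```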

Citation

If you find our work, this repository, or the pretrained models useful, please consider giving it a star ⭐ and citing our work.

@InProceedings{Wasim_2023_ICCV,
    author    = {Wasim, Syed Talal and Khattak, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad Shahbaz},
    title     = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2023},
}

Contact

If you have any questions, please create an issue on this repository or contact us at [email protected] or [email protected].

Acknowledgements

Our code is based on the FocalNets, XCLIP, and UniFormer repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.

video-focalnets's People

Contributors

dependabot[bot], muzairkhattak, talalwasim


video-focalnets's Issues

Visualization Code

Dear Authors,

Thank you for releasing this excellent codebase. Could you please point out in the repository, or post, the code used to generate the last-layer visualizations that show the model's focus regions?

Dependency problems

Hello! Thank you for your wonderful work! I was trying to use your model for my action recognition task in Google Colab, but ran into dependency problems.
I couldn't use conda, so I changed all "conda" commands to their "pip" equivalents, e.g. "conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch" -> "pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu113". When I ran the script, I hit an error importing mmcv, and I couldn't figure out which version of mmcv is required.
Can you help me with these problems? Thank you!

Reproduce K400 with Video-FocalNet-T

Hello. Thank you for sharing this great work. I was trying to reproduce the evaluation results on the K400 test set using Video-FocalNet-T: the README reports a top-1 accuracy of 79.8%, but I'm getting around 65%.

model: video-focalnet-t
clip: 4
frame_rate: 4
input_frame: 8
input_size: 224

Here is the code we used to evaluate Video-FocalNet on the K400 test set (19,796 samples). Note that no cropping is used in this code, though I have also tried crop=3 with no luck. Either way, the gap between the reported and computed scores is too large.
