Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [ICCV 2023]

Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

*Joint first authors


Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably well against the state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost.

🚀 News

  • (July 13, 2023)
    • Training and evaluation code for Video-FocalNets, along with pretrained models, has been released.

Overview

Overall Architecture

(a) The overall architecture of Video-FocalNets: A four-stage architecture, with each stage comprising a patch embedding and a number of Video-FocalNet blocks. (b) A single Video-FocalNet block: It follows the structure of a transformer block, with self-attention replaced by Spatio-Temporal Focal Modulation.


The Spatio-Temporal Focal Modulation layer: A spatio-temporal focal modulation block that independently models the spatial and temporal information.
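
As a rough sketch of what such a layer computes, the general focal modulation form can be written with one spatial and one temporal branch. The notation is an assumption chosen for illustration and may differ from the paper's: q is the query projection, z^ℓ is the context aggregated by convolution at focal level ℓ, g^ℓ is the corresponding gate, h is a linear projection, and the subscripts s and t mark the spatial and temporal branches.

```latex
\[
y_i \;=\; q(x_i)\;\odot\;
h_s\!\Big(\sum_{\ell=1}^{L} g^{\ell}_{s,i}\, z^{\ell}_{s,i}\Big)\;\odot\;
h_t\!\Big(\sum_{\ell=1}^{L} g^{\ell}_{t,i}\, z^{\ell}_{t,i}\Big)
\]
```

The two gated sums are the aggregation step, and the element-wise products with q(x_i) are the interaction step. Aggregation therefore comes first, which is the reverse of the order used in self-attention.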

Comparison for Top-1 Accuracy vs GFlops/view on Kinetics-400.

Visualization: First and Last layer Spatio-Temporal Modulator

[Figures: spatio-temporal modulator visualizations for Cutting Apple, Scuba Diving, Threading Needle, Walking the Dog, and Water Skiing]

Environment Setup

Please follow INSTALL.md for installation.

Dataset Preparation

Please follow DATA.md for data preparation.

Model Zoo

Kinetics-400

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 79.8 | ckpt |
| Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 81.4 | ckpt |
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 83.6 | ckpt |
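
To make the Depth and Dim columns concrete, the three variants can be written down as plain data. This is a minimal sketch, not code from the repository: the `Variant` class and `VARIANTS` dictionary are hypothetical names, and the per-stage widths assume that the channel dimension doubles at every stage, as is common in Swin/FocalNet-style backbones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical container for one row of the model zoo table."""
    name: str
    depths: tuple   # number of blocks in each of the four stages
    dim: int        # embedding dimension of the first stage
    kernels: tuple  # values of the "Kernels" column

    def total_blocks(self) -> int:
        # Total number of Video-FocalNet blocks across all stages.
        return sum(self.depths)

    def stage_dims(self) -> list:
        # ASSUMPTION: the width doubles at every stage (not stated in this README).
        return [self.dim * 2 ** i for i in range(len(self.depths))]


VARIANTS = {
    "T": Variant("Video-FocalNet-T", (2, 2, 6, 2), 96, (3, 5)),
    "S": Variant("Video-FocalNet-S", (2, 2, 18, 2), 96, (3, 5)),
    "B": Variant("Video-FocalNet-B", (2, 2, 18, 2), 128, (3, 5)),
}

for key, variant in VARIANTS.items():
    print(key, variant.total_blocks(), variant.stage_dims())
```

Under these assumptions, T and S differ only in the depth of the third stage, while S and B differ only in width.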

Kinetics-600

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 86.7 | ckpt |

Something-Something-v2

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 71.1 | ckpt |

Diving-48

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 90.8 | ckpt |

ActivityNet-v1.3

| Model | Depth | Dim | Kernels | Top-1 (%) | Download |
|---|---|---|---|---|---|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 89.8 | ckpt |

Evaluation

To evaluate pretrained Video-FocalNets on your dataset:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main.py --eval \
--cfg <config-file> --resume <checkpoint> \
--opts DATA.NUM_FRAMES 16 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

For example, to evaluate Video-FocalNet-B with a single GPU on Kinetics-400:

python -m torch.distributed.launch --nproc_per_node 1 main.py --eval \
--cfg configs/kinetics400/video_focalnet_base.yaml --resume video-focalnet_base_k400.pth \
--opts DATA.NUM_FRAMES 16 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. In our experience and sanity checks, top-1 accuracy can vary by about ±0.3% when testing on different machines.
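
The TEST.NUM_CLIP and TEST.NUM_CROP options in the commands above suggest multi-view testing, where each video is scored on several temporal clips and spatial crops and the scores are combined. The sketch below illustrates one common way to do this, averaging softmax probabilities over all views. It is an assumption about the general technique, not a description of this repository's evaluation code, and `aggregate_views` is a hypothetical helper.

```python
import math

NUM_CLIP, NUM_CROP = 4, 3
num_views = NUM_CLIP * NUM_CROP  # 4 clips x 3 crops = 12 views per video


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_logits):
    """Average per-view class probabilities; return (top-1 index, mean probabilities)."""
    if not view_logits:
        raise ValueError("need at least one view")
    num_classes = len(view_logits[0])
    mean = [0.0] * num_classes
    for logits in view_logits:
        for c, p in enumerate(softmax(logits)):
            mean[c] += p / len(view_logits)
    top1 = max(range(num_classes), key=mean.__getitem__)
    return top1, mean


# Toy example: 11 views favour class 0, one view strongly favours class 1.
views = [[2.0, 1.0, 0.0]] * 11 + [[0.0, 5.0, 0.0]]
top1, mean_probs = aggregate_views(views)
print(top1, [round(p, 3) for p in mean_probs])
```

Averaging over views makes the prediction less sensitive to a single unrepresentative clip or crop, at the cost of one forward pass per view.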

Training

To train a Video-FocalNet on a video dataset from scratch, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use> main.py \
--cfg <config-file> --batch-size <batch-size-per-gpu> --output <output-directory> \
--opts DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. We also provide bash scripts to train Video-FocalNets on various datasets in the scripts directory.
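
The --opts arguments are written as a flat list of KEY VALUE pairs with dotted keys, which override values that would otherwise come from the config file. The following is a minimal sketch of how such a list can be merged into a nested configuration. The `apply_opts` helper and the dictionary layout are hypothetical and only mimic the style of the commands above; the repository's actual config loader may behave differently.

```python
import ast
import copy


def apply_opts(cfg, opts):
    """Return a copy of cfg with KEY VALUE overrides applied.

    Dotted keys such as "DATA.NUM_FRAMES" index into nested dictionaries.
    """
    if len(opts) % 2 != 0:
        raise ValueError("opts must be a flat list of KEY VALUE pairs")
    out = copy.deepcopy(cfg)
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = out
        for name in parents:
            node = node[name]  # raises KeyError for an unknown section
        if leaf not in node:
            raise KeyError(key)  # refuse to create keys the config does not define
        try:
            value = ast.literal_eval(raw)  # "16" -> 16, "[3, 5]" -> [3, 5]
        except (ValueError, SyntaxError):
            value = raw  # plain strings such as paths stay as they are
        node[leaf] = value
    return out


# Hypothetical defaults standing in for a config file.
defaults = {
    "DATA": {"ROOT": "", "TRAIN_FILE": "", "VAL_FILE": "", "NUM_FRAMES": 8},
    "TEST": {"NUM_CLIP": 1, "NUM_CROP": 1},
}
merged = apply_opts(defaults, [
    "DATA.NUM_FRAMES", "16",
    "DATA.ROOT", "path/to/root",
    "TEST.NUM_CLIP", "4",
])
print(merged["DATA"], merged["TEST"])
```

Rejecting unknown keys, as this sketch does, catches typos in option names early instead of silently ignoring them.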

Additionally, TRAIN.PRETRAINED_PATH can be set (in either the config file or the bash script) to initialize the weights from a pretrained model. To initialize from ImageNet-1K weights, refer to the FocalNets repository and download FocalNet-T-SRF, FocalNet-S-SRF, or FocalNet-B-SRF for Video-FocalNet-T, Video-FocalNet-S, or Video-FocalNet-B, respectively. Alternatively, one of the provided pretrained Video-FocalNet models can be used for initialization.

Citation

If you find our work, this repository, or the pretrained models useful, please consider giving the repository a star ⭐ and citing our paper.

@article{wasim2023videofocalnets,
    title={Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
    author={Syed Talal Wasim and Muhammad Uzair Khattak and Muzammal Naseer and Salman Khan and Mubarak Shah and Fahad Shahbaz Khan},
    journal={arXiv:2307.06947},
    year={2023}
}

Contact

If you have any questions, please create an issue on this repository or contact the authors at [email protected] or [email protected].

Acknowledgements

Our code is based on the FocalNets, XCLIP, and UniFormer repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.
