Learning Prompt-Enhanced Context features for Weakly-Supervised Video Anomaly Detection

Authors: Yujiang Pu, Xiaoyu Wu, Shengjin Wang

Abstract

Video anomaly detection under weak supervision is challenging due to the absence of frame-level annotations during the training phase. Previous work has employed graph convolution networks or self-attention mechanisms to model temporal relations, along with multiple instance learning (MIL)-based classification loss to learn discriminative features. However, most of them utilize multi-branches to capture local and global dependencies separately, leading to increased parameters and computational cost. Furthermore, the binarized constraint of the MIL-based loss only ensures coarse-grained interclass separability, ignoring fine-grained discriminability within anomalous classes. In this paper, we propose a weakly supervised anomaly detection framework that emphasizes efficient context modeling and enhanced semantic discriminability. To this end, we first construct a temporal context aggregation (TCA) module that captures complete contextual information by reusing similarity matrix and adaptive fusion. Additionally, we propose a prompt-enhanced learning (PEL) module that incorporates semantic priors into the model by utilizing knowledge-based prompts, aiming at enhancing the discriminative capacity of context features while ensuring separability between anomaly sub-classes. Furthermore, we introduce a score smoothing (SS) module in the testing phase to suppress individual bias and reduce false alarms. Extensive experiments demonstrate the effectiveness of various components of our method, which achieves competitive performance with fewer parameters and computational effort on three challenging benchmarks: the UCF-crime, XD-violence, and ShanghaiTech datasets. The detection accuracy of some anomaly sub-classes is also improved with a great margin.

[pdf] [supp] [video]

Contents

1. Introduction
2. Requirements
3. Datasets
4. Quick Start
5. Results and Models
6. Acknowledgement
7. Citation

Introduction

This repo is the official implementation of "Learning Prompt-Enhanced Context features for Weakly-Supervised Video Anomlay Detection" (under review). The original paper can be found here. We also submitted a supplementary document with a demo video for peer review. Please feel free to contact me if you have any questions.

Requirements

The code requires python>=3.8 and the following packages:

torch==1.8.0
torchvision==0.9.0
numpy==1.21.2
scikit-learn==1.0.1
scipy==1.7.2
pandas==1.3.4
tqdm==4.63.0
xlwt==2.5

The environment with required packages can be created directly by running the following command:

conda env create -f environment.yml

Datasets

For the UCF-Crime and XD-Violence datasets, we use off-the-shelf features extracted by Wu et al. For the ShanghaiTech dataset, we used this repo to extract I3D features (highly recommended:+1:).

Dataset	Origin Video	I3D Features
UCF-Crime	homepage	download link
XD-Violence	homepage	download link
ShanghaiTech	homepage	download link

Before the Quick Start, please download above features and change feat_prefix in config.py to your local path.

Quick Start

Please change the hyperparameters in config.py if necessary, where we keep default settings as mentioned in our paper. The example of configs for UCF-Crime is shown as follows:

dataset = 'ucf-crime'
model_name = 'ucf_'
metrics = 'AUC'  # the evaluation metric
feat_prefix = '/data/pyj/feat/ucf-i3d'  # the prefix path of the video features
train_list = './list/ucf/train.list'  # the split file of training set
test_list = './list/ucf/test.list'  #  the split file of test/infer set
token_feat = './list/ucf/ucf-prompt.npy'  # the prompt feature extracted by CLIP
gt = './list/ucf/ucf-gt.npy'  # the ground-truth of test videos

# TCA settings
win_size = 9  # the local window size
gamma = 0.6  # initialization for DPE
bias = 0.2  # initialization for DPE 
norm = True  # whether adaptive fusion uses normalization

# CC settings
t_step = 9  # the kernel size of causal convolution

# training settings
temp = 0.09  # the temperature for contrastive learning
lamda = 1  # the loss weight
seed = 9  # random seed

# test settings
test_bs = 10  # test batch size
smooth = 'slide'  # the type of score smoothing ['None', 'fixed': 10, slide': 7]
kappa = 7  # the smoothing window
ckpt_path = './ckpt/ucf__8636.pkl'

Run the following command for training:

python main.py --dataset 'ucf' --mode 'train'  # dataset:['ucf', 'xd', 'sh']  mode:['train', 'infer']

Run the following command for test/inference:

python main.py --dataset 'ucf' --mode 'infer'  # dataset:['ucf', 'xd', 'sh']  mode:['train', 'infer']

Results and Models

Below are the results with score smoothing in the testing phase. Note that our experiments are conducted on a single Tesla A40 GPU, and different torch or cuda versions can lead to slightly different results.

Dataset	AUC (%)	AP (%)	FAR (%)	ckpt	log
UCF-Crime	86.76	33.99	0.47	link	link
XD-Violence	94.94	85.59	0.57	link	link
ShanghaiTech	98.14	72.56	0.00	link	link

Acknowledgement

Our codebase mainly refers to XDVioDet and CLIP. We greatly appreciate their excellent contribution with nicely organized code!

Citation

If this repo works positively for your research, please consider citing our paper. Thanks all!

@article{pu2023learning,
  title={Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection},
  author={Pu, Yujiang and Wu, Xiaoyu and Wang, Shengjin},
  journal={arXiv preprint arXiv:2306.14451},
  year={2023}
}

yatoyun / pel4vad Goto Github PK

pel4vad's Introduction

Learning Prompt-Enhanced Context features for Weakly-Supervised Video Anomaly Detection

Abstract

Introduction

Requirements

Datasets

Quick Start

Results and Models

Acknowledgement

Citation

pel4vad's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent