
hmdb51-two-stream-action-recognition's Introduction

VIDEO CLASSIFICATION: HUMAN ACTION RECOGNITION ON HMDB51 DATASET


We use a spatial stream CNN (fine-tuned ResNet-50) and a temporal stream CNN (stacked optical flows) under the Keras framework to perform video-based human action recognition on the HMDB-51 dataset.

References

Abstract

HMDB51 is a large database of videos for human action recognition, which is the main task of our project.

Human Action Recognition (HAR) is the task of classifying one or more actions performed by one or more human subjects. The rapid advancement of computer vision and the large amount of video available have made video classification methods the most common approach to the problem. Our specific task was Simple Action Recognition: training a model that classifies a single global action (performed by one or more subjects) associated with a short input video.

HMDB-51 contains 51 distinct action categories, each associated with at least 101 clips, for a total of 6766 annotated clips extracted mainly from movies and YouTube videos. The action classes can be grouped into: 1) general facial actions (such as smiling), 2) facial actions with object manipulation (such as eating), 3) general body movements (such as jumping), 4) body movements involving interaction with objects (such as combing), and 5) body movements with human interaction (such as hugging).

The data preparation process that preceded model training involved: 1) splitting the dataset into training and test sets using the split suggested by the authors, which ensures that a) clips from the same video are not present in both the training and test sets, b) each class has exactly 70 training and 30 test clips, and c) the relative proportion balance of the meta tags is maximized; 2) frame extraction and optical flow estimation. Optical flow is the apparent motion pattern of objects in a scene between two consecutive frames, caused by object or camera motion; it corresponds to a field of two-dimensional displacement vectors between one frame and the next. Dual TV-L1, an algorithm for dense optical flow (i.e., one that computes a motion vector for each pixel) available in the OpenCV library, was used for the estimation, and for each flow the horizontal and vertical components were stored separately.
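As a minimal sketch (not the exact settings of the repository script, which uses a CUDA-enabled OpenCV build), a single Dual TV-L1 flow field can be computed with the CPU contrib API and converted into two grayscale images; the `bound` clipping value below is an assumed default:

```python
import cv2
import numpy as np

def flow_to_images(prev_bgr, next_bgr, bound=15):
    """Estimate Dual TV-L1 optical flow between two consecutive frames and
    return the horizontal and vertical components as uint8 grayscale images.
    The bound (clipping range) is an assumed default, not the repo's value."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # requires opencv-contrib
    flow = tvl1.calc(prev_gray, next_gray, None)     # H x W x 2 float32

    # Clip displacements to [-bound, bound] and rescale to [0, 255] so each
    # component can be stored as a separate grayscale image.
    flow = np.clip(flow, -bound, bound)
    flow = ((flow + bound) * (255.0 / (2 * bound))).astype(np.uint8)
    return flow[..., 0], flow[..., 1]  # x (horizontal), y (vertical)
```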

Requirements

  • python 3.10.7
  • ipython 8.6.0
  • imageio 2.22.4
  • keras 2.11.0
  • matplotlib 3.6.2
  • numpy 1.22.1
  • pandas 1.3.5
  • Pillow 9.3.0
  • scipy 1.9.3
  • tensorflow 2.10.0
  • tqdm 4.64.1
  • opencv 4.5.1

Preparing HMDB-51

Step 0. Prepare Folders

First of all, create three empty folders: data/hmdb51/videos, data/hmdb51/rawframes and Models.
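For example, the folders can be created from Python:

```python
import os

# Create the folders expected by the later steps (paths from the step above).
for d in ("data/hmdb51/videos", "data/hmdb51/rawframes", "Models"):
    os.makedirs(d, exist_ok=True)
```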

Step 1. Prepare Annotations (training/test splits)

Download the annotations from the official website and extract them into data/hmdb51/annotations.

Step 2. Prepare Videos

Download and extract the HMDB-51 dataset into the data/hmdb51/videos folder.

Step 3. Extract or Download RGB and Flow Frames (requires CUDA-enabled OpenCV)

Option 1: generate them with the dedicated script

If you have plenty of SSD space, then we recommend extracting frames there for better I/O performance. Run the following script: data/rawframes&opticalflow_extraction.py
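For reference, a minimal sketch of RGB frame extraction for a single clip (the output filename pattern is an assumption); the repository script additionally computes the optical flow stacks and parallelizes over videos:

```python
import os
import cv2

def extract_rgb_frames(video_path, out_dir):
    """Dump every frame of one video as a JPEG (filename pattern assumed)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
```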

Option 2: download our pre-generated frames

  1. Download the .zip file from Google Drive
  2. Extract the .zip file content in data/hmdb51/rawframes

Step 4. Generate File List

Generate the file lists for raw frames and videos (training and test video lists with paths).

Run the following script: data/video_train_test_split_list_generator.py
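As a hedged sketch, the official split files (one `<class>_test_split1.txt` per class, where the per-clip tag is 1 = train, 2 = test, 0 = unused) can be turned into train/test lists along these lines; the output filenames and entry format below are assumptions, and the actual layout is produced by the script above:

```python
import glob
import os

train, test = [], []
for split_file in glob.glob("data/hmdb51/annotations/*_test_split1.txt"):
    label = os.path.basename(split_file).replace("_test_split1.txt", "")
    with open(split_file) as f:
        for line in f:
            if not line.strip():
                continue
            name, tag = line.split()[:2]
            entry = f"{label}/{os.path.splitext(name)[0]} {label}"
            if tag == "1":      # training clip
                train.append(entry)
            elif tag == "2":    # test clip
                test.append(entry)

# Assumed output filenames, for illustration only.
with open("data/hmdb51/train_list.txt", "w") as f:
    f.write("\n".join(train) + "\n")
with open("data/hmdb51/test_list.txt", "w") as f:
    f.write("\n".join(test) + "\n")
```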

Model

Data augmentation

  • Both streams apply the same data augmentation techniques, such as random cropping and random horizontal flipping. Temporally, we pick the starting frame among those early enough to guarantee the desired number of frames (a minimal sketch follows).
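A minimal sketch of this augmentation; the 224x224 crop size and 10-frame stack length are taken from the temporal-stream description below, and the exact transforms live in the training notebooks:

```python
import random

def random_crop_and_flip(frame, crop=224):
    """Randomly crop a crop x crop patch and flip it horizontally with p=0.5."""
    h, w = frame.shape[:2]
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    patch = frame[top:top + crop, left:left + crop]
    if random.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch

def random_start_frame(total_frames, stack_len=10):
    """Pick a start index early enough that a full flow stack fits in the clip."""
    return random.randint(0, total_frames - stack_len)
```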

Spatial-stream cnn (finetuned ResNet-50)

  • As mentioned before, we use a ResNet-50 first pre-trained on ImageNet and then fine-tuned on our HMDB-51 spatial RGB frame dataset (a sketch follows this list).

  • 17 equidistant frames are sampled from each video.

  • Run the following script: spatial_stream_cnn_finetuned.ipynb
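A hedged sketch of the spatial stream setup: equidistant frame indices plus a ResNet-50 pre-trained on ImageNet with a new 51-way softmax head. The pooling, optimizer and layer-freezing choices here are assumptions; the exact configuration is in spatial_stream_cnn_finetuned.ipynb.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def equidistant_indices(total_frames, n=17):
    """Indices of n equidistant frames in a clip with total_frames frames."""
    return np.linspace(0, total_frames - 1, n).astype(int)

# ImageNet-pretrained backbone with a new classification head for 51 classes.
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))
spatial_cnn = models.Sequential([
    base,
    layers.Dense(51, activation="softmax"),  # one output per HMDB-51 class
])
spatial_cnn.compile(optimizer="adam",
                    loss="categorical_crossentropy",
                    metrics=["accuracy"])
```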

Temporal-stream cnn

  • We train the temporal-stream CNN from scratch. In every mini-batch, we randomly select 128 (batch size) videos from the training videos and further randomly select one optical flow stack from each video. We follow the reference paper and use 10 x-channels and 10 y-channels for each optical flow stack, resulting in an input shape of (224, 224, 20).

  • The input to the motion CNN is a stack of optical flow images containing 10 x-channel and 10 y-channel images, so its input shape is (224, 224, 20), which can be considered a 20-channel image (a sketch of such a network follows this list).

  • Multiple workers are utilized in the data generator for faster training.

  • Run the following script: motion_stream.ipynb
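A hedged sketch of a temporal-stream network trained from scratch on the (224, 224, 20) flow stacks; the layer sizes below are illustrative assumptions, not the architecture used in motion_stream.ipynb:

```python
from tensorflow.keras import layers, models

# Small CNN over 20-channel flow stacks; sizes are illustrative only.
motion_cnn = models.Sequential([
    layers.Input(shape=(224, 224, 20)),
    layers.Conv2D(96, 7, strides=2, activation="relu", padding="same"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, strides=2, activation="relu", padding="same"),
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(512, 3, activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(51, activation="softmax"),
])
motion_cnn.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
```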

Testing

  • We fuse the two streams by averaging their softmax scores (a minimal fusion sketch follows the evaluation strategies below).

Two evaluation strategies were used:

  • [1] "One frame per video" method: for each batch element, a random video is selected, and a single frame from that video is given as input to the spatial stream. The selected frame is also used as the starting frame of the stack of optical flows from 10 consecutive frames, which is given as input to the temporal stream. The softmax scores are averaged, and the prediction is compared with the label.

    • Run the following script: two_stream_fusion_validate.ipynb
  • [2] "Whole video" method: an arbitrary video is selected. All frames of the video are given as input to the spatial stream, and the stream prediction is the average of the probabilities over all frames. All possible stacked optical flows of the video (the last frame usable as a start frame is total_frames - 10) are given as input to the motion stream, and the stream prediction is the average of the probabilities over all stacks. The final prediction is the average of the probabilities of the two streams.

    • Run the following script: two_stream_fusion_validate_full_video.ipynb
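A minimal sketch of the late fusion used in both strategies, assuming spatial_probs and temporal_probs are the (num_samples, 51) softmax outputs of the two streams:

```python
import numpy as np

def fuse_predictions(spatial_probs, temporal_probs):
    """Average the per-class softmax scores of the two streams and take the argmax."""
    fused = (spatial_probs + temporal_probs) / 2.0
    return np.argmax(fused, axis=1)
```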

Results

Network             One frame   Whole video
Spatial ResNet-50   39.0%       45.0%
Temporal            25.8%       NA
Fusion              38.0%       NA


hmdb51-two-stream-action-recognition's Issues

Error when Extracting RGB and flow images

Hello, I am trying to replicate your method and I encountered some errors when preparing the dataset.

First, I downloaded your extracted raw frames, but after extracting the archive it is locked and does not allow me access to the files.

Then I tried extracting the raw frames myself, but when I run this script to extract the frames, I get the error below:
data/rawframes&opticalflow_extraction.py

Here is the error message

get videos list done! 
mapping ...
{} read error! 
[the line above is repeated ~20 times]
Inizialize capturing of BaseballSwingAnalysis_swing_baseball_f_nm_np1_fr_med_15.avi.
{} read error! 
{} read error! 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/arptracker/.local/lib/python3.10/site-packages/skvideo/io/ffmpeg.py", line 276, in _read_frame_data
    assert len(arr) == framesize
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/arptracker/Vincent/action_recognition/HMDB-51/Action_Recognition_Two_Stream_HMDB51/rawframes.py", line 70, in dense_flow
    videocapture=skvideo.io.vread(video_path)
  File "/home/arptracker/.local/lib/python3.10/site-packages/skvideo/io/io.py", line 148, in vread
    for idx, frame in enumerate(reader.nextFrame()):
  File "/home/arptracker/.local/lib/python3.10/site-packages/skvideo/io/ffmpeg.py", line 303, in nextFrame
    yield self._readFrame()
  File "/home/arptracker/.local/lib/python3.10/site-packages/skvideo/io/ffmpeg.py", line 287, in _readFrame
    s = self._read_frame_data()
  File "/home/arptracker/.local/lib/python3.10/site-packages/skvideo/io/ffmpeg.py", line 281, in _read_frame_data
    raise RuntimeError("%s" % (err1,))
RuntimeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/arptracker/anaconda3/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/arptracker/anaconda3/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/arptracker/Vincent/action_recognition/HMDB-51/Action_Recognition_Two_Stream_HMDB51/rawframes.py", line 72, in dense_flow
    print('{} read error! ').format(video_name)
AttributeError: 'NoneType' object has no attribute 'format'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/arptracker/Vincent/action_recognition/HMDB-51/Action_Recognition_Two_Stream_HMDB51/rawframes.py", line 191, in <module>
    pool.map(dense_flow,zip(video_list,flows_dirs,flows_dirs_cls,[step]*len(video_list),[bound]*len(video_list)))
  File "/home/arptracker/anaconda3/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/arptracker/anaconda3/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value

I would appreciate it if:

  1. you could share a link to the extracted raw frames that is not locked, or
  2. you could help me understand how to address the above error.

Thanks.

How long does it take to capture an .avi file?

Hello,
I was confused that data/rawframes&opticalflow_extraction.py takes so long (>15 min; my GPU is a 3060 and I have 14 GB of RAM) and it just stays here every time I run it.
[screenshot omitted]
BTW, could you enable the Discussions section in the repo?
Thank you very much for your help!
