
adversarial-video-summarization-pytorch's Introduction

Introduction

We want to qualitatively estimate the visual diversity within our drive data. To measure scene diversity, we use the visual semantic similarity of the drives: for example, drives in high traffic density versus drives where the vehicle is waiting at traffic lights.

Training/Validation Data

We use pre-trained ResNet50 features extracted from BDD100K videos as our per-frame visual representation. To reduce computation, we downsample the BDD100K videos (30 fps @ 1280x720) to 5 fps @ 640x360. This generates TMAX ~ 200 vectors of D=2048 dimensions per video. We use temporal windows of length T=64, randomly sampled from the TMAX positions.
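The window sampling described above can be sketched as follows. This is a minimal illustration, not the repository's data loader: the function name `sample_window` is hypothetical, random features stand in for real ResNet50 outputs, and it assumes each window is a contiguous run of T frames.

```python
import numpy as np

T = 64      # temporal window length (from the text)
D = 2048    # ResNet50 feature dimension (from the text)


def sample_window(features, window=T, rng=None):
    """Return a random contiguous window of `window` frames from a [TMAX, D] array.

    Hypothetical helper: contiguous sampling is an assumption of this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    t_max = features.shape[0]
    if t_max < window:
        raise ValueError(f"video has {t_max} frames, need at least {window}")
    # Valid start positions are 0 .. t_max - window (inclusive).
    start = int(rng.integers(0, t_max - window + 1))
    return features[start:start + window]


# Stand-in for one video's features with TMAX = 200 frames.
video_features = np.random.default_rng(0).standard_normal((200, D)).astype(np.float32)
clip = sample_window(video_features, rng=np.random.default_rng(1))
print(clip.shape)  # (64, 2048)
```

Drawing a fresh window on every call means the model sees different segments of the same video across epochs.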

Video representations

We use an LSTM autoencoder as the video representation generator. The core idea follows this paper. An encoder LSTM reads the input visual features of shape [T, D] and generates a summary vector (or thought vector) of size S=128. The decoder LSTM reads the thought vector and reproduces the input visual features. We regress the reproduced visual features against the input visual features with an MSE loss. The intuition is that visual features that are redundant between frames are compressed by the autoencoder. T, S and D are hyper-parameters we control to trade off model complexity, performance and runtime. The autoencoder trained at this stage provides the eLSTM and dLSTM for the next stage.
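A minimal PyTorch sketch of such an autoencoder is shown below. The class name `LSTMAutoencoder`, the single-layer layout, the linear projection back to D dimensions, and feeding the summary vector to the decoder at every time step are all assumptions made for illustration; the repository's actual model may differ.

```python
import torch
import torch.nn as nn

T, D, S = 64, 2048, 128  # window length, feature size, summary size (from the text)


class LSTMAutoencoder(nn.Module):
    """Hypothetical sketch; the layer layout is an assumption, not the repository's code."""

    def __init__(self, feature_dim=D, summary_dim=S):
        super().__init__()
        self.encoder = nn.LSTM(feature_dim, summary_dim, batch_first=True)
        self.decoder = nn.LSTM(summary_dim, summary_dim, batch_first=True)
        self.project = nn.Linear(summary_dim, feature_dim)

    def encode(self, x):
        # x: [B, T, D]; the final hidden state serves as the summary vector.
        _, (h_n, _) = self.encoder(x)
        return h_n[-1]  # [B, S]

    def forward(self, x):
        summary = self.encode(x)
        # Repeat the summary vector so the decoder receives it at every time step.
        decoder_input = summary.unsqueeze(1).repeat(1, x.size(1), 1)
        decoded, _ = self.decoder(decoder_input)
        return self.project(decoded)  # [B, T, D]


torch.manual_seed(0)
model = LSTMAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch = torch.randn(4, T, D)  # random stand-in for a batch of feature windows

# One training step: reconstruct the input and regress against it with MSE.
reconstruction = model(batch)
loss = nn.functional.mse_loss(reconstruction, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(reconstruction.shape, model.encode(batch).shape)
```

Because the decoder only sees the S-dimensional summary, any information it needs to rebuild the T x D input has to pass through that bottleneck.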

Video summarisation

The core idea follows this paper. TODO

Getting Started

The paper mentions that the parameters of eLSTM and dLSTM were initialized with the parameters of a pre-trained recurrent autoencoder model trained on feature sequences from the original videos. The authors found that this improves overall accuracy and also leads to faster convergence.
So the project has 2 main parts:

  • Train the LSTM autoencoder using the original ResNet-50 video features.
  • Train the summarizer using the pre-trained weights from the first step.
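The hand-off between the two parts can be sketched as below. This is an illustration under assumptions: the `Summarizer` class, its attribute names `e_lstm` and `d_lstm`, and the checkpoint keys `"encoder"` and `"decoder"` are hypothetical, and an in-memory buffer stands in for a checkpoint file written by the first step.

```python
import io

import torch
import torch.nn as nn

D, S = 2048, 128  # feature size and summary size (from the text)

# Stand-in for step 1: a trained encoder/decoder pair saved to a checkpoint.
pretrained_encoder = nn.LSTM(D, S, batch_first=True)
pretrained_decoder = nn.LSTM(S, S, batch_first=True)
checkpoint_file = io.BytesIO()  # a real run would use a path on disk
torch.save(
    {"encoder": pretrained_encoder.state_dict(),
     "decoder": pretrained_decoder.state_dict()},
    checkpoint_file,
)
checkpoint_file.seek(0)


class Summarizer(nn.Module):
    """Hypothetical container holding only the two LSTMs relevant here."""

    def __init__(self):
        super().__init__()
        self.e_lstm = nn.LSTM(D, S, batch_first=True)
        self.d_lstm = nn.LSTM(S, S, batch_first=True)


# Step 2: initialize eLSTM and dLSTM from the pre-trained weights.
summarizer = Summarizer()
checkpoint = torch.load(checkpoint_file)
summarizer.e_lstm.load_state_dict(checkpoint["encoder"])
summarizer.d_lstm.load_state_dict(checkpoint["decoder"])
```

`load_state_dict` raises an error if the layer shapes differ, so the summarizer's LSTMs must be built with the same D and S as the autoencoder.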

Step1

  1. Train encoder LSTM (bidirectional = False)

python auto_encoder/train_encoder.py --train_features_list <train_features_list_path> --log_dir <save_logs_dir_path> --model_save_dir <path_to_model_dir>

  2. Train encoder LSTM (bidirectional = True)

python auto_encoder/train_decoder.py --train_features_list <train_features_list_path> --log_dir <save_logs_dir_path> --model_save_dir <path_to_model_dir>

learning_rate = 1e-4
batch_size = 256
num_workers = 12
n_epochs = 300
save_interval = 1000 steps

If you want to change these values, you can pass the variables as command-line arguments.
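As a rough illustration of how such defaults are typically exposed, here is an `argparse` sketch. The three path flags appear in the commands above; the hyperparameter flag names are assumptions derived from the variable names and may not match the training scripts exactly.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train the LSTM autoencoder (sketch)")
    # Path arguments shown in the commands above.
    parser.add_argument("--train_features_list", required=True)
    parser.add_argument("--log_dir", required=True)
    parser.add_argument("--model_save_dir", required=True)
    # Hyperparameters with the defaults listed above (flag names are assumed).
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--num_workers", type=int, default=12)
    parser.add_argument("--n_epochs", type=int, default=300)
    parser.add_argument("--save_interval", type=int, default=1000,
                        help="save a checkpoint every N steps")
    return parser


# Parse an example command line that overrides only the batch size.
args = build_parser().parse_args([
    "--train_features_list", "features.txt",
    "--log_dir", "logs",
    "--model_save_dir", "models",
    "--batch_size", "64",
])
print(args.batch_size, args.learning_rate)  # 64 0.0001
```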

  3. Build features index

python scripts/build_index.py --model_path <path_to_model> --features_list <path to text file containing resnet features files list> --index_path <path_where_to_save_index>

This script stores all encoded features in a database so that we can test our model and query by video features.

  4. Query for similar scenarios

python scripts/matcher.py --index_path <path_to_index> --model_path <path_to_model> --features_query <path_to_video_features_file>

Example:

python scripts/matcher.py -i drives-index.pck -m encoder_lstm.pth-43000 -q 000000_016839bf-0247-432f-8af6-5d33a12a0341-video.npy

The code returns the top 5 most similar videos. You can adjust the number of returned videos.
In order to compare video features and determine how similar they are, a similarity measure is required.
Here are the five most commonly used distance measures:

  • Euclidean: Arguably the best known and most used distance metric. The Euclidean distance is normally described as the distance between two points “as the crow flies”.
  • Manhattan: Also called “city block” distance. Imagine yourself in a taxicab taking turns along the city blocks until you reach your destination.
  • Chebyshev: The maximum distance between points in any single dimension.
  • Cosine: Used mostly with the vector space model, tf-idf weighting and high-dimensional positive spaces. Note that the cosine similarity function is not a proper distance metric: it violates both the triangle inequality and the coincidence axiom.
  • Hamming: Given two (normally binary) vectors, the Hamming distance measures the number of “disagreements” between the two vectors. Two identical vectors would have zero disagreements, and thus perfect similarity.
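The five measures listed above can be written in a few lines of plain Python. This is a reference sketch for intuition only; the function names are illustrative and are not taken from the repository.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    # Largest absolute difference along any single dimension.
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hamming(a, b):
    # Number of positions at which the two vectors disagree.
    return sum(x != y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]
print(manhattan(a, b))  # 5.0
print(chebyshev(a, b))  # 3.0
print(hamming(a, b))    # 2
print(euclidean(a, b))  # sqrt(13)

# Two different but parallel vectors have cosine distance (numerically) zero,
# which illustrates why it is not a proper metric.
parallel_gap = cosine_distance([1.0, 2.0], [2.0, 4.0])
```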

For this index, we use the Chebyshev distance.
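A top-k lookup with the Chebyshev distance might look like the sketch below. It assumes the index can be viewed as an array of S=128-dimensional summary vectors plus a parallel list of video names; the helper `top_k_chebyshev` and the names `video_000`, `video_001`, ... are hypothetical, and random vectors stand in for real encodings.

```python
import numpy as np


def top_k_chebyshev(index_vectors, index_names, query, k=5):
    """Return the k (name, distance) pairs closest to `query`, nearest first."""
    # Chebyshev distance: largest absolute difference in any single dimension.
    distances = np.max(np.abs(index_vectors - query), axis=1)
    order = np.argsort(distances)[:k]
    return [(index_names[i], float(distances[i])) for i in order]


rng = np.random.default_rng(0)
index_vectors = rng.standard_normal((10, 128))        # 10 indexed videos, S = 128
index_names = [f"video_{i:03d}" for i in range(10)]   # hypothetical identifiers

# Query with a slightly perturbed copy of entry 3; it should rank first.
query = index_vectors[3] + 0.01 * rng.standard_normal(128)
results = top_k_chebyshev(index_vectors, index_names, query, k=5)
for name, distance in results:
    print(name, round(distance, 4))
```

A linear scan like this is adequate for small indexes; changing `k` adjusts how many videos are returned.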

Step2

Train the summarizer using the pre-trained LSTM autoencoder weights.

License

Copyright © 2019 MoabitCoin

Distributed under the MIT License (MIT).

adversarial-video-summarization-pytorch's People

Contributors

emnamor


adversarial-video-summarization-pytorch's Issues

Did you succeed to train and test the model?

Hello,
thanks for sharing your code. It is good to have the pre-trained LSTM autoencoder model in your code.
I tried to train this model using different code, but it is very difficult to train and test.
How were your results? Did the losses converge? If possible, can you upload the results of your experiment?
Also, in the s_p part, it seems that every element of the feature is multiplied by a uniform sample. Shouldn't s_p have one scalar value for each frame?

Thank you

the dataset is too large

Hi,
I want to train this model now, but I found that the dataset is too large (1.8 TB). Where can I get a version of the data that is only several GB?
