
Introduction


This is the official GitHub page for the paper:

Junaid Ahmed Ghauri, Sherzod Hakimov, and Ralph Ewerth: "Supervised Video Summarization via Multiple Feature Sets with Parallel Attention". In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2021.

The paper is available on:

MSVA (Multi Source Visual Attention)

MSVA is a deep learning model for supervised video summarization. In this work, we investigate how different feature types, i.e., static and motion features, can be integrated into a model architecture for video summarization.

Figure: MSVA model architecture.
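A minimal sketch of the idea: one self-attention branch per feature source (object, I3D RGB, I3D FLOW), fused by addition at an intermediate stage ("inter" fusion). Module and dimension names here are hypothetical placeholders, not the repository's actual implementation:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.K = nn.Linear(dim, dim, bias=False)
        self.Q = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                        # x: (n_steps, dim)
        logits = self.Q(x) @ self.K(x).t()       # (n_steps, n_steps)
        weights = torch.softmax(logits, dim=-1)  # rows sum to 1
        return weights @ self.V(x)               # (n_steps, dim)

class ParallelAttentionFusion(nn.Module):
    def __init__(self, dim=1024, n_sources=3):
        super().__init__()
        self.branches = nn.ModuleList(SelfAttention(dim) for _ in range(n_sources))
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, feats):                    # feats: list of (n_steps, dim) tensors
        fused = sum(att(f) for att, f in zip(self.branches, feats))
        return self.head(fused).squeeze(-1)      # per-frame importance scores in [0, 1]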

Get started (Requirements and Setup)

Python version >= 3.6

# clone the repository
git clone git@github.com:VideoAnalysis/MSVA.git
cd MSVA
conda create -n msva python=3.6
conda activate msva  
pip install -r requirements.txt

Dataset

Extracted features for the datasets can be downloaded as follows:

wget -O datasets.tar https://zenodo.org/record/4682137/files/msva_video_summarization.tar
tar -xvf datasets.tar

Dataset File Structure

Dataset directory layout and file hierarchy:

/datasets
   /kinetic_features	(features extracted using pretrained I3D model)	
      /summe
         /FLOW
            /features
               /*.npy   (files containing extracted features)
         /RGB
            /features
               /*.npy
         /targets	(labels synchronized with /object_features labels)
               /*.npy  
      /tvsum
         /FLOW
            /features
               /*.npy 
         /RGB
            /features
               /*.npy 
         /targets
               /*.npy
   /object_features
      /eccv16_dataset_summe_google_pool5.h5
      /eccv16_dataset_tvsum_google_pool5.h5
      /readme.txt

h5 files structure (object_features)
/key
    /features                 2D-array with shape (n_steps, feature-dimension)
    /gtscore                  1D-array with shape (n_steps), stores ground truth importance score (used for training, e.g. regression loss)
    /user_summary             2D-array with shape (num_users, n_frames), each row is a binary vector (used for test)
    /change_points            2D-array with shape (num_segments, 2), each row stores indices of a segment
    /n_frame_per_seg          1D-array with shape (num_segments), indicates number of frames in each segment
    /n_frames                 number of frames in original video
    /picks                    positions of subsampled frames in original video
    /n_steps                  number of subsampled frames
    /gtsummary                1D-array with shape (n_steps), ground truth summary provided by user (used for training, e.g. maximum likelihood)
    /video_name (optional)    original video name, only available for SumMe dataset	
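A short snippet for inspecting these files (a sketch assuming h5py and numpy are installed; the .npy file name below is a hypothetical example):

import h5py
import numpy as np

# object features (GoogLeNet pool5) for SumMe
with h5py.File("datasets/object_features/eccv16_dataset_summe_google_pool5.h5", "r") as f:
    key = list(f.keys())[0]
    features = f[key]["features"][...]   # (n_steps, 1024)
    gtscore = f[key]["gtscore"][...]     # (n_steps,)
    print(key, features.shape, gtscore.shape)

# kinetic (I3D) features are plain .npy files, one per video
rgb = np.load("datasets/kinetic_features/summe/RGB/features/Air_Force_One.npy")
print(rgb.shape)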

Training and Crossfold Validation

# train and run cross-fold validation with the default parameters in parameters.json
python train.py -params parameters.json

Inference with extracted features to obtain summary scores

python inference.py -dataset "summe" -video_name "Air_Force_One" -model_weight "model_weights/summe_random_non_overlap_0.5359.tar.pth"

Figure: example inference output.

Experimental Configuration

Update parameters.json with the desired experimental parameters.

{"verbose":False,
"train":True,
"use_cuda":True,
"cuda_device":0,
"max_summary_length":0.15,
"weight_decay":0.00001,
"lr":[0.00005],
"epochs_max":300,
"train_batch_size":5,
"fusion_technique":'inter',
"method":'mean',
"sample_technique":'sub',
"stack":'v',
"name_anchor":"inter_add_aperture_250",
"output_dir" : "./results/",
"apertures":[250],
"combis":[[1,1,1]],
"feat_input":[1024],
"object_features":['datasets/object_features/eccv16_dataset_summe_google_pool5.h5',
			   'datasets/object_features/eccv16_dataset_tvsum_google_pool5.h5'],
"kinetic_features":"./datasets/kinetic_features/",
"splits":['splits/tvsum_splits.json',
 'splits/summe_splits.json',
 'splits/summe_random_non_overlap_splits.json',
 'splits/tvsum_random_non_overlap_splits.json'],
"feat_input":{"feature_size":365,"L1_out":365,"L2_out":365,"L3_out":512,"pred_out":1,"apperture":250,"dropout1":0.5,"att_dropout1":0.5,"feature_size_1_3":1024,"feature_size_4":365}}

Other Options for Configuration Parameters

"verbose" // True or False : if you want to see detailed running logs or not
"use_cuda" // True or False : if code should execute with GPU or CPU
"cuda_device" // 0, GPU index which will be running the deeplearning code.
"max_summary_length" //  0.15 is the summary length default set in experiments in early work for bench mark dataset
"weight_decay" //  0.00001 weight decay in torch adam optimizer 
"lr" // [0.00005] as learning rate during optimization 
"epochs_max" // maximum number of epochs in training
"train_batch_size"// 5, the trainign batch size, you can vary this to experiment with. 
"fusion_technique" // fusion technique cabe be 'early', 'inter' or 'late'
"method" //  this method reffers to early fusion operation. It can be 'min' for minimum, 'max' for maximum or 'mean' to take average of all. 
"sample_technique" // this can be 'sub' for sub sample or 'up' for up sample as interpolation when features are not matching the shape
"stack" // in early fusion you want 'v' for vertical stack or 'h' for horizontal stack of features of all sources 
"name_anchor" // this is just a name you want to add in to the models name and result files saved during train or test like "inter_add_aperture_250"
"output_dir" // output directory where you want to save the results like "./results/"
"apertures" // aperture size you want to experiment with like [250] but it can be a list you want to treat as hyperparameter optimization like [50, 100, 150, 200, 250, 300, 350, 400]
"combis" // combination of features you want to experiment with for example [[1,1,1]] means all three sources but it can be list of combination to see different combination roles like [[1,0,0],[1,1,1],[1,0,1],[1,1,0],[0,1,1],[0,1,0],[0,0,1]]

Citation

@inproceedings{ghauri2021MSVA,
   title={Supervised Video Summarization via Multiple Feature Sets with Parallel Attention},
   author={Ghauri, Junaid Ahmed and Hakimov, Sherzod and Ewerth, Ralph},
   booktitle={IEEE International Conference on Multimedia and Expo (ICME)},
   year={2021}
}

For the original sources of these datasets, including the videos, see: “SumMe: Creating Summaries from User Videos” (ECCV 2014) and “TVSum: Summarizing Web Videos Using Titles” (CVPR 2015).


Issues

Which F1-Score is reported in the paper?

Hello,

Thank you for making the code open-source. :)

I am trying to reproduce your results, and while looking into the code, it seems there are two types of split files for the non-overlapping splits: random and ordered. Could you tell me which F1 score is reported in the paper?

Best,
Noga

extract object features

Hi, I am trying to extract the object features using your code in https://github.com/VideoAnalysis/EDUVSUM/tree/master/src.

According to your paper, you are using GoogLeNet trained on ImageNet. I assume that you are extracting features using the "modelInceptionV3" model as in the code. However, the feature shape of "inceptionv3_feature = modelInceptionV3.predict(frmRz299)" is (8, 8, 2048). I tried changing the model initialization to
"modelInceptionV3 = InceptionV3(weights='imagenet', pooling='avg', include_top=False)" to get a 2048-dimensional feature vector.
However, the object feature vector length is 1024 in the MSVA code, and I noticed that the feature values from the extraction code are quite different from those in the MSVA code: the former can be larger than 1, while the latter seem to be normalized to the [0, 1] range.

Have I missed something?

Issue regarding the last step of self-attention (the weighted sum step)

Hi, I noticed that the last step of the self-attention calculation doesn't seem right:

att_weights_ = nn.functional.softmax(logits, dim=-1)   # each row of att_weights_ sums to 1
weights = self.dropout(att_weights_)
y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)   # equivalent to weights.t() @ V

So here the softmax probability is computed along dim=-1, meaning each row of the weight matrix sums to 1.
But the weighted sum is then taken down the columns of the weight matrix (which are not normalized), according to this line

y = torch.matmul(V.transpose(1,0), weights).transpose(1,0)

I think we should do something like this

y = torch.matmul(weights,V)

What do you think?
I hope I'm the one who needs to be corrected.
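The observation can be checked numerically; a standalone sketch (not the repository code), with V of shape (n_steps, dim) and logits of shape (n_steps, n_steps):

import torch

n, d = 4, 3
V = torch.randn(n, d)
logits = torch.randn(n, n)
weights = torch.softmax(logits, dim=-1)

print(weights.sum(dim=-1))  # rows sum to 1 (normalized)
print(weights.sum(dim=0))   # columns do not sum to 1

y_orig = torch.matmul(V.transpose(1, 0), weights).transpose(1, 0)
print(torch.allclose(y_orig, weights.t() @ V))  # True: the original sums over columns

y_fix = torch.matmul(weights, V)                # uses the normalized rows instead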

Request for changes in `train.py` and `knapsack.py`

Thank you, MSVA team, for making the code public and, most importantly, the splits public.


While trying to reproduce your results, I encountered two errors (line 314):

train_val_loss_score.append([loss,np.mean(avg_loss[:, 0]),val_fscore,test_loss, video_scores,kt,sp])

Appending loss and test_loss (both torch.cuda.Tensor) creates an issue ("can't convert CUDA tensor to numpy array") when they are converted to a numpy array at line 538.

Additional INFO:

According to the current version of ortools, the code should be:

import numpy as np
from ortools.algorithms.python import knapsack_solver

# the new Python wrapper exposes solver types via the SolverType enum
osolver = knapsack_solver.KnapsackSolver(
    knapsack_solver.SolverType.KNAPSACK_DYNAMIC_PROGRAMMING_SOLVER,
    'test')

def knapsack_ortools(values, weights, items, capacity):
    scale = 1000
    values = np.array(values)
    weights = np.array(weights)
    values = (values * scale).astype(int)   # np.int is removed in NumPy >= 1.24
    weights = weights.astype(int)
    osolver.init(values.tolist(), [weights.tolist()], [capacity])
    computed_value = osolver.solve()
    packed_items = [x for x in range(0, len(weights))
                    if osolver.best_solution_contains(x)]
    return packed_items
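
A quick sanity check with hypothetical shot scores and lengths (the dynamic-programming solver is exact, so the selection is deterministic):

shot_scores = [0.9, 0.2, 0.5, 0.7]    # per-shot importance (hypothetical)
shot_lengths = [10, 20, 30, 40]       # shot lengths in frames (hypothetical)
picked = knapsack_ortools(shot_scores, shot_lengths, items=4, capacity=50)
print(picked)                         # [0, 3]: best total score within capacity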

Thanks 😺

Unable to reproduce the Spearman and Kendall tau scores reported in the paper

Hello,

While reproducing the results, I am unable to get the Spearman and Kendall tau scores that you reported in the paper; my values deviate a lot from the reported ones.

Can you please provide the actual code for computing it?

As per lines 430 and 431 in MSVA/train.py (commit dad26a6):

pS=spearmanr(y_pred2,y_true2)[0]
kT=kendalltau(rankdata(-np.array(y_true2)), rankdata(-np.array(y_pred2)))[0]

We tried many different combinations but were unable to get near the values you report (see the protocol sketch at the end of this issue for what we assume is the intended computation).
Please guide us!

Thanks in advance 😃 .
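
For reference, these correlation metrics are commonly computed by comparing the predicted frame scores against each annotator's scores separately and averaging the per-annotator coefficients (the protocol of Otani et al., 2019). A sketch of that protocol, assuming user_scores has shape (num_users, n_frames), and not necessarily the exact computation used for the paper:

import numpy as np
from scipy.stats import kendalltau, spearmanr

def rank_correlations(pred_scores, user_scores):
    # pred_scores: (n_frames,) predicted importance scores
    # user_scores: (num_users, n_frames) per-annotator ground-truth scores
    taus = [kendalltau(pred_scores, u)[0] for u in user_scores]
    rhos = [spearmanr(pred_scores, u)[0] for u in user_scores]
    return float(np.mean(taus)), float(np.mean(rhos))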
