li-plus / dsnet
DSNet: A Flexible Detect-to-Summarize Network for Video Summarization
Home Page: https://ieeexplore.ieee.org/document/9275314
License: MIT License
While running inference on the TVSum and SumMe datasets (using the pretrained SumMe-trained and TVSum-trained models, respectively, provided by the authors), at higher sampling rates (sr) I get blank output video files for some of the videos (a 258-byte video file is dumped). It seems that the "pred_summ" variable is all "False".
Does that mean the model produces no summary candidate frames for that particular video at that sr? Also, is reducing the sr the only solution, or can another variable be changed to enable summary creation? Note that the videos in both datasets are of decent length, i.e. 1 to 4 minutes.
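One possible explanation, offered as an assumption rather than a confirmed diagnosis: key-shot selection is commonly done as a knapsack over shot segments with a length budget (often 15% of the video). If a coarser sampling rate leaves only a few long segments, none of them may fit the budget and the resulting mask is all False. The sketch below checks for that situation; the function name, the 15% default, and the segment lengths are hypothetical.

```python
def segments_within_budget(n_frame_per_seg, proportion=0.15):
    """Return indices of segments short enough to fit the summary budget alone.

    Assumption: the summary may use at most `proportion` of all frames.
    """
    n_frames = sum(n_frame_per_seg)
    capacity = int(n_frames * proportion)
    return [i for i, length in enumerate(n_frame_per_seg) if length <= capacity]


# Hypothetical segment lengths (in frames) for one 3000-frame video.
fine = [200, 300, 250, 400, 350, 500, 1000]   # many short segments
coarse = [1400, 1600]                         # only two long segments

print(segments_within_budget(fine))    # some segments can be selected
print(segments_within_budget(coarse))  # empty list: no segment fits
```

If the second case applies, a lower sr or a larger budget would be the variables to try.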
Hello sir, I am Santhoshkumar S from Anna University, currently in my final year of computer science and engineering. For our final-year project we chose your DSNet project. My question is: what would a 30 percent implementation of this project be? Also, is feature extraction code for input videos available?
Hello, I want to extract the features and train the network, but I don't know the original video names behind the entries in eccv16_dataset_tvsum_google_pool5.h5 for the TVSum dataset.
Hello guys.
I'm currently running some experiments on DSNet, but every time I need to set up the network in a different environment I have problems with pip install -r requirements.txt. This happens primarily because of the torch-sparse package, which is not compatible with several CUDA versions. Maybe it's a good idea to add building pytorch and torch-sparse from source to the README as a recommendation.
Some useful links:
Building pytorch from source -> https://github.com/pytorch/pytorch
Premade wheels for pip installing pytorch -> https://download.pytorch.org/whl/torch/
Building torch-sparse from source -> https://github.com/rusty1s/pytorch_sparse
Premade wheels for pip installing torch-sparse -> https://pytorch-geometric.com/whl/
Thank you very much for your amazing network.
When I run train.py, I get the following error:
Traceback (most recent call last):
File "train.py", line 73, in
main()
File "train.py", line 62, in main
fscore = trainer(args, split, ckpt_path)
File "/home/yaoyc/DSNet/src/anchor_based/train.py", line 50, in train
gtscore, cps, n_frames, nfps, picks)
File "/home/yaoyc/DSNet/src/helpers/vsumm_helper.py", line 75, in get_keyshot_summ
assert pred.shape == picks.shape
AttributeError: 'str' object has no attribute 'shape'
This error occurred, but I don't know what's wrong; I followed the instructions in the README.
When evaluating the model, you use the F1 score and treat the entire video summarization task as a sequence labeling task. But when condensing a video, the positive and negative samples are often extremely imbalanced. So does the F1 score really make sense?
In addition, in your code you average the scores for the TVSum dataset but take the maximum for the other datasets. What is the reason for this?
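To make the two reductions concrete, here is a minimal sketch of scoring one predicted summary against several annotators and then either averaging or taking the maximum. It is an illustration only, not the repository's evaluation code; the function names and the toy data are hypothetical.

```python
def f1_score(pred, target):
    """F1 between two binary frame masks of equal length."""
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(target)
    return 2 * precision * recall / (precision + recall)


def summary_fscore(pred, user_summary, reduction="avg"):
    """Compare one prediction with every annotator, then reduce the scores."""
    scores = [f1_score(pred, user) for user in user_summary]
    if reduction == "avg":
        return sum(scores) / len(scores)
    if reduction == "max":
        return max(scores)
    raise ValueError("reduction must be 'avg' or 'max'")


# Toy example: 8 frames, 2 annotators.
pred = [0, 0, 1, 1, 0, 0, 1, 1]
users = [
    [0, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 1],
]
avg_score = summary_fscore(pred, users, "avg")
max_score = summary_fscore(pred, users, "max")
print(avg_score, max_score)
```

The maximum rewards agreement with the single closest annotator, while the average rewards agreement with all of them, so the two numbers are not directly comparable across datasets.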
When I use infer.py to summarize my own video, the generated summary video cannot be played, but the summary plays fine when I use the video provided by the author.
Could you release your pre-trained model?
Hi, I was looking for the correct splits of the datasets so that I could experiment correctly, but I found this split https://github.com/ok1zjf/VASNet/tree/master/splits and it differs from the one you made. Both use the same datasets, so I was wondering if you know which one is correct. Thanks!
I tried using the baseline model with LSTM on my version of the dataset. I downloaded the videos and loaded the labels using the make_dataset.py script. However, the labels in my dataset don't match the original ones. Despite this, I tested the model on this modified dataset using the average of the user_summary annotations as the evaluation labels. The resulting F-score was about 0.30. Then, I tried using the maximum value instead, which gave better results with an F-score of 0.52.
Later, I tried evaluating the model using the gt_score and converting it to shot summaries, similar to our training approach. After evaluation, I got an average F-score of 0.70. But the F1-score varied a lot.
As you can see in the image, the F1-score keeps changing. My question is whether this way of evaluating is not good, and if the unstable F-score indicates a problem.
Loading DSNet model ...
Preprocessing source video ...
#ERROR
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument
Predicting summary ...
Writing summary video ...
python infer.py anchor-based --ckpt-path ../models/custom/checkpoint/custom.yml.0.pt
--source ../custom_data/videos/EE-bNr36nyA.mp4 --save-path ./output.mp4
Loading DSNet model ...
Traceback (most recent call last):
File "infer.py", line 66, in
main()
File "infer.py", line 18, in main
model.load_state_dict(state_dict)
File "/home/mossad/aeye/DSNet/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DSNet:
Unexpected key(s) in state_dict: "fc_ctr.weight", "fc_ctr.bias".
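The unexpected `fc_ctr.*` keys suggest that the checkpoint was saved from a different model variant than the one `infer.py` built (for example, a checkpoint trained in one mode and loaded in another; this is a guess, the traceback does not confirm it). A small sketch for comparing the two key sets before calling `load_state_dict` follows. The helper name and the key lists are hypothetical; in practice the lists would come from `model.state_dict().keys()` and from the loaded checkpoint dictionary.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report keys the model expects but lacks, and keys it does not know."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical key lists for illustration only.
model_keys = ["fc_cls.weight", "fc_cls.bias", "fc_loc.weight", "fc_loc.bias"]
checkpoint_keys = model_keys + ["fc_ctr.weight", "fc_ctr.bias"]

report = diff_state_dict_keys(model_keys, checkpoint_keys)
print(report)
```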
I want to reproduce the result described as "For the baseline, we removed the interest proposal formulation and only applied a self-attention layer to predict the importance scores." How do I run the baseline model? Looking forward to your reply.
In the evaluation phase, you use features that have already been extracted into the h5 file. However, when I run infer.py to summarize a raw video from the TVSum dataset, the extracted features are completely different from those in the h5 file, and the prediction on the original video is completely wrong. So I want to ask: is the feature extraction method really the one in 'src/helpers/video_helper.py', using features extracted by GoogLeNet? Could you provide the feature extraction method used for your h5 file?
I have a question about the features:
Do I need additional processing of the video frames when using GoogLeNet to extract video features? For example, normalization and other operations, or should I directly resize the original video frames and use the network to obtain features?
What should I do if I want to use ResNet for feature extraction?
Thank you!
As I said in #12, when I re-extract the features to train the network, the F1 score of the model is only about 0.3. Is this normal? Has anyone re-extracted the features from the raw videos?
I imagined that the JSON in the custom_data folder would model all frames as a binary list.
For example, assume we have 10 frames in the video and the most important segments are frames [3 >> 6] and [9 >> 10];
then the annotation would be [0,0,1,1,1,1,0,0,1,1].
In other words, I am confused by this statement in the readme.md file:
The user summary of a video is a UxN binary matrix, where U denotes the number of annotators and N denotes the number of frames in the original video
Why replicate the frames U times, and what is U?
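Going by the README statement quoted above, U is the number of annotators, so each row is one person's binary summary of the same N frames rather than a copy of the frames. A minimal sketch, reusing the 10-frame example from the question as the first annotator; the helper name and the other two annotators are hypothetical.

```python
def segments_to_mask(segments, n_frames):
    """Turn 1-based inclusive (start, end) segments into a binary frame mask."""
    mask = [0] * n_frames
    for start, end in segments:
        for i in range(start - 1, end):
            mask[i] = 1
    return mask


n_frames = 10
# One list of selected segments per annotator (U = 3 here).
annotators = [
    [(3, 6), (9, 10)],   # the example from the question
    [(1, 2), (4, 6)],    # hypothetical second annotator
    [(5, 10)],           # hypothetical third annotator
]

# U x N matrix: one row per annotator, one column per frame.
user_summary = [segments_to_mask(segs, n_frames) for segs in annotators]
for row in user_summary:
    print(row)
```

With a single annotator the matrix simply has one row (U = 1).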
Could you release your feature extraction code?
Can you post a tutorial detailing the environment configuration process?
Hi there,
First, thanks for providing your code and boosting the open-source mentality of video summarization research.
I have a question about the sets of videos used. In this issue you refer to a validation set f_score, but no validation_keys appear in the splits, nor do I see a validation set in the source code; rather, the test set is used for validation. Am I missing something, or is using the test set to pick the best model data leakage?
I think that the proper usage is [1]:
Training set: The sample of data used to fit the model.
Validation set: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
Test set: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
To be honest, though, the small number of videos in the video summarization datasets shouldn't be enough for specifying a validation set.
Thanks in advance.
George
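For reference, the three-way protocol described above can be sketched as a disjoint split of video keys. This is an illustration only: the key names `train_keys`, `val_keys` and `test_keys`, the ratios, and the video names are assumptions, not the repository's split format.

```python
import random


def split_keys(keys, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle video keys and cut them into disjoint train/val/test lists."""
    keys = list(keys)
    random.Random(seed).shuffle(keys)  # seeded for reproducibility
    n_test = int(len(keys) * test_ratio)
    n_val = int(len(keys) * val_ratio)
    return {
        "test_keys": keys[:n_test],
        "val_keys": keys[n_test:n_test + n_val],
        "train_keys": keys[n_test + n_val:],
    }


# Hypothetical dataset of 50 videos.
keys = [f"video_{i}" for i in range(1, 51)]
split = split_keys(keys)
print({name: len(items) for name, items in split.items()})
```

The best checkpoint would then be picked on `val_keys` and reported once on `test_keys`; with only a few dozen videos, both held-out sets become very small, which is the trade-off raised in the question.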
Hi,
May I ask how Params are calculated in Table 2? I used the following code to calculate Params, and the result differs significantly from the 8.53 million in Table 2. The result I got was 4.33 million.
import torch
from thop import profile
from src.anchor_based.dsnet import DSNet
# Build the anchor-based model with the settings I used.
model = DSNet('attention', 1024, 128, [4, 8, 16, 32], 8)
# Dummy input; named dummy_input to avoid shadowing the built-in input().
dummy_input = torch.randn(1, 1024, 1024)
# thop returns the operation count and the parameter count.
flops, params = profile(model, inputs=(dummy_input, ))
print('flops:{}'.format(flops))
print('params:{}'.format(params))
I would like to know how you calculated it and look forward to receiving your reply. Thank you very much!
When reproducing your code, I found that the OVP and YouTube datasets are not in the same format as the TVSum and SumMe datasets: they are missing 'change_points', 'n_frames', 'picks', etc., which prevents the model from running the transfer and augmented settings. How should this be solved?
In Table 1 of the paper, I found that the TVSum, YouTube, and OVP datasets have the SAME duration (Min, Max, Avg). Is this correct?
Also, how can I get the OVP and YouTube datasets?
The code can generate video summaries on the TVSum and SumMe datasets. How can we generate summaries for new videos?
I found that the data file contains "change_points" and "n_frame_per_seg". Do we have to obtain these annotations before we can generate summaries for new videos?
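These two fields describe a shot segmentation rather than human labels, so they can in principle be computed from the video itself. The sketch below assumes that "change_points" holds inclusive [start, end] frame index pairs and that "n_frame_per_seg" is the length of each pair; it uses uniform segments as a stand-in for a real change-point detector such as kernel temporal segmentation. Both function names are hypothetical.

```python
def uniform_change_points(n_frames, seg_len):
    """Cut the video into fixed-length segments with inclusive end indices.

    Stand-in for a real shot-boundary detector; for illustration only.
    """
    starts = range(0, n_frames, seg_len)
    return [[s, min(s + seg_len, n_frames) - 1] for s in starts]


def frames_per_segment(change_points):
    """Length of each segment, assuming inclusive [start, end] pairs."""
    return [end - start + 1 for start, end in change_points]


cps = uniform_change_points(n_frames=250, seg_len=100)
nfps = frames_per_segment(cps)
print(cps)
print(nfps)
```

The segment lengths always sum to the number of frames, which is a quick consistency check for any segmentation written into a dataset file.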
ortools has been updated, which causes an issue while training on a custom dataset.
Please fix it.
Thank you.
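The issue does not say which call broke. If it is the knapsack step used for key-shot selection (an assumption), one workaround that does not depend on the OR-Tools API is a small pure-Python 0/1 knapsack solver. This is a sketch, not the author's fix; the function name and the example numbers are hypothetical, and weights must be non-negative integers.

```python
def knapsack(values, weights, capacity):
    """0/1 knapsack by dynamic programming; returns indices of chosen items."""
    n = len(values)
    # best[i][c]: best total value using the first i items within capacity c.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if weight <= c and best[i - 1][c - weight] + value > best[i][c]:
                best[i][c] = best[i - 1][c - weight] + value  # take item i-1
    # Walk back through the table to recover which items were taken.
    picked, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            picked.append(i - 1)
            c -= weights[i - 1]
    return sorted(picked)


# Toy example: segment scores as values, segment lengths as weights.
picked = knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50)
print(picked)
```

The table has (items + 1) x (capacity + 1) entries, so this is fine for a few hundred segments but slower than a compiled solver on long videos.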
Hi,
May I ask why there are five .pt files for each dataset in the pre-trained models? I have tried the models on the default custom video and all five files give quite different results. Which one should I choose? Thanks in advance.
I ran your code for the anchor-free model on canonical TVSum, as instructed in the README file, and I get results similar to the ones reported in the paper. To be precise, I get these numbers:
mean: 0.6160917484037374
split0: 0.624260622317302
split1: 0.5672167705637337
split2: 0.6329998280152374
split3: 0.6279028963330486
split4: 0.6280786247893654
But I have noticed that the best F-score in each split is obtained right at the start of training (after a couple of epochs). This can be observed in the following F-score vs. epochs plot, where each colour corresponds to a separate TVSum split.
As you can see, the best F-scores are not much better than the F-scores obtained with randomly initialized model weights. This makes me wonder whether the model is indeed learning something meaningful. Any thoughts on this?