
ms-tcn's People

Contributors

yabufarha, yassersouri


ms-tcn's Issues

About implementation of KL divergence Loss

Thank you for your contribution! I used torch.nn.KLDivLoss to implement the KL loss, but the loss I get is NaN. Could you tell me the details of how the KL loss is implemented in MS-TCN?
Thanks.
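For reference, torch.nn.KLDivLoss expects log-probabilities as the input and probabilities as the target; feeding raw logits, or taking the log of zero-valued probabilities, is a common cause of NaN. A minimal sketch (tensor names and shapes are illustrative, not from this repo):

```python
import torch
import torch.nn.functional as F

# logits_p, logits_q: (batch, num_classes, time) raw network outputs (illustrative shapes)
logits_p = torch.randn(2, 19, 100)
logits_q = torch.randn(2, 19, 100)

kl = torch.nn.KLDivLoss(reduction='batchmean')
# input must be log-probabilities, target must be probabilities;
# log_softmax avoids log(0) = -inf, which would otherwise propagate to NaN
loss = kl(F.log_softmax(logits_p, dim=1), F.softmax(logits_q, dim=1))
```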

Size of temporal receptive field

Hello,

I see that the temporal receptive field of a single stage is 2047, since you use only 10 (L) layers. Since you only pass the output probabilities (without additional features) to the next stage, I think the temporal receptive field does not increase with additional stages.
So, is the temporal receptive field of the entire network 2047 irrespective of the number of stages?

Please clarify my understanding.

Thank you,
Raj
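For context, the 2047 figure for a single stage follows from kernel size 3 with the dilation doubling from 1 to 512; a quick sanity check:

```python
# Receptive field of one stage: kernel size 3, dilations 1, 2, 4, ..., 2^(L-1);
# each dilated layer widens the field by 2 * dilation frames.
L = 10
receptive_field = 1 + sum(2 * 2 ** i for i in range(L))
print(receptive_field)  # 2047
```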

feature modality

In the paper, you mention that you extracted features from RGB frames using I3D.
Did you include any other modalities (e.g. optical flow, MHI) in your features?

I am a little confused because most of the methods you compare against use RGB + MHI (Motion History Image). It is really impressive if you beat them using RGB only.

Download Dataset

Hi, thank you for the great work! I am having trouble downloading the data folder from MEGA; it seems I need to pay for it. I'm wondering if you did any preprocessing. Will it still work if I download the dataset from the official website? Thanks!

Feature extraction and Python3 support

Hi @yabufarha ,

thanks for the good work.

Could you please provide more information on the feature extraction? I understand that you used https://github.com/ahsaniqbal/Kinetics-FeatureExtractor. Does it use information from future frames, i.e., to compute the features of frame i, is it something like (i-10, ..., i-1, i, i+1, ..., i+10) for a 21-frame temporal window, or does it only look back?

A second question: since Python 2 is deprecated as of January 2020, I wasn't able to get the feature extraction repo to work. Could you provide a recipe for switching to Python 3, both for your repo and for the feature extraction repo? I know that would not be easy, but any comments and suggestions would be useful.

Best,

Where did the features of the datasets come from?

Hi there, after reading the paper and code, I found that MS-TCN takes video features as input. I then loaded a feature file into a numpy array and saw that the data is a matrix of shape (2048, n).
Here is my confusion: can the features in the datasets be transformed back into video, or are they features extracted by some other backbone? If so, what is that extractor?
Looking forward to your reply.

Mask usage

Hi @yabufarha ,

I understand that you are maintaining a binary mask to avoid performing computations for the padded entries.

The second dimension of your mask indexes class labels:
mask = torch.zeros(len(batch_input), self.num_classes, max(length_of_sequences), dtype=torch.float)

I noticed that you index only the first class channel in your model's forward pass, in
DilatedResidualLayer,
SingleStageModel,
MultiStageModel.

Could you please explain the reason?

Many thanks.

Qualitative results

The visualizations of action segmentation you show in the paper (Fig. 3, 4, 6, etc.) are really cool.
I can generate the frame-level predictions using the code in this repo.
Can you also share the code or script that you use to produce those visualizations?

Thank you so much!
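Not the authors' script, but a minimal sketch of how such color-coded segmentation bars can be drawn from frame-level label files (one label per line, as in the groundTruth files); the paths and the label-to-index mapping are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_segments(label_files, label_to_idx, row_names):
    """Draw one color-coded bar per label file (e.g. ground truth vs. prediction)."""
    fig, axes = plt.subplots(len(label_files), 1, figsize=(12, 0.6 * len(label_files)), squeeze=False)
    for ax, path, name in zip(axes[:, 0], label_files, row_names):
        frames = [label_to_idx[line.strip()] for line in open(path) if line.strip()]
        ax.imshow(np.array(frames)[None, :], aspect='auto', cmap='tab20',
                  interpolation='nearest', vmin=0, vmax=len(label_to_idx) - 1)
        ax.set_xticks([]); ax.set_yticks([])
        ax.set_ylabel(name, rotation=0, ha='right', va='center')
    plt.tight_layout()
    plt.show()

# Hypothetical usage:
# plot_segments(['groundTruth/rgb-01-1.txt', 'results/rgb-01-1.txt'], label_to_idx, ['GT', 'MS-TCN'])
```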

Flow model for I3D

Hi, could you elaborate a bit on which flow model you used to generate the optical flow that serves as input to the I3D feature extraction?

Create my own dataset - how to use spatial cnn to create features?

Hi,
I would like to try the algorithm you have developed on my own videos. Can you elaborate a bit more on how you build such a dataset from, for example, MP4 files?

  1. I'm trying to figure out how to create the features folder. From the paper I understood it involves some kind of spatial CNN; how do I do this for MP4 files, for example?
  2. How are the ground truth files built? Is there a specific tool to create them?
  3. What are the files in the split folder, i.e. the bundle files? (See also the layout sketch below.)
    Thanks
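For reference, the released data folder appears to follow roughly this layout (a sketch based on the training scripts; exact names may differ):

```
data/
  <dataset>/                 # e.g. gtea, 50salads, breakfast
    features/                # one .npy per video, shape (2048, num_frames): I3D features
      <video_name>.npy
    groundTruth/             # one .txt per video, one action label per line (per frame)
      <video_name>.txt
    splits/                  # .bundle files: plain-text lists of groundTruth file names
      train.split1.bundle    #   defining the train/test partition of each cross-validation split
      test.split1.bundle
    mapping.txt              # "index action_name" pairs mapping class indices to labels
```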

What is the meaning of split

Dear authors,

What is the meaning of a split? There are many .bundle files in /data/split; can you explain this to me?

Thank you!

dimensionality of input features

I noticed the dimensionality of your input I3D features for each video is (2048,number of video frames).

I am confused about how the temporal dimension of your inputs can equal the number of frames, since the I3D network is supposed to downsample temporally by a factor of 8. Can you provide more details on how you obtained the I3D features?

Question about the dilated residual layer

Hello,
The dilated residual layer in Fig. 2 includes a 1x1 convolution after the ReLU activation. However, I cannot find an explanation of the role of this 1x1 convolution. Is it used to introduce more parameters and improve the expressiveness of the TCN?
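For reference, the layer in Fig. 2 corresponds roughly to the following module (a sketch along the lines of model.py in this repo; the 1x1 convolution sits between the ReLU and the residual addition):

```python
import torch.nn as nn
import torch.nn.functional as F

class DilatedResidualLayer(nn.Module):
    def __init__(self, dilation, in_channels, out_channels):
        super(DilatedResidualLayer, self).__init__()
        # dilated temporal conv -> ReLU -> 1x1 conv -> dropout, plus a residual connection
        self.conv_dilated = nn.Conv1d(in_channels, out_channels, 3, padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(out_channels, out_channels, 1)
        self.dropout = nn.Dropout()

    def forward(self, x, mask):
        out = F.relu(self.conv_dilated(x))
        out = self.conv_1x1(out)
        out = self.dropout(out)
        # residual connection; the mask zeroes out padded time steps
        return (x + out) * mask[:, 0:1, :]
```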

Performance on Breakfast

Hello,
I tried to replicate the experimental results shown in Table 10 of the paper.
The results for GTEA and 50Salads look OK, but the results for Breakfast are much worse than in the paper.
I also found that for Breakfast, the performance varies a lot between splits.
Did you encounter that as well?
It would be great if you could share the per-split performance on the Breakfast dataset.

Thank you.

Breakfast Dataset

Hi,
Is there a fine-tuned set of parameters for the Breakfast dataset?
Right now I am using:

num_stages = 4
num_layers = 10
num_f_maps = 64
features_dim = 400
batch_size = 1
lr = 0.0005

But I only get 60% accuracy at epoch 50.

Regards
Bill

Why masking?

Hello @yabufarha ,

I have read through the paper and did not find any information about the mask, but the code does use one. Could you please let me know why masking is used in the code?

Thanks

GTEA features

In your paper, you mention that you also tried to fine-tune the input features for the GTEA dataset. I just wonder whether the data shared in this repository is fine-tuned or not.

Thank you

How to get features?

I am trying to train the network on another dataset. How can I extract frame-wise features from a video that contains multiple actions using the I3D network?

Code problem

I got this error using Python 3. I tried using Python 2, but it now seems impossible to install torch for Python 2.
[error screenshot not included]

Stages for inference

In your implementation, you use all the stages to calculate the prediction loss, but only use the last stage for the final inference.

I am just curious: what if you fuse the predictions from all stages during inference?
Have you tried that?

Thank you.
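A minimal sketch of the fusion the question describes, assuming `predictions` is the list of per-stage outputs of shape (batch, num_classes, time) returned by the model (this is not something the paper reports):

```python
import torch
import torch.nn.functional as F

# predictions: list of per-stage logits, each of shape (batch, num_classes, time)
probs = torch.stack([F.softmax(p, dim=1) for p in predictions], dim=0)
fused = probs.mean(dim=0)               # average the stage-wise class probabilities
_, predicted = torch.max(fused, dim=1)  # fused frame-wise labels, shape (batch, time)
```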

Tool for creating video activity annotations like 50salads

Hi,
I want to create per-frame activity annotations for my own videos.
As I understand it, 50Salads provides an annotation for every frame.
Do you know of a good tool I can use to create such an annotation file?
Something where I mark segments in a video, give them a label, and the output is a file just like the ones in 50Salads?

Excluding first frame in loss calculation for every stage?

@yabufarha Thanks for your easy-to-understand code! I just have a question regarding this line in model.py:

loss += 0.15*torch.mean( torch.clamp( self.mse(F.log_softmax(p[:, :, 1:], dim=1), F.log_softmax(p.detach()[:, :, :-1], dim=1)), min=0, max=16 ) * mask[:, :, 1:] )

I noticed you're using p[:, :, 1:], p[:, :, :-1] and mask[:, :, 1:] instead of the entire tensors. I am trying to train this model at the video level (not the frame level) on the EGTEA+ dataset, where I only have one [2048, 1] feature vector and one label per video. With the above slicing I end up with empty tensors (since 1: or :-1 removes the only available time step). Although I see the loss decreasing when I use the full tensors without slicing, I wanted to know: what is the significance of doing it this way?
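For readers following along, here is the quoted line spelled out (the same computation, just annotated); the slicing aligns each frame t with its predecessor t-1, so with a single time step both slices are empty, which matches the behaviour described above:

```python
# p:    (batch, num_classes, T) stage output logits
# mask: (batch, num_classes, T), 1 for real frames, 0 for padded ones
log_p_curr = F.log_softmax(p[:, :, 1:], dim=1)             # frames 1 .. T-1
log_p_prev = F.log_softmax(p.detach()[:, :, :-1], dim=1)   # frames 0 .. T-2 (no gradient)
smooth = torch.clamp(self.mse(log_p_curr, log_p_prev), min=0, max=16)  # truncated per-frame MSE
loss += 0.15 * torch.mean(smooth * mask[:, :, 1:])          # ignore padded frames
```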

50Salads performance

I use the same environment as you:

PyTorch 0.4.1
Python 2.7.12

And here is my result.

split   F1@{10}   F1@{25}   F1@{50}   Edit      Acc
1       68.8845   66.5362   59.8826   64.3570   76.0544
2       71.8535   69.5652   59.0389   64.6865   75.3585
3       75.8465   74.0406   66.8172   69.6633   80.0396
4       73.3624   70.7424   64.1921   67.9526   81.5158
5       79.9031   78.4504   70.7022   72.4006   86.2126
avg     73.9700   71.8670   64.1266   67.8120   79.8362

Paper   F1@{10}   F1@{25}   F1@{50}   Edit      Acc
        76.3      74.0      64.5      67.9      80.7

It is much worse than yours, especially F1@{10} and F1@{25}.

How to use self-built dataset?

Hello, I am very interested in your work. How can I train the network on a self-built dataset? How do I generate a dataset for training?

Performance on longer videos

Hi @yabufarha ,

I am working with a dataset from another domain (surgical videos), with at least 80 videos, each longer than 30 minutes.

  • What was the typical length of the videos you experimented with?

  • Do you think the method can work well with longer videos, e.g. 30 minutes or 1 hour? Which parameters would potentially need adjusting? Could you give some suggestions?

Questions about fine-tuning dataset

Hi,

First, thanks for such a great contribution with MS-TCN.
In the paper, you mention that you tried fine-tuning on the GTEA dataset.
I was wondering whether you also tried fine-tuning on the Breakfast dataset?
If yes, can you share the features or the results of those experiments?
If not, is it possible to share some material on how I can obtain fine-tuned features for the Breakfast dataset?

Thank you

Online prediction

Hi @yabufarha ,

I would like your suggestions for training/evaluating the model for online prediction.

I have per-frame labels in my dataset, similar to the datasets you used in the paper, but my videos are quite long, i.e., from 30 minutes to 2 hours.

For online prediction, I am interested only in the label of the current time step, i.e., the current frame, whereas the model predicts labels for all time steps at once. The naive solution would be to run the model on frames[0:current], use only the prediction at the last index, and repeat for the next frame. But this brings a huge computational burden: every frame from the first time step up to the current one must be forwarded to obtain a single prediction, and it only gets worse as you move forward in time.

Do you have any suggestions for online prediction, including the training step? During training, I am simply sampling randomly (both a random offset and a random length for each sample) to obtain a fixed number of samples from each video.

Best,
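Not from the repo, but one way to bound the cost described above: only frames within the network's temporal receptive field of the current time step can influence its prediction, so a sliding buffer of the most recent W frames (with W at least as large as that receptive field) approximates feeding the full history. A rough sketch, with `model` and `feature_buffer` as placeholders:

```python
import numpy as np
import torch

W = 4096  # window length; choose it to cover the receptive field at the last frame

@torch.no_grad()
def predict_current(model, feature_buffer, device='cpu'):
    """feature_buffer: np.ndarray of shape (feature_dim, t) with all features seen so far.
    Returns the predicted class index for the newest frame only."""
    window = feature_buffer[:, -W:]                               # keep only the most recent W frames
    x = torch.from_numpy(window).float().unsqueeze(0).to(device)  # (1, feature_dim, T<=W)
    outputs = model(x)        # per-stage outputs; also pass a mask of ones if your version's forward takes (x, mask)
    last_stage = outputs[-1]  # (1, num_classes, T)
    return last_stage[0, :, -1].argmax().item()                   # label of the current frame
```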

len of npy is different from len of video

Hello, I downloaded the 50Salads .npy feature files and the label .txt files; the length of each feature file matches its labels. I also downloaded the official 50Salads dataset (https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/) and checked the frame count of each video using the code below. It turns out that the features you provide are 8 frames shorter than the video, for all videos.
Did you drop the first 8 frames or the last 8 frames?

import os

import cv2
import numpy as np

path = r'.\data\50salad\rgb'        # original .avi videos
path2 = r'.\50salads\features'      # provided .npy features

def video_info(file):
    video_capture = cv2.VideoCapture(file)

    if not video_capture.isOpened():
        exit()

    frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_width = int(video_capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(video_capture.get(cv2.CAP_PROP_FRAME_HEIGHT))

    video_capture.release()
    return frame_count, frame_width, frame_height


for fname in os.listdir(path2):
    name = os.path.splitext(fname)[0]                   # strip the .npy extension
    npy = np.load(os.path.join(path2, fname))
    video = video_info(os.path.join(path, name + '.avi'))
    print(f"{name}, npy shape{npy.shape}, video info{video}")

The output of the above code is as follows:

rgb-01-1, npy shape(2048, 11679), video info(11687, 640, 480)
rgb-01-2, npy shape(2048, 12585), video info(12593, 640, 480)
rgb-02-1, npy shape(2048, 12415), video info(12423, 640, 480)
rgb-02-2, npy shape(2048, 10521), video info(10529, 640, 480)
rgb-03-1, npy shape(2048, 11373), video info(11381, 640, 480)
rgb-03-2, npy shape(2048, 11584), video info(11592, 640, 480)
rgb-04-1, npy shape(2048, 13168), video info(13176, 640, 480)
rgb-04-2, npy shape(2048, 12371), video info(12379, 640, 480)
rgb-05-1, npy shape(2048, 11115), video info(11123, 640, 480)
rgb-05-2, npy shape(2048, 12092), video info(12100, 640, 480)
rgb-06-1, npy shape(2048, 9694), video info(9702, 640, 480)
rgb-06-2, npy shape(2048, 8229), video info(8237, 640, 480)
rgb-07-1, npy shape(2048, 17793), video info(17801, 640, 480)
rgb-07-2, npy shape(2048, 15091), video info(15099, 640, 480)
rgb-09-1, npy shape(2048, 11547), video info(11555, 640, 480)
rgb-09-2, npy shape(2048, 14290), video info(14298, 640, 480)
rgb-10-1, npy shape(2048, 12329), video info(12337, 640, 480)
rgb-10-2, npy shape(2048, 9094), video info(9102, 640, 480)
rgb-11-1, npy shape(2048, 9435), video info(9443, 640, 480)
rgb-11-2, npy shape(2048, 8453), video info(8461, 640, 480)
rgb-13-1, npy shape(2048, 13880), video info(13888, 640, 480)
rgb-13-2, npy shape(2048, 13092), video info(13100, 640, 480)
rgb-14-1, npy shape(2048, 8599), video info(8607, 640, 480)
rgb-14-2, npy shape(2048, 8225), video info(8233, 640, 480)
rgb-15-1, npy shape(2048, 11489), video info(11497, 640, 480)
rgb-15-2, npy shape(2048, 13961), video info(13969, 640, 480)
rgb-16-1, npy shape(2048, 12865), video info(12873, 640, 480)
rgb-16-2, npy shape(2048, 10421), video info(10429, 640, 480)
rgb-17-1, npy shape(2048, 11146), video info(11154, 640, 480)
rgb-17-2, npy shape(2048, 12943), video info(12951, 640, 480)
rgb-18-1, npy shape(2048, 12077), video info(12085, 640, 480)
rgb-18-2, npy shape(2048, 7555), video info(7563, 640, 480)
rgb-19-1, npy shape(2048, 12817), video info(12825, 640, 480)
rgb-19-2, npy shape(2048, 11658), video info(11666, 640, 480)
rgb-20-1, npy shape(2048, 8488), video info(8496, 640, 480)
rgb-20-2, npy shape(2048, 8291), video info(8299, 640, 480)
rgb-21-1, npy shape(2048, 12912), video info(12920, 640, 480)
rgb-21-2, npy shape(2048, 12032), video info(12040, 640, 480)
rgb-22-1, npy shape(2048, 18143), video info(18151, 640, 480)
rgb-22-2, npy shape(2048, 12456), video info(12464, 640, 480)
rgb-23-1, npy shape(2048, 13274), video info(13282, 640, 480)
rgb-23-2, npy shape(2048, 14631), video info(14639, 640, 480)
rgb-24-1, npy shape(2048, 12211), video info(12219, 640, 480)
rgb-24-2, npy shape(2048, 7804), video info(7812, 640, 480)
rgb-25-1, npy shape(2048, 11159), video info(11167, 640, 480)
rgb-25-2, npy shape(2048, 8364), video info(8372, 640, 480)
rgb-26-1, npy shape(2048, 9126), video info(9134, 640, 480)
rgb-26-2, npy shape(2048, 9219), video info(9227, 640, 480)
rgb-27-1, npy shape(2048, 11859), video info(11867, 640, 480)
rgb-27-2, npy shape(2048, 12040), video info(12048, 640, 480)

loss nan

I tried to train the model on the 50Salads and Breakfast datasets, but the loss is NaN. I didn't make any changes to the code.

The url of data can not be reached

Hi, thank you for sharing the code. But recently, as the title says, I cannot open the URL for downloading the data. Can you update the URL? Thanks.

Regarding the extraction of I3D video features

Hi, Yazan,

Can I ask one question regarding the I3D video feature extraction?
As far as I know, I3D produces one feature per 16-frame clip. Do you use this setting to generate each feature, i.e., extract one feature for every 16 consecutive frames?
If so, assuming the index of the frame I want to generate a feature for is t, from which range should I sample the 16-frame clip: [t, t+15], [t-15, t], or [t-7, t+8]...?

Thank you in advance!

dataset can't download

Hello, when I download the several GB of data, the download always stops automatically. Can you give me some advice?

Reason for using mask[:, 0:1, :]

The code itself is very well written and self-explanatory. I understand that masks are used because the input videos have different lengths. I noticed a commit "mask intermediate layers for bz>1", and I can't seem to understand what purpose it serves. Any insights would be helpful. Thank you!

bug

Hi, I want to reproduce your results, but I ran into a problem while running the code. Is the code you provide correct? I am running PyTorch 1.0.1; I don't know if that is the cause of my error. My error is as follows, and I am quite puzzled:
Traceback (most recent call last):
  File "/Users/polypubki/Downloads/adata/main.py", line 72, in <module>
    trainer.train(model_dir, batch_gen, num_epochs=num_epochs, batch_size=bz, learning_rate=lr, device=device)
  File "/Users/polypubki/Downloads/adata/model.py", line 79, in train
    batch_input, batch_target, mask = batch_gen.next_batch(batch_size)
  File "/Users/polypubki/Downloads/adata/batch_gen.py", line 56, in next_batch
    batch_target_tensor = torch.ones(len(batch_input), max(length_of_sequences), dtype=torch.long)*(-100)
ValueError: max() arg is an empty sequence

Length of raw videos is different from the extracted features

Hi, I'd like to run the pipeline on the raw videos, but I found that the length of the raw videos and labels (downloaded from the official Breakfast dataset) differs from that of the extracted I3D features and the corresponding labels.

Did you do any preprocessing on the raw videos before extracting the features?

the variable "mask" in Trainer (line 71 in model.py)

I wonder what the "mask" in the following code is used for.

    batch_input, batch_target, mask = batch_gen.next_batch(batch_size)
    batch_input, batch_target, mask = batch_input.to(device), batch_target.to(device), mask.to(device)
    optimizer.zero_grad()
    predictions = self.model(batch_input)

    loss = 0
    for p in predictions:
        loss += self.ce(p.transpose(2, 1).contiguous().view(-1, self.num_classes), batch_target.view(-1))
        loss += 0.15*torch.mean(torch.clamp(self.mse(F.log_softmax(p[:, :, 1:], dim=1), F.log_softmax(p.detach()[:, :, :-1], dim=1)), min=0, max=16)*mask[:, :, 1:])

    epoch_loss += loss.item()
    loss.backward()
    optimizer.step()

    _, predicted = torch.max(predictions[-1].data, 1)  # predicted indices
    correct += ((predicted == batch_target).float()*mask[:, 0, :].squeeze(1)).sum().item()
    total += torch.sum(mask[:, 0, :]).item()

It does not seem to do anything and is always all ones.
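For context, the mask handles batches of videos with different lengths: shorter videos are zero-padded to the length of the longest one in the batch, and the mask marks the real frames. With batch_size 1 there is no padding, so the mask is indeed all ones. A rough sketch of how it is built, in the spirit of batch_gen.py (variable names illustrative):

```python
import torch

# batch_features: list of (feature_dim, T_i) tensors, one per video, with different lengths T_i
lengths = [f.shape[1] for f in batch_features]
T_max = max(lengths)

batch_input  = torch.zeros(len(batch_features), feature_dim, T_max)
batch_target = torch.ones(len(batch_features), T_max, dtype=torch.long) * (-100)  # -100 is ignored by CrossEntropyLoss
mask         = torch.zeros(len(batch_features), num_classes, T_max)

for i, f in enumerate(batch_features):
    batch_input[i, :, :lengths[i]] = f
    mask[i, :, :lengths[i]] = 1   # all class channels are identical, so mask[:, 0:1, :]
                                  # suffices wherever only the time dimension matters
```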

Modifying MSTCN to MSTCN++

Hi. Thank you so much for sharing your wonderful work. I would like to use MS-TCN++ for a scientific paper. I already have the code for MS-TCN; could you please tell me how I can modify it into MS-TCN++, for example, how do I add the dual dilated layer? If you could kindly share the MS-TCN++ model, that would be great.

Best regards
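Not the official MS-TCN++ code, but a rough sketch of the dual dilated layer as described in the MS-TCN++ paper: layer i of an L-layer stage combines a convolution with dilation 2^i and one with dilation 2^(L-1-i), concatenates the two, and fuses them with a 1x1 convolution before the residual connection. Details may differ from the authors' release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDilatedResidualLayer(nn.Module):
    """Layer i of an L-layer stage with dual dilations (sketch, not the official code)."""
    def __init__(self, i, num_layers, num_f_maps):
        super(DualDilatedResidualLayer, self).__init__()
        d1, d2 = 2 ** i, 2 ** (num_layers - 1 - i)    # increasing and decreasing dilation factors
        self.conv_d1 = nn.Conv1d(num_f_maps, num_f_maps, 3, padding=d1, dilation=d1)
        self.conv_d2 = nn.Conv1d(num_f_maps, num_f_maps, 3, padding=d2, dilation=d2)
        self.conv_fusion = nn.Conv1d(2 * num_f_maps, num_f_maps, 1)  # fuse the two branches
        self.dropout = nn.Dropout()

    def forward(self, x, mask):
        out = torch.cat([self.conv_d1(x), self.conv_d2(x)], dim=1)
        out = F.relu(self.conv_fusion(out))
        out = self.dropout(out)
        return (x + out) * mask[:, 0:1, :]
```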
