
ms-tcn's People

Contributors

yabufarha, yassersouri


ms-tcn's Issues

About implementation of KL divergence Loss

Thank you for your contribution! I used torch.nn.KLDivLoss to implement the KL loss, but the loss I get is NaN. Could you tell me the details of how the KL loss is implemented in MS-TCN?
Thanks.
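For reference, torch.nn.KLDivLoss expects log-probabilities as the input and probabilities as the target; feeding raw logits, or taking the log of zero-valued probabilities, is a common cause of NaN. A minimal sketch (tensor names and shapes are illustrative, not from this repo):

```python
import torch
import torch.nn.functional as F

# logits_p, logits_q: (batch, num_classes, time) raw network outputs (illustrative shapes)
logits_p = torch.randn(2, 19, 100)
logits_q = torch.randn(2, 19, 100)

kl = torch.nn.KLDivLoss(reduction='batchmean')
# input must be log-probabilities, target must be probabilities;
# log_softmax avoids log(0) = -inf, which would otherwise propagate to NaN
loss = kl(F.log_softmax(logits_p, dim=1), F.softmax(logits_q, dim=1))
```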

Size of temporal receptive field

Hello,

I see that the temporal receptive field of a single stage is 2047, since you use only 10 (L) layers. Since you only pass the output probabilities (without additional features) to the next stage, I think the temporal receptive field does not increase with additional stages.
So, is the temporal receptive field of the entire network 2047 irrespective of the number of stages?

Please clarify my understanding.

Thank you,
Raj
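For context, the 2047 figure for a single stage follows from kernel size 3 with the dilation doubling from 1 to 512; a quick sanity check:

```python
# Receptive field of one stage: kernel size 3, dilations 1, 2, 4, ..., 2^(L-1);
# each dilated layer widens the field by 2 * dilation frames.
L = 10
receptive_field = 1 + sum(2 * 2 ** i for i in range(L))
print(receptive_field)  # 2047
```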

feature modality

In the paper, you mention that you extracted features from RGB frames using I3D.
Did you include any other modalities (e.g. optical flow, MHI) in your features?

I am a little confused because most of the methods you compare against use RGB + MHI (Motion History Image). It is really impressive if you beat them using RGB only.

Download Dataset

Hi, thank you for the great work! I am having trouble downloading the data folder from MEGA; it seems I need to pay for it. I'm wondering if you did any preprocessing. Will it still work if I download the dataset from the official website? Thanks!

Feature extraction and Python3 support

Hi @yabufarha ,

thanks for the good work.

Could you please provide more information on the feature extraction? I understand that you used https://github.com/ahsaniqbal/Kinetics-FeatureExtractor. Does it use information from future frames, i.e., to compute the features of frame i, is it something like (i-10, ..., i-1, i, i+1, ..., i+10) for a 21-frame temporal window, or does it only look back?

A second question: since Python 2 is deprecated as of January 2020, I wasn't able to get the feature extraction repo to work. Could you provide a recipe for switching to Python 3, both for your repo and for the feature extraction repo? I know that would not be easy, but any comments and suggestions would be useful.

Best,

Where did the features of the datasets come from?

Hi there, after reading the paper and code, I found that MS-TCN takes video features as input. I then loaded a feature file into a numpy array and saw that the data is a matrix of shape (2048, n).
Here is my confusion: can the features in the datasets be transformed back into video, or are they features extracted by some other backbone? If so, what is that extractor?
Looking forward to your reply.

Mask usage

Hi @yabufarha ,

I understand that you are maintaining a binary mask to avoid performing computations for the padded entries.

The second dimension of your mask indexes class labels:
mask = torch.zeros(len(batch_input), self.num_classes, max(length_of_sequences), dtype=torch.float)

I noticed that you index only the first class channel in your model's forward pass, in
DilatedResidualLayer,
SingleStageModel,
MultiStageModel.

Could you please explain the reason?

Many thanks.

Qualitative results

The visualizations of action segmentation you show in the paper (Fig. 3, 4, 6, etc.) are really cool.
I can generate the frame-level predictions using the code in this repo.
Can you also share the code or script that you use to produce those visualizations?

Thank you so much!
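Not the authors' script, but a minimal sketch of how such color-coded segmentation bars can be drawn from frame-level label files (one label per line, as in the groundTruth files); the paths and the label-to-index mapping are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_segments(label_files, label_to_idx, row_names):
    """Draw one color-coded bar per label file (e.g. ground truth vs. prediction)."""
    fig, axes = plt.subplots(len(label_files), 1, figsize=(12, 0.6 * len(label_files)), squeeze=False)
    for ax, path, name in zip(axes[:, 0], label_files, row_names):
        frames = [label_to_idx[line.strip()] for line in open(path) if line.strip()]
        ax.imshow(np.array(frames)[None, :], aspect='auto', cmap='tab20',
                  interpolation='nearest', vmin=0, vmax=len(label_to_idx) - 1)
        ax.set_xticks([]); ax.set_yticks([])
        ax.set_ylabel(name, rotation=0, ha='right', va='center')
    plt.tight_layout()
    plt.show()

# Hypothetical usage:
# plot_segments(['groundTruth/rgb-01-1.txt', 'results/rgb-01-1.txt'], label_to_idx, ['GT', 'MS-TCN'])
```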

Flow model for I3D

Hi, could you elaborate a bit on which flow model you used to generate the optical flow that serves as input to the I3D feature extraction?

Create my own dataset - how to use spatial cnn to create features?

Hi,
I would like to try the algorithm you have developed on my own videos. Can you elaborate a bit more on how you build such a dataset from, for example, MP4 files?

  1. I'm trying to figure out how to create the features folder. From the paper I understood it involves some kind of spatial CNN; how do I do this for MP4 files, for example?
  2. How are the ground truth files built? Is there a specific tool to create them?
  3. What are the files in the split folder, i.e. the bundle files? (See also the layout sketch below.)
    Thanks
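For reference, the released data folder appears to follow roughly this layout (a sketch based on the training scripts; exact names may differ):

```
data/
  <dataset>/                 # e.g. gtea, 50salads, breakfast
    features/                # one .npy per video, shape (2048, num_frames): I3D features
      <video_name>.npy
    groundTruth/             # one .txt per video, one action label per line (per frame)
      <video_name>.txt
    splits/                  # .bundle files: plain-text lists of groundTruth file names
      train.split1.bundle    #   defining the train/test partition of each cross-validation split
      test.split1.bundle
    mapping.txt              # "index action_name" pairs mapping class indices to labels
```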

What is the meaning of split

Dear authors,

What is the meaning of a split? There are many .bundle files in /data/split; can you explain this to me?

Thank you!

dimensionality of input features

I noticed the dimensionality of your input I3D features for each video is (2048,number of video frames).

I am confused about how the temporal dimension of your inputs can equal the number of frames, since the I3D network is supposed to downsample temporally by a factor of 8. Can you provide more details on how you obtained the I3D features?

Question about the dilated residual layer

Hello,
The dilated residual layer in Fig. 2 includes a 1x1 convolution after the ReLU activation. However, I cannot find an explanation of the role of this 1x1 convolution. Is it used to introduce more parameters and improve the expressiveness of the TCN?
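For reference, the layer in Fig. 2 corresponds roughly to the following module (a sketch along the lines of model.py in this repo; the 1x1 convolution sits between the ReLU and the residual addition):

```python
import torch.nn as nn
import torch.nn.functional as F

class DilatedResidualLayer(nn.Module):
    def __init__(self, dilation, in_channels, out_channels):
        super(DilatedResidualLayer, self).__init__()
        # dilated temporal conv -> ReLU -> 1x1 conv -> dropout, plus a residual connection
        self.conv_dilated = nn.Conv1d(in_channels, out_channels, 3, padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(out_channels, out_channels, 1)
        self.dropout = nn.Dropout()

    def forward(self, x, mask):
        out = F.relu(self.conv_dilated(x))
        out = self.conv_1x1(out)
        out = self.dropout(out)
        # residual connection; the mask zeroes out padded time steps
        return (x + out) * mask[:, 0:1, :]
```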

Performance on Breakfast

Hello,
I tried to replicate the experimental results shown in Table 10 of the paper.
The results for GTEA and 50Salads look OK, but the results for Breakfast are much worse than in the paper.
I also found that for Breakfast, the performance varies a lot between splits.
Did you encounter that as well?
It would be great if you could share the per-split performance on the Breakfast dataset.

Thank you.

Breakfast Dataset

Hi,
Is there a fine-tuned set of parameters for the Breakfast dataset?
Right now I am using:

num_stages = 4
num_layers = 10
num_f_maps = 64
features_dim = 400
batch_size = 1
lr = 0.0005

But I only get 60% accuracy at epoch 50.

Regards
Bill

Why masking?

Hello @yabufarha ,

I have read through the paper and did not find any information about the mask, but the code does use one. Could you please let me know why masking is used in the code?

Thanks

GTEA features

In your paper, you mention that you also tried to fine-tune the input features for the GTEA dataset. I just wonder whether the data shared in this repository is fine-tuned or not.

Thank you

How to get features?

I am trying to train the network on another dataset. How can I extract frame-wise features from a video that contains multiple actions using the I3D network?

Code problem

I got this error using Python 3. I tried using Python 2, but it now seems impossible to install torch for Python 2.
[error screenshot not included]

Stages for inference

In your implementation, you use all the stages to calculate the prediction loss, but only use the last stage for the final inference.

I am just curious: what if you fuse the predictions from all stages during inference?
Have you tried that?

Thank you.
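A minimal sketch of the fusion the question describes, assuming `predictions` is the list of per-stage outputs of shape (batch, num_classes, time) returned by the model (this is not something the paper reports):

```python
import torch
import torch.nn.functional as F

# predictions: list of per-stage logits, each of shape (batch, num_classes, time)
probs = torch.stack([F.softmax(p, dim=1) for p in predictions], dim=0)
fused = probs.mean(dim=0)               # average the stage-wise class probabilities
_, predicted = torch.max(fused, dim=1)  # fused frame-wise labels, shape (batch, time)
```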

Tool for creating video activity annotations like 50salads

Hi,
I want to create per-frame activity annotations for my own videos.
As I understand it, 50Salads provides an annotation for every frame.
Do you know of a good tool I can use to create such an annotation file?
Something where I mark segments in a video, give them a label, and the output is a file just like the ones in 50Salads?

Excluding first frame in loss calculation for every stage?

@yabufarha Thanks for your easy-to-understand code! I just have a question regarding this line in model.py:

loss += 0.15*torch.mean( torch.clamp( self.mse(F.log_softmax(p[:, :, 1:], dim=1), F.log_softmax(p.detach()[:, :, :-1], dim=1)), min=0, max=16 ) * mask[:, :, 1:] )

I noticed you're using p[:, :, 1:], p[:, :, :-1] and mask[:, :, 1:] instead of the entire tensors. I am trying to train this model at the video level (not the frame level) on the EGTEA+ dataset, where I only have one [2048, 1] feature vector and one label per video. With the above slicing I end up with empty tensors (since 1: or :-1 removes the only available time step). Although I see the loss decreasing when I use the full tensors without slicing, I wanted to know: what is the significance of doing it this way?
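For readers following along, here is the quoted line spelled out (the same computation, just annotated); the slicing aligns each frame t with its predecessor t-1, so with a single time step both slices are empty, which matches the behaviour described above:

```python
# p:    (batch, num_classes, T) stage output logits
# mask: (batch, num_classes, T), 1 for real frames, 0 for padded ones
log_p_curr = F.log_softmax(p[:, :, 1:], dim=1)             # frames 1 .. T-1
log_p_prev = F.log_softmax(p.detach()[:, :, :-1], dim=1)   # frames 0 .. T-2 (no gradient)
smooth = torch.clamp(self.mse(log_p_curr, log_p_prev), min=0, max=16)  # truncated per-frame MSE
loss += 0.15 * torch.mean(smooth * mask[:, :, 1:])          # ignore padded frames
```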

50Salads performance

I use the same environment as you:

PyTorch 0.4.1
Python 2.7.12

And here is my result.

split   F1@{10}   F1@{25}   F1@{50}   Edit      Acc
1       68.8845   66.5362   59.8826   64.3570   76.0544
2       71.8535   69.5652   59.0389   64.6865   75.3585
3       75.8465   74.0406   66.8172   69.6633   80.0396
4       73.3624   70.7424   64.1921   67.9526   81.5158
5       79.9031   78.4504   70.7022   72.4006   86.2126
avg     73.9700   71.8670   64.1266   67.8120   79.8362

Paper   F1@{10}   F1@{25}   F1@{50}   Edit      Acc
        76.3      74.0      64.5      67.9      80.7

It is much worse than yours, especially F1@{10} and F1@{25}.

How to use self-built dataset?

Hello, I am very interested in your work. How can I train the network on a self-built dataset? How do I generate a dataset for training?

Performance on longer videos

Hi @yabufarha ,

I am working with a dataset from another domain (surgical videos), with at least 80 videos, each longer than 30 minutes.

  • What was the typical length of the videos you experimented with?

  • Do you think the method can work well with longer videos, e.g. 30 minutes or 1 hour? Which parameters would potentially need adjusting? Could you give some suggestions?

Questions about fine-tuning dataset

Hi,

First, thanks for such a great contribution with MS-TCN.
In the paper, you mention that you tried fine-tuning on the GTEA dataset.
I was wondering whether you also tried fine-tuning on the Breakfast dataset?
If yes, can you share the features or the results of those experiments?
If not, is it possible to share some material on how I can obtain fine-tuned features for the Breakfast dataset?

Thank you

Online prediction

Hi @yabufarha ,

I would like your suggestions for training/evaluating the model for online prediction.

I have per-frame labels in my dataset, similar to the datasets you used in the paper, but my videos are quite long, i.e., from 30 minutes to 2 hours.

For online prediction, I am interested only in the label of the current time step, i.e., the current frame, whereas the model predicts labels for all time steps at once. The naive solution would be to run the model on frames[0:current], use only the prediction at the last index, and repeat for the next frame. But this brings a huge computational burden: every frame from the first time step up to the current one must be forwarded to obtain a single prediction, and it only gets worse as you move forward in time.

Do you have any suggestions for online prediction, including the training step? During training, I am simply sampling randomly (both a random offset and a random length for each sample) to obtain a fixed number of samples from each video.

Best,
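Not from the repo, but one way to bound the cost described above: only frames within the network's temporal receptive field of the current time step can influence its prediction, so a sliding buffer of the most recent W frames (with W at least as large as that receptive field) approximates feeding the full history. A rough sketch, with `model` and `feature_buffer` as placeholders:

```python
import numpy as np
import torch

W = 4096  # window length; choose it to cover the receptive field at the last frame

@torch.no_grad()
def predict_current(model, feature_buffer, device='cpu'):
    """feature_buffer: np.ndarray of shape (feature_dim, t) with all features seen so far.
    Returns the predicted class index for the newest frame only."""
    window = feature_buffer[:, -W:]                               # keep only the most recent W frames
    x = torch.from_numpy(window).float().unsqueeze(0).to(device)  # (1, feature_dim, T<=W)
    outputs = model(x)        # per-stage outputs; also pass a mask of ones if your version's forward takes (x, mask)
    last_stage = outputs[-1]  # (1, num_classes, T)
    return last_stage[0, :, -1].argmax().item()                   # label of the current frame
```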

len of npy is different from len of video

Hello, I downloaded the 50Salads .npy feature files and the label .txt files; the length of each feature file matches its labels. I also downloaded the official 50Salads dataset (https://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/) and checked the frame count of each video using the code below. It turns out that the features you provide are 8 frames shorter than the video, for all videos.
Did you drop the first 8 frames or the last 8 frames?

import os

import cv2
import numpy as np

path = r'.\data\50salad\rgb'        # original .avi videos
path2 = r'.\50salads\features'      # provided .npy features

def video_info(file):
    video_capture = cv2.VideoCapture(file)

    if not video_capture.isOpened():
        exit()

    frame_count = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_width = int(video_capture.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(video_capture.get(cv2.CAP_PROP_FRAME_HEIGHT))

    video_capture.release()
    return frame_count, frame_width, frame_height


for fname in os.listdir(path2):
    name = os.path.splitext(fname)[0]                   # strip the .npy extension
    npy = np.load(os.path.join(path2, fname))
    video = video_info(os.path.join(path, name + '.avi'))
    print(f"{name}, npy shape{npy.shape}, video info{video}")

The output of the above code is as follows:

rgb-01-1, npy shape(2048, 11679), video info(11687, 640, 480)
rgb-01-2, npy shape(2048, 12585), video info(12593, 640, 480)
rgb-02-1, npy shape(2048, 12415), video info(12423, 640, 480)
rgb-02-2, npy shape(2048, 10521), video info(10529, 640, 480)
rgb-03-1, npy shape(2048, 11373), video info(11381, 640, 480)
rgb-03-2, npy shape(2048, 11584), video info(11592, 640, 480)
rgb-04-1, npy shape(2048, 13168), video info(13176, 640, 480)
rgb-04-2, npy shape(2048, 12371), video info(12379, 640, 480)
rgb-05-1, npy shape(2048, 11115), video info(11123, 640, 480)
rgb-05-2, npy shape(2048, 12092), video info(12100, 640, 480)
rgb-06-1, npy shape(2048, 9694), video info(9702, 640, 480)
rgb-06-2, npy shape(2048, 8229), video info(8237, 640, 480)
rgb-07-1, npy shape(2048, 17793), video info(17801, 640, 480)
rgb-07-2, npy shape(2048, 15091), video info(15099, 640, 480)
rgb-09-1, npy shape(2048, 11547), video info(11555, 640, 480)
rgb-09-2, npy shape(2048, 14290), video info(14298, 640, 480)
rgb-10-1, npy shape(2048, 12329), video info(12337, 640, 480)
rgb-10-2, npy shape(2048, 9094), video info(9102, 640, 480)
rgb-11-1, npy shape(2048, 9435), video info(9443, 640, 480)
rgb-11-2, npy shape(2048, 8453), video info(8461, 640, 480)
rgb-13-1, npy shape(2048, 13880), video info(13888, 640, 480)
rgb-13-2, npy shape(2048, 13092), video info(13100, 640, 480)
rgb-14-1, npy shape(2048, 8599), video info(8607, 640, 480)
rgb-14-2, npy shape(2048, 8225), video info(8233, 640, 480)
rgb-15-1, npy shape(2048, 11489), video info(11497, 640, 480)
rgb-15-2, npy shape(2048, 13961), video info(13969, 640, 480)
rgb-16-1, npy shape(2048, 12865), video info(12873, 640, 480)
rgb-16-2, npy shape(2048, 10421), video info(10429, 640, 480)
rgb-17-1, npy shape(2048, 11146), video info(11154, 640, 480)
rgb-17-2, npy shape(2048, 12943), video info(12951, 640, 480)
rgb-18-1, npy shape(2048, 12077), video info(12085, 640, 480)
rgb-18-2, npy shape(2048, 7555), video info(7563, 640, 480)
rgb-19-1, npy shape(2048, 12817), video info(12825, 640, 480)
rgb-19-2, npy shape(2048, 11658), video info(11666, 640, 480)
rgb-20-1, npy shape(2048, 8488), video info(8496, 640, 480)
rgb-20-2, npy shape(2048, 8291), video info(8299, 640, 480)
rgb-21-1, npy shape(2048, 12912), video info(12920, 640, 480)
rgb-21-2, npy shape(2048, 12032), video info(12040, 640, 480)
rgb-22-1, npy shape(2048, 18143), video info(18151, 640, 480)
rgb-22-2, npy shape(2048, 12456), video info(12464, 640, 480)
rgb-23-1, npy shape(2048, 13274), video info(13282, 640, 480)
rgb-23-2, npy shape(2048, 14631), video info(14639, 640, 480)
rgb-24-1, npy shape(2048, 12211), video info(12219, 640, 480)
rgb-24-2, npy shape(2048, 7804), video info(7812, 640, 480)
rgb-25-1, npy shape(2048, 11159), video info(11167, 640, 480)
rgb-25-2, npy shape(2048, 8364), video info(8372, 640, 480)
rgb-26-1, npy shape(2048, 9126), video info(9134, 640, 480)
rgb-26-2, npy shape(2048, 9219), video info(9227, 640, 480)
rgb-27-1, npy shape(2048, 11859), video info(11867, 640, 480)
rgb-27-2, npy shape(2048, 12040), video info(12048, 640, 480)

loss nan

I tried to train the model on the 50Salads and Breakfast datasets, but the loss is NaN. I didn't make any changes to the code.

The url of data can not be reached

Hi, thank you for sharing the code. But recently, as the title says, I cannot open the URL for downloading the data. Can you update the URL? Thanks.

Regarding the extraction of I3D video features

Hi, Yazan,

Can I ask one question regarding the I3D video feature extraction?
As far as I know, I3D produces one feature per 16-frame clip. Do you use this setting to generate each feature, i.e., extract one feature for every 16 consecutive frames?
If so, assuming the index of the frame I want to generate a feature for is t, from which range should I sample the 16-frame clip: [t, t+15], [t-15, t], or [t-7, t+8]...?

Thank you in advance!

dataset can't download

Hello, when I download the several GB of data, the download always stops automatically. Can you give me some advice?

Reason for using mask[:, 0:1, :]

The code itself is very well written and self-explanatory. I understand that masks are used because the input videos have different lengths. I noticed a commit "mask intermediate layers for bz>1", and I can't seem to understand what purpose it serves. Any insights would be helpful. Thank you!

bug

Hi, I want to reproduce your results, but I ran into a problem while running the code. Is the code you provide correct? I am running PyTorch 1.0.1; I don't know if that is the cause of my error. My error is as follows, and I am quite puzzled:
Traceback (most recent call last):
  File "/Users/polypubki/Downloads/adata/main.py", line 72, in <module>
    trainer.train(model_dir, batch_gen, num_epochs=num_epochs, batch_size=bz, learning_rate=lr, device=device)
  File "/Users/polypubki/Downloads/adata/model.py", line 79, in train
    batch_input, batch_target, mask = batch_gen.next_batch(batch_size)
  File "/Users/polypubki/Downloads/adata/batch_gen.py", line 56, in next_batch
    batch_target_tensor = torch.ones(len(batch_input), max(length_of_sequences), dtype=torch.long)*(-100)
ValueError: max() arg is an empty sequence

Length of raw videos is different from the extracted features

Hi, I'd like to run the pipeline on the raw videos, but I found that the length of the raw videos and labels (downloaded from the official Breakfast dataset) differs from that of the extracted I3D features and the corresponding labels.

Did you do any preprocessing on the raw videos before extracting the features?

the variable "mask" in Trainer (line 71 in model.py)

I wonder what the "mask" in the following code is used for.

    batch_input, batch_target, mask = batch_gen.next_batch(batch_size)
    batch_input, batch_target, mask = batch_input.to(device), batch_target.to(device), mask.to(device)
    optimizer.zero_grad()
    predictions = self.model(batch_input)

    loss = 0
    for p in predictions:
        loss += self.ce(p.transpose(2, 1).contiguous().view(-1, self.num_classes), batch_target.view(-1))
        loss += 0.15*torch.mean(torch.clamp(self.mse(F.log_softmax(p[:, :, 1:], dim=1), F.log_softmax(p.detach()[:, :, :-1], dim=1)), min=0, max=16)*mask[:, :, 1:])

    epoch_loss += loss.item()
    loss.backward()
    optimizer.step()

    _, predicted = torch.max(predictions[-1].data, 1)  # predicted indices
    correct += ((predicted == batch_target).float()*mask[:, 0, :].squeeze(1)).sum().item()
    total += torch.sum(mask[:, 0, :]).item()

It does not seem to do anything and is always all ones.
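For context, the mask handles batches of videos with different lengths: shorter videos are zero-padded to the length of the longest one in the batch, and the mask marks the real frames. With batch_size 1 there is no padding, so the mask is indeed all ones. A rough sketch of how it is built, in the spirit of batch_gen.py (variable names illustrative):

```python
import torch

# batch_features: list of (feature_dim, T_i) tensors, one per video, with different lengths T_i
lengths = [f.shape[1] for f in batch_features]
T_max = max(lengths)

batch_input  = torch.zeros(len(batch_features), feature_dim, T_max)
batch_target = torch.ones(len(batch_features), T_max, dtype=torch.long) * (-100)  # -100 is ignored by CrossEntropyLoss
mask         = torch.zeros(len(batch_features), num_classes, T_max)

for i, f in enumerate(batch_features):
    batch_input[i, :, :lengths[i]] = f
    mask[i, :, :lengths[i]] = 1   # all class channels are identical, so mask[:, 0:1, :]
                                  # suffices wherever only the time dimension matters
```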

Modifying MSTCN to MSTCN++

Hi. Thank you so much for sharing your wonderful work. I would like to use MS-TCN++ for a scientific paper. I already have the code for MS-TCN; could you please tell me how I can modify it into MS-TCN++, for example, how do I add the dual dilated layer? If you could kindly share the MS-TCN++ model, that would be great.

Best regards
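Not the official MS-TCN++ code, but a rough sketch of the dual dilated layer as described in the MS-TCN++ paper: layer i of an L-layer stage combines a convolution with dilation 2^i and one with dilation 2^(L-1-i), concatenates the two, and fuses them with a 1x1 convolution before the residual connection. Details may differ from the authors' release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDilatedResidualLayer(nn.Module):
    """Layer i of an L-layer stage with dual dilations (sketch, not the official code)."""
    def __init__(self, i, num_layers, num_f_maps):
        super(DualDilatedResidualLayer, self).__init__()
        d1, d2 = 2 ** i, 2 ** (num_layers - 1 - i)    # increasing and decreasing dilation factors
        self.conv_d1 = nn.Conv1d(num_f_maps, num_f_maps, 3, padding=d1, dilation=d1)
        self.conv_d2 = nn.Conv1d(num_f_maps, num_f_maps, 3, padding=d2, dilation=d2)
        self.conv_fusion = nn.Conv1d(2 * num_f_maps, num_f_maps, 1)  # fuse the two branches
        self.dropout = nn.Dropout()

    def forward(self, x, mask):
        out = torch.cat([self.conv_d1(x), self.conv_d2(x)], dim=1)
        out = F.relu(self.conv_fusion(out))
        out = self.dropout(out)
        return (x + out) * mask[:, 0:1, :]
```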
