
mast's Introduction

MAST: A Memory-Augmented Self-supervised Tracker

This repository contains the code (in PyTorch) for the model introduced in the following paper

MAST: A Memory-Augmented Self-supervised Tracker

Figure

Citation

@InProceedings{Lai20,
  author       = "Zihang Lai and Erika Lu and Weidi Xie",
  title        = "{MAST}: {A} Memory-Augmented Self-Supervised Tracker",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition",
  year         = "2020",
}

Contents

  1. Introduction
  2. Usage
  3. Results
  4. Contacts

Introduction

Recent interest in self-supervised dense tracking has yielded rapid progress, but performance still remains far from supervised methods. We propose a dense tracking model trained on videos without any annotations that surpasses previous self-supervised methods on existing benchmarks by a significant margin (+15%), and achieves performance comparable to supervised methods. In this paper, we first reassess the traditional choices used for self-supervised training and reconstruction loss by conducting thorough experiments that finally elucidate the optimal choices. Second, we further improve on existing methods by augmenting our architecture with a crucial memory component. Third, we benchmark on large-scale semi-supervised video object segmentation (a.k.a. dense tracking), and propose a new metric: generalizability. Our first two contributions yield a self-supervised network that for the first time is competitive with supervised methods on standard evaluation metrics of dense tracking. When measuring generalizability, we show self-supervised approaches are actually superior to the majority of supervised methods. We believe this new generalizability metric can better capture the real-world use-cases for dense tracking, and will spur new interest in this research direction.

Usage

  1. Install dependencies

    pip install -r requirements.txt
    
  2. Download the YouTube-VOS and DAVIS-2017 datasets. No pre-processing is needed.

Dependencies

  • Python3.7
  • PyTorch (1.1.0) Note that you need PyTorch 1.1 to reproduce the training results. We tried PyTorch 1.4, but the network seems to fail to converge; let us know if you know what the reason might be. Testing works fine under 1.4, though.
  • Pytorch Correlation module (0.0.8) Installing this module can be problematic; it also failed on one of our machines. Make sure python check.py forward and python check.py backward (this script comes from ClementPinard's GitHub page) both pass before training or testing. A minimal smoke test is also sketched after this list.
  • CUDA 10.0
  • OxUvA dataset
  • YouTube-VOS dataset
  • DAVIS-2017
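
If the correlation module installs but you want to verify it before a long run, a minimal smoke test (assuming the spatial_correlation_sampler package from ClementPinard's repository and a CUDA-capable GPU) could look like this:

    # Smoke test for the Pytorch Correlation module: exercises the forward and
    # backward CUDA kernels on a random feature map correlated with itself.
    import torch
    from spatial_correlation_sampler import SpatialCorrelationSampler

    sampler = SpatialCorrelationSampler(kernel_size=1, patch_size=7, stride=1,
                                        padding=0, dilation=1, dilation_patch=1)
    feat = torch.randn(1, 64, 32, 32, device='cuda', requires_grad=True)
    out = sampler(feat, feat)       # (B, patch, patch, H, W)
    out.sum().backward()            # exercises the backward kernel as well
    print(out.shape, feat.grad.abs().sum())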

Train

  • Use the following command to train
    python main.py --datapath path-to-kinetics --savepath log-path
    

Test and evaluation

  • Use the following command to generate output for official DAVIS testing code
    python benchmark.py --resume path-to-checkpoint \
                   --datapath path-to-davis \
                   --savepath log-path
    

To benchmark the full model, you need a GPU with roughly 22 GB of memory. If that is not available, you can add --ref 1 to reduce memory usage to about 16 GB by using only 4 reference frames, or --ref 2 to reduce it to under 6 GB by using only 1 reference frame; performance may drop by 5-10 points, though. Another option is to downsample the input image. You could also change the code to process the memories frame by frame, which slows down inference but should preserve accuracy (see the sketch below).
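
If you go the frame-by-frame route, one possible shape for it is a streaming (online) softmax over the memory frames, so that reference features are attended to one frame at a time instead of all at once. The sketch below uses plain dense attention for brevity; MAST's restricted (ROI) attention would replace the einsum calls, and none of the names come from this repository:

    import torch

    def propagate_streaming(feat_t, memory_feats, memory_labels, temperature=1.0):
        # feat_t: (B, C, N_target) target-frame features, flattened spatially
        # memory_feats: list of T tensors of shape (B, C, N_ref)
        # memory_labels: list of T tensors of shape (B, N_ref, K) soft labels
        num, den, running_max = 0.0, 0.0, None
        for feat_r, label_r in zip(memory_feats, memory_labels):
            scores = torch.einsum('bcn,bcm->bnm', feat_t, feat_r) / temperature
            frame_max = scores.max(dim=-1, keepdim=True).values
            new_max = frame_max if running_max is None else torch.maximum(running_max, frame_max)
            scale = 1.0 if running_max is None else torch.exp(running_max - new_max)
            w = torch.exp(scores - new_max)                       # (B, N_target, N_ref)
            num = num * scale + torch.einsum('bnm,bmk->bnk', w, label_r)
            den = den * scale + w.sum(dim=-1, keepdim=True)
            running_max = new_max
        return num / den                                          # (B, N_target, K)

The result is identical to a single softmax over all reference pixels at once, so only one frame's affinity needs to live in GPU memory at a time.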

  • Then you can test the output with the official Python evaluation code.
    python evaluation_method.py \
                    --task semi-supervised \
                    --results_path log-path/benchmark \
                    --davis_path your-davis-path
    

Pretrained model

Google drive

Results

Comparison with other methods on DAVIS-2017
Results on YouTube-VOS and generalization ability
Video segmentation results on DAVIS-2017

mast's People

Contributors

zlai0


mast's Issues

Some details in the paper

Thank you for sharing such great work!
I have some questions about the implementation details of this paper.

  1. What is the 'soft and hard propagation' in Table 6? Does it refer to the propagation function used during inference? I am a little confused about the details of this table.
  2. Can you share the function you use to achieve image-feature alignment? Can I use an odd input size and align_corners=True to achieve the feature alignment?
  3. What is the result when the memory augmentation is removed?

I would be grateful if you could help me with these details. Looking forward to the code release.

Question about aggregating labels in your paper

Hi, thank you for sharing such an awesome project. I have some questions about the details of your paper.

In step 4 of Algorithm 1 (Section 3.3.1), you say "the output pixel's label is determined by aggregating the labels of the ROI pixels". My question is how you treat the ROIs from several frames.

Do you use an algorithm similar to STM's? For example, if the Query has size H * W * C (C is the number of channels), and the restricted-attention window over all T previous Keys has size P, do you compute an affinity matrix of size (HW * TPP)?
Or do you do the same thing as Cycle-Consistency, i.e. a K-NN strategy that averages all previous predictions?
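
For reference, a minimal sketch of the STM-style (HW x T*P*P) aggregation described above, written with F.unfold for clarity; this is not the repository's implementation, and aggregate_labels is a hypothetical name:

import torch
import torch.nn.functional as F

def aggregate_labels(query, keys, labels, P=9):
    # query: (B, C, H, W); keys: list of T tensors (B, C, H, W);
    # labels: list of T tensors (B, K, H, W). Each query pixel attends to a
    # PxP ROI in every memory frame, with one softmax over all T*P*P candidates.
    B, C, H, W = query.shape
    T, K = len(keys), labels[0].shape[1]
    q = query.permute(0, 2, 3, 1).reshape(B, H * W, 1, C)                   # (B, HW, 1, C)
    k = torch.stack([F.unfold(x, P, padding=P // 2) for x in keys], 1)      # (B, T, C*P*P, HW)
    v = torch.stack([F.unfold(x, P, padding=P // 2) for x in labels], 1)    # (B, T, K*P*P, HW)
    k = k.reshape(B, T, C, P * P, H * W).permute(0, 4, 1, 3, 2).reshape(B, H * W, T * P * P, C)
    v = v.reshape(B, T, K, P * P, H * W).permute(0, 4, 1, 3, 2).reshape(B, H * W, T * P * P, K)
    attn = F.softmax((q * k).sum(-1), dim=-1)                               # (B, HW, T*P*P)
    out = torch.einsum('bnm,bnmk->bnk', attn, v)                            # (B, HW, K)
    return out.permute(0, 2, 1).reshape(B, K, H, W)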

I would really appreciate it if you could address my concerns. Looking forward to your official code. Thank you!

Question about Lab channel dropout

MAST/models/mast.py

Lines 43 to 44 in a57b043

drop_ch_num = int(np.random.choice(np.arange(1, 2), 1))
drop_ch_ind = np.random.choice(np.arange(1,3), drop_ch_num, replace=False)

This is essentially

drop_ch_num = 1
drop_ch_ind = [np.random.choice([1, 2])]

In the paper, all three Lab channels are decorrelated. Is it on purpose that the L channel is preserved?
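
For comparison, a hypothetical variant matching the paper's wording ("one of the color channels is randomly dropped with probability p = 0.5"), in which any of the three Lab channels may be dropped, could look like this (not the repository's code):

import numpy as np

# With probability 0.5, drop one channel chosen uniformly from all three Lab
# channels (0 = L, 1 = a, 2 = b); otherwise keep all channels.
if np.random.rand() < 0.5:
    drop_ch_ind = [int(np.random.choice(3))]
else:
    drop_ch_ind = []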

Great work! Thanks in advance.

Issue with RGB to Lab color space conversion

Thanks for sharing the code. When I run the training code, I hit the following issue:
YTVOSTrainLoader.py", line 28, in lab_preprocess
image = cv2.cvtColor(image, cv2.COLOR_BGR2Lab)
cv2.error: OpenCV(4.3.0) /io/opencv/modules/imgproc/src/color.simd_helpers.hpp:92: error: (-2:Unspecified error) in function 'cv::impl::{anonymous}::CvtHelper<VScn, VDcn, VDepth, sizePolicy>::CvtHelper(cv::InputArray, cv::OutputArray, int) [with VScn = cv::impl::{anonymous}::Set<3, 4>; VDcn = cv::impl::{anonymous}::Set<3>; VDepth = cv::impl::{anonymous}::Set<0, 5>; cv::impl::{anonymous}::SizePolicy sizePolicy = cv::impl::::NONE; cv::InputArray = const cv::_InputArray&; cv::OutputArray = const cv::_OutputArray&]'

Invalid number of channels in input image:
'VScn::contains(scn)'
where
'scn' is 1

I think it may be caused by a grayscale image being converted to Lab color space. Can you suggest a solution?

test problem

Thanks for your great work! I have trained my model successfully, but when I try to test it with benchmark.py, I notice that a 22 GB GPU is needed, and I only have 2 GPUs with 12 GB of memory each. I tried setting the CUDA device, but it didn't work. Can you give me some advice? Looking forward to your reply!

test failed

Traceback (most recent call last):
File "benchmark.py", line 184, in
main()
File "benchmark.py", line 54, in main
test(TrainImgLoader, model, log)
File "benchmark.py", line 95, in test
anno_1 = annotations[i+1]
IndexError: list index out of range

Code for multi-frame tracker

Thank you for sharing such great work!
The code currently seems to consider only one pair of images (one reference frame and one target frame),
but the memory bank containing multiple past frames is a crucial component of MAST.
I would like to know how to implement this part of the code. Is it possible to do this by modifying ref_num in YouTubeVOSTrain.py?

Where is the code for the Image Feature Alignment?

Thanks for your amazing work. In the Implementation Details of your paper, you describe an approach for the Image Feature Alignment, which I would like to try. Could you please indicate the location of it in your code or give more details about it?

What batch size do you use for training with multiple reference frames?

Hello,
Thanks for sharing your excellent work. You have said that you pretrain the model with pairs of input frames, and then train the network with multiple reference frames.
I have two questions:
1. What batch size do you use when training with only short-term or only long-term memory to get the Table 5 "only short" result (J-mean 57.3 and F-mean 61.8)?
2. What batch size do you use when training with multiple reference frames?
Thank you very much!

In Table 5, how many reference frames are used in "only short"?

Hello,
Thanks for your excellent work. In Table 5, memory is divided into "only short", "only long", and "mixed". I want to know how many reference frames are used in "only short" to get J-mean 0.573 and F-mean 0.618.
Thank you very much! (^_^)

Test error after replacing the dataset

After changing the dataset, the test always reports the error "cannot reshape array of size 1 into shape (3)". How can this be solved?

Code release scheduled date?

Hi,

I'm really inspired by your work and would love to check out the code! It's been 4 months since your initial commit. When do you plan to open-source it?

Thank you so much!

raise the batch.exc_type(batch.exc_msg) error

Sorry to bother you. I tried to run the code you provided, but I get a raise batch.exc_type(batch.exc_msg) error after a few hundred batches. Note that the photos have been resized to [256, 256] and drop_last=True; I don't know what other factors could be related, so I'm not sure how to fix this.

How to process memories frame by frame

Sorry to bother you with another question. My coding experience is limited so far. I want to reduce memory usage by changing the code to process the memories frame by frame, but I don't know how. Could you please give me some hints? Thank you so much!

Where is the memory module in the training stage?

Excellent work!

I confirm that the test results with the provided pre-trained model are the same as the released ones. However, with the given code, I cannot train a model with similar performance on the YouTube-VOS dataset on my own. My model only reaches Js=0.405 and Fs=0.481 after 30 epochs of training. What could be the problem?

When I check the training code in main.py (line 184), I find that it is only trained on pairwise data. Could you please release the code for the long- and short-term memory described in the paper? Is this the reason I cannot achieve a higher score?

wrong usage of offset0

Hi,

I could see that the variable offset0 is used in this line:

im_col0 = [deform_im2col(qr[i], offset0, kernel_size=self.P) for i in range(nsearch)]# b,3*N,h*w

However, it only holds the value from the last iteration of the for loop over nsearch, i.e. it does not contain the correct offset for each searched image.
Here:

MAST/models/colorizer.py

Lines 70 to 80 in a57b043

for searching_index in range(nsearch):
    ##### GET OFFSET HERE. (b,h,w,2)
    samplerindex = dirates[searching_index]-2
    coarse_search_correlation = self.correlation_sampler_dilated[samplerindex](feats_t, feats_r[searching_index])  # b, p, p, h, w
    coarse_search_correlation = coarse_search_correlation.reshape(b, self.memory_patch_N, h*w)
    coarse_search_correlation = F.softmax(coarse_search_correlation, dim=1)
    coarse_search_correlation = coarse_search_correlation.reshape(b,self.memory_patch_P,self.memory_patch_P,h,w,1)
    _y, _x = torch.meshgrid(torch.arange(-self.memory_patch_R,self.memory_patch_R+1),torch.arange(-self.memory_patch_R,self.memory_patch_R+1))
    grid = torch.stack([_x, _y], dim=-1).unsqueeze(-2).unsqueeze(-2)\
        .reshape(1,self.memory_patch_P,self.memory_patch_P,1,1,2).contiguous().float().to(coarse_search_correlation.device)
    offset0 = (coarse_search_correlation * grid).sum(1).sum(1) * dirates[searching_index]  # 1,h,w,2

Ideally, im_col0 could have been collected inside the same for loop, for every image in nsearch, to reflect the correct offset for each image.
Could you please clarify why this is not done?
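
For reference, one possible shape for that change (a sketch based on the snippet above, not a tested patch) is to build im_col0 inside the same loop, so that each searched frame uses the offset computed from its own coarse correlation:

im_col0 = []
for searching_index in range(nsearch):
    # ... existing code computing coarse_search_correlation and grid ...
    offset0 = (coarse_search_correlation * grid).sum(1).sum(1) * dirates[searching_index]  # b,h,w,2
    # collect the deformed columns here, so each searched frame uses its own offset
    im_col0.append(deform_im2col(qr[searching_index], offset0, kernel_size=self.P))  # b,3*N,h*w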

Thanks!

SpatialCorrelationSampler

Hello, thank you for your excellent work.
I have a question about SpatialCorrelationSampler: does it calculate the similarity or sample the neighbourhood?
Thank you very much!

Loss Function and training the model using YouTube-VOS.

Hi @zlai0, thanks for the nice work and for sharing the code. I have a couple of questions and would appreciate your help.

  1. What is the reason for multiplying the output and target by 20 before computing the loss (line 187 in main.py)? I could not find it anywhere in the paper, and I wonder if this could interfere with the learning rate (a small numeric check is sketched after this list).
  2. For training the model on YouTube-VOS, I used ytvos.csv. The training command is for the Kinetics dataset (it trains for 20 epochs by default), and I was wondering what should be changed in order to train the model from scratch on YouTube-VOS. Do I use only the frame pairs in ytvos.csv and pass the directory of the YouTube-VOS dataset as an argument?
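
For what it's worth, a quick standalone check of the scaling behaviour mentioned in question 1 (not code from this repository):

import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 32, 32)
y = torch.randn(2, 3, 32, 32)

# For a pure L1 loss, multiplying both tensors by 20 simply multiplies the loss
# (and its gradients) by 20, which behaves like a 20x larger learning rate.
print(F.l1_loss(20 * x, 20 * y) / F.l1_loss(x, y))            # ~20.0

# For a Huber / smooth-L1 loss, the scaling also changes which residuals fall in
# the quadratic region, so it is not just a learning-rate rescale.
print(F.smooth_l1_loss(20 * x, 20 * y) / F.smooth_l1_loss(x, y))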

Thank you in advance!

Question about offset0

Hi, and thanks for sharing the code. I ran into an issue with the offset values for frames in the memory bank: the same offset (offset0) seems to be used for all frames (see here).
Shouldn't a specific offset be used for each frame in the memory instead? Thank you in advance.

YT-VOS 2018 or 2019?

Hi,
May I ask which version you used for training (2018 or 2019)?
And did you use train_all_frames.zip or train.zip?
Thanks.

Reproducibility basics?

Hi Zihang,
I'm trying to reproduce your results; however, I'm confused about what to change in the training script for the two phases of training described in the paper.

Step 1:
"During training, we first pretrain the network with a pair of
input frames, i.e. one reference frame and one target frame
are fed as inputs. One of the color channels is randomly
dropped with probability p = 0:5. We train our model endto-
end using a batch size of 24 for 1M iterations with the
Adam optimizer. The initial learning rate is set to 1e-3,
and halved after 0.4M, 0.6M and 0.8M iterations."

Step 2:
"We then
finetune the model using multiple reference frames (our full
memory-augmented model) with a small learning rate of 2e-
5 for another 1M iterations"
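
For reference, the quoted schedule maps roughly onto a standard PyTorch setup like the following (a sketch only; model and pairwise_loader are placeholders, not names from this repository):

import torch

# Phase 1: pairwise pretraining -- Adam, lr 1e-3 halved at 0.4M/0.6M/0.8M
# iterations, batch size 24, 1M iterations in total.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[400_000, 600_000, 800_000], gamma=0.5)

for it, (frame_ref, frame_tgt) in enumerate(pairwise_loader):   # placeholder loader
    loss = model(frame_ref, frame_tgt)                          # placeholder forward that returns the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                            # stepped per iteration, not per epoch
    if it + 1 >= 1_000_000:
        break

# Phase 2: finetune the full memory-augmented model (multiple reference frames)
# with a small constant learning rate of 2e-5 for another 1M iterations.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)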

Could you provide similar commands in scripts/train for this purpose?
For example, do you use 2 GPUs with bsize=12, or 1 GPU with bsize=24?
And what needs to change to enable the full memory-augmented model in main.py for step 2?

If anyone else has figured this out, please let me know.
Thanks,

Inference on YouTube-VOS dataset

Hi,

May I know how you test on YouTube-VOS, since some new objects are added in intermediate frames in that dataset?
In that case, I am confused about how to handle objects added in intermediate frames, e.g. how to decide memory usage, how to merge segmentation results, etc.

Thanks.
