
aot's Introduction

AOT: Associating Objects with Transformers for Video Object Segmentation


This project tracks news related to the AOT series of frameworks:

  • DeAOT: Decoupling Features in Hierarchical Propagation for Video Object Segmentation (NeurIPS 2022, Spotlight) [OpenReview][PDF]
  • AOST: Scalable Video Object Segmentation with Identification Mechanism (TPAMI 2024) [PDF]
  • AOT: Associating Objects with Transformers for Video Object Segmentation (NeurIPS 2021, Score 8/8/7/8) [OpenReview][PDF]

Implementations and Results

The implementations of AOT series frameworks can be found below:

  1. PyTorch

    • AOT-Benchmark from ZJU, which now supports both AOT and DeAOT. Thanks for such an excellent implementation.
  2. PaddlePaddle

    We are preparing an official PaddlePaddle implementation.

News

  • 2024/03: AOST - AOST, the journal extension of AOT, has been accepted by TPAMI. AOST is the first scalable VOS framework supporting run-time speed-accuracy trade-offs, from real-time efficiency to SOTA performance.

  • 2023/07: WINNER - DeAOT-based Tracker ranked 1st in the VOTS 2023 challenge (leaderboard). In detail, our DMAOT improves DeAOT by storing object-wise long-term memories instead of frame-wise long-term memories. This avoids the memory growth problem when processing long video sequences and produces better results when handling multiple objects.

  • 2023/06: WINNER - DeAOT-based Tracker ranked 1st in two tracks of EPIC-Kitchens challenges (leaderboard). In detail, our MS-DeAOT is a multi-scale version of DeAOT and is the winner of Semi-Supervised Video Object Segmentation (segmentation-based tracking) and TREK-150 Object Tracking (BBox-based tracking). Technical reports are coming soon.

  • 2023/04: We are pleased to announce the release of our latest project, Segment and Track Anything (SAM-Track). This innovative project merges two kinds of models, SAM and our DeAOT, to achieve seamless segmentation and efficient tracking of any objects in videos.

  • 2022/10: WINNER - AOT-based Tracker ranked 1st in four tracks of the VOT 2022 challenge (presentation of results). In detail, our MS-AOT is the winner of two segmentation tracks, VOT-STs2022 and VOT-RTs2022 (real-time). In addition, the bounding box results of MS-AOT (initialized by AlphaRef, with the output bounding box fitted to the mask prediction) surpass the winners of the two bounding box tracks, VOT-STb2022 and VOT-RTb2022 (real-time). The bounding box results were requested by the organizers after the competition deadline but were highlighted in the workshop presentation (ECCV 2022).

  • 2022/10: An improved version of AOT, DeAOT (Decoupling Features in Hierarchical Propagation for Video Object Segmentation), has been accepted by NeurIPS 2022 (Spotlight). DeAOT achieves state-of-the-art accuracy and efficiency on VOS/VOT benchmarks, including YouTube-VOS 2018/2019, DAVIS 2016/2017, and VOT 2020.

  • 2022/03: An extension of AOT, AOST (under review), is now available. AOST is a more robust and flexible framework, supporting run-time speed-accuracy trade-offs.

  • 2021/10: The conference paper has been accepted by NeurIPS 2021 (score 8/8/7/8, OpenReview).

  • 2021/05: WINNER - We ranked 1st in Track 1 (Video Object Segmentation) of the 3rd Large-scale Video Object Segmentation Challenge.

About DeAOT


Although AOT successfully introduced hierarchical propagation into VOS, a limitation remains: hierarchical propagation gradually transfers information from past frames to the current frame, turning the current-frame feature from object-agnostic into object-specific, so the increase of object-specific information inevitably causes a loss of object-agnostic visual information in the deep propagation layers. To solve this problem and further facilitate the learning of visual embeddings, we propose DeAOT (Decoupling Features in Hierarchical Propagation), which decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Besides, to compensate for the additional computation of dual-branch propagation, we propose an efficient module for constructing hierarchical propagation: the Gated Propagation Module.
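
For illustration, below is a minimal, hypothetical PyTorch sketch of the decoupling idea (not the official Gated Propagation Module): attention weights are computed only from the object-agnostic visual embeddings, and the same attention map propagates both branches, so the two kinds of embeddings stay decoupled while sharing the matching cost.

import torch
import torch.nn as nn

class DecoupledPropagation(nn.Module):
    # Hypothetical single-head sketch: matching uses only the visual branch;
    # the shared attention map then propagates both visual (object-agnostic)
    # and ID (object-specific) memories through independent value projections.
    def __init__(self, dim=256):
        super().__init__()
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_vis, self.v_id = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, cur_vis, mem_vis, mem_id):
        # cur_vis: (B, L, C) current-frame visual tokens
        # mem_vis, mem_id: (B, S, C) memorized tokens of the two branches
        attn = torch.softmax(
            self.q(cur_vis) @ self.k(mem_vis).transpose(1, 2) * self.scale,
            dim=-1)
        out_vis = attn @ self.v_vis(mem_vis)  # object-agnostic propagation
        out_id = attn @ self.v_id(mem_id)     # object-specific propagation
        return out_vis, out_id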

About AOST

Previous methods typically have static network architectures, which are not flexible enough to adapt to different speed-accuracy requirements. To solve this problem, we propose Associating Objects with Scalable Transformers (AOST), which matches and segments multiple objects collaboratively with online network scalability. In AOST, a Scalable Long Short-Term Transformer (S-LSTT) is designed to construct hierarchical multi-object associations and enable online adaptation of accuracy-efficiency trade-offs. By further introducing scalable supervision and layer-wise ID-based attention, AOST is not only more flexible but also more robust than previous methods.
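
As a rough illustration of run-time scalability (hypothetical, not the official S-LSTT implementation), the block stack can be built once at maximum depth and any prefix of it evaluated at inference, since scalable supervision trains every prefix to produce a valid prediction:

import torch
import torch.nn as nn

class ScalableStack(nn.Module):
    # Hypothetical sketch: running only the first num_blocks blocks trades
    # accuracy for speed at inference time, without retraining.
    def __init__(self, make_block, max_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(make_block() for _ in range(max_blocks))

    def forward(self, x, num_blocks=None):
        for blk in self.blocks[:num_blocks]:  # num_blocks=None runs all blocks
            x = blk(x)
        return x

# Generic transformer layers stand in for S-LSTT blocks here:
model = ScalableStack(
    lambda: nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True))
tokens = torch.randn(1, 1024, 256)   # (B, L, C) flattened frame tokens
fast = model(tokens, num_blocks=1)   # real-time setting
accurate = model(tokens)             # full depth, best accuracy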

About AOT


In AOT, we propose an identification mechanism that enables us to model, propagate, and segment multiple objects as efficiently as processing a single object. Built on the identification mechanism, the AOT framework is lightweight (fewer than 10M parameters by default) yet powerful (achieving SOTA performance). Besides, we propose the Long Short-Term Transformer (LSTT) to propagate temporal information hierarchically, and performance and efficiency can be conveniently balanced by adding or removing LSTT blocks.
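
A minimal sketch of how an identification embedding can encode all objects of a multi-object mask into one feature map (illustrative; the identity-bank size and tensor shapes are assumptions, not the repository's exact interface):

import torch
import torch.nn as nn

# Each object ID indexes a learnable identity vector, so a multi-object mask
# is embedded into a single feature map and all objects are propagated in one
# forward pass, as efficiently as a single object.
max_objects, dim = 10, 256
id_bank = nn.Embedding(max_objects + 1, dim)   # index 0 reserved for background

masks = torch.randint(0, 4, (1, 30, 30))       # (B, H, W) per-pixel object IDs
id_embedding = id_bank(masks).permute(0, 3, 1, 2)  # (B, C, H, W) feature map
print(id_embedding.shape)                      # torch.Size([1, 256, 30, 30])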

Citations

Please consider citing the related paper(s) in your publications if it helps your research.

@inproceedings{yang2022deaot,
  title={Decoupling Features in Hierarchical Propagation for Video Object Segmentation},
  author={Yang, Zongxin and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@article{yang2021aost,
  title={Scalable Video Object Segmentation with Identification Mechanism},
  author={Yang, Zongxin and Miao, Jiaxu and Wei, Yunchao and Wang, Wenguan and Wang, Xiaohan and Yang, Yi},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
  year={2024}
}
@inproceedings{yang2021aot,
  title={Associating Objects with Transformers for Video Object Segmentation},
  author={Yang, Zongxin and Wei, Yunchao and Yang, Yi},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

aot's People

Contributors

z-x-yang


aot's Issues

About the X^m_l in Long-Term Attention

Does X^m_l in Long-Term Attention (red box in the picture) refer to the features output by the backbone, or is it obtained through Self-Attention like X^t_l (blue box in the picture)?


Example video of results

Hello,

Is it possible to provide an example video or two which demonstrate the results of the method?

Thanks,
Chris

About the input dimensions of Long-Term Attention


How do you deal with variable input dimensions, since the input dimensions of torch.nn.MultiheadAttention seem to be fixed? Or do I have some misunderstanding about the use of torch.nn.MultiheadAttention?
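
For context, a minimal sketch (not from this repository) showing that torch.nn.MultiheadAttention fixes only the embedding dimension; the number of tokens, e.g. flattened H x W spatial positions, may vary between calls:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)  # fixes C, not length

for h, w in ((30, 54), (24, 24)):          # two different spatial resolutions
    tokens = torch.randn(h * w, 1, 256)    # (seq_len, batch, embed_dim)
    out, _ = attn(tokens, tokens, tokens)  # self-attention over the tokens
    print(out.shape)                       # seq_len follows the input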

Finetuning on my own dataset

Hi Zongxin,

Congrats on a series of great works, as well as the Segment and Track Anything project!

Now I am working on testing the Track Anything project on a multi-view microscopic dataset where severe occlusions are likely. My target is tracking many small "black circles", only about 20 x 20 pixels in a 1024 x 1024 image, with possibly hundreds in a single frame.

I was pleasantly surprised that, after some effort, your model gave us pretty good results, though it still sometimes wrongly tracks multiple circles as the same object, and the masks can be somewhat rough.

I guess some additional finetuning work awaits us, though I am not an expert in this area. My current efforts are mostly exploring recent SOTA works and preparing datasets for them. Unfortunately, I can't find a way to optimize the tracking part of your model (I have already finetuned SAM on our task, so every frame produces a pretty good mask, but I cannot track the masks correctly).

Could you please give me some advice? Is there any plan to release code for finetuning on customized datasets? I am very happy to help test and discuss if you are interested in how well your model can work with medical and microscopic data such as CT, MRI, Cryo-EM, or Cryo-ET after finetuning.

Thanks in advance!

Multi-view object segmentation

Hi,
I haven't explored your algorithm yet but I was wondering if it could be useful to me.
I wanted to know if this solution can also be used in a context where there are several images of the same object from multiple angles (not necessarily in order), rather than a video of it.

Is there any theoretical basis that motivates this, or have any experiments been done?

Thanks
Gianluca

The performance gains from image datasets (COCO, VOC, etc.)

Hi,

I have read some of your great papers for VOS.

I noticed that models are usually first pre-trained on image datasets (COCO, VOC, etc.) and then trained on video datasets (DAVIS, YTB). May I know about any experience or results regarding the performance gap between training with and without the image datasets?

In addition, is the MobileNetV2 in this work pre-trained on ImageNet or trained from scratch?

Many thanks in advance.

Question regarding Table 1

Hi, love your AOT/AOST. Excellent scalability. The open-source code is another big plus.

I was wondering what counts as "all frames" in Table 1 of AOST? Most papers don't mention whether they use 6 FPS or 30 FPS (and they also don't release their code!). The authors of STM did say they use all the frames: seoungwugoh/STM#3 (comment). HMMN uses a similar evaluation structure, so that is likely also the case; KMN has no code, but its baseline score matches STM, so it likely also uses the 30 FPS version. Am I missing something here?

Cheers.

The sampling strategy during training

Hi Zongxin,

May I know more details about the training strategies?

  • In this paper, the sequence length is 5 during training. Does that mean 1 frame as the reference (long-term), 1 frame as the previous frame (short-term), and 3 frames as current frames to predict in a sequential manner?
  • Is the sampling of frame indexes in sequences the same as in CFBI? If not, may I have some details?

Thank you.

Identification Embedding

Hi,
Do you treat the background as an object when embedding identities, or do you aggregate the background after decoding?

Thanks.

About training details

Hi, Zongxin,

Thank you for the nice work. I have some questions about the training details.

  1. How many models do you use in Table 1? Do the DAVIS valid/test splits share one model and YouTube-VOS 2018/2019 share another?
  2. You say "For main training, the training steps are 100,000 for YouTube-VOS or 50,000 for DAVIS"; what is the total number of training iterations? I assume "training steps" refers to an intermediate milestone for adjusting the learning rate.
