Slot Attention for Video (SAVi and SAVi++)

This repository contains the code release for "Conditional Object-Centric Learning from Video" (ICLR 2022) and "SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos" (NeurIPS 2022).

Papers: https://arxiv.org/abs/2111.12594 (SAVi), https://arxiv.org/abs/2206.07764 (SAVi++)

Project websites: https://slot-attention-video.github.io/ (SAVi), https://slot-attention-video.github.io/savi++/ (SAVi++)

Instructions

ℹ️ The following instructions assume that you are using JAX on GPUs and have CUDA and cuDNN installed. For more details on how to use JAX with accelerators, including requirements and TPU support, please read the JAX installation instructions.
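
As a quick sanity check before training, a sketch like the following can report which backend JAX picked up. It relies only on `jax.devices()` and `jax.default_backend()`; the helper name `describe_jax_backend` is hypothetical and not part of this repository.

```python
def describe_jax_backend():
    """Return a one-line summary of the backend and devices JAX can see."""
    try:
        import jax  # imported lazily so the check also runs where JAX is missing
    except ImportError:
        return "jax is not installed"
    platforms = [device.platform for device in jax.devices()]
    return f"backend={jax.default_backend()}, devices={platforms}"


if __name__ == "__main__":
    print(describe_jax_backend())
```

If the reported backend is `cpu` on a GPU machine, the installed JAX build most likely lacks CUDA support.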

Get dependencies and run model training via

./run.sh

Or use

pip3 install -r requirements.txt

to install dependencies and

python -m savi.main --config savi/configs/movi/savi_conditional_small.py --workdir tmp/

to train the smallest SAVi model (SAVi-S) on the MOVi-A dataset.

Alternatively, use

python -m savi.main --config savi/configs/movi/savi++_conditional.py --workdir tmp/

to train the more capable SAVi++ model on the MOVi-E dataset.

The MOVi datasets are stored in a Google Cloud Storage (GCS) bucket and can be downloaded to local disk prior to training for improved efficiency.

To use a local copy of MOVi-A, for example, copy the relevant folder to your local disk and set data_dir in the config file (e.g., savi/configs/movi/savi_conditional_small.py) to point to it. First, copy the data using commands such as

gsutil -m cp -r gs://kubric-public/tfds/movi_a/128x128/1.0.0 ./movi_a_128x128/
mkdir movi_a
mv movi_a_128x128/ movi_a/128x128/

The resulting directory structure will be as follows:

.
|-- movi_a
|   `-- 128x128
|       `-- 1.0.0
|-- savi
|   |-- configs
|   |   `-- movi
|   |-- lib
|   |-- modules

To use the local copy, set data_dir = "./" in the config file savi/configs/movi/savi_conditional_small.py. You can also copy the dataset to a different location and set data_dir accordingly.
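
Before launching training, it can help to confirm that data_dir really contains the layout shown above. The following is a minimal sketch using only the standard library; the helper name `find_tfds_versions` and its default arguments are hypothetical and not part of this repository.

```python
from pathlib import Path


def find_tfds_versions(data_dir, dataset="movi_a", resolution="128x128"):
    """Return the version folders found under <data_dir>/<dataset>/<resolution>."""
    root = Path(data_dir) / dataset / resolution
    if not root.is_dir():
        return []
    # Each sub-folder (e.g. "1.0.0") is treated as one dataset version.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    versions = find_tfds_versions("./")
    if versions:
        print("Found local MOVi-A versions:", versions)
    else:
        print("No local copy found under ./movi_a/128x128")
```

An empty result usually means that data_dir points one level too deep or too shallow relative to the movi_a folder.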

To run SAVi or SAVi++ on other MOVi dataset variants, follow the instructions above, replacing movi_a with, for example, movi_b or movi_c.
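
The substitution can be sketched as a small helper that prints the copy commands for a given variant. It assumes that the other variants follow the same bucket layout, resolution folder, and version as the MOVi-A example above, which you should verify before copying; the function name `movi_copy_commands` is hypothetical.

```python
import shlex

GCS_ROOT = "gs://kubric-public/tfds"  # bucket path taken from the MOVi-A example


def movi_copy_commands(variant, resolution="128x128", version="1.0.0"):
    """Build the three shell commands from the MOVi-A example for another variant.

    Assumption: every variant uses the same folder layout and version as MOVi-A.
    """
    if variant not in ("a", "b", "c", "d", "e"):
        raise ValueError(f"unknown MOVi variant: {variant!r}")
    name = f"movi_{variant}"
    staging = f"{name}_{resolution}"
    source = f"{GCS_ROOT}/{name}/{resolution}/{version}"
    return [
        ["gsutil", "-m", "cp", "-r", source, f"./{staging}/"],
        ["mkdir", name],
        ["mv", f"{staging}/", f"{name}/{resolution}/"],
    ]


if __name__ == "__main__":
    for command in movi_copy_commands("b"):
        print(shlex.join(command))
```

The commands are only printed, not executed, so they can be reviewed before running them in a shell.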

Expected results

This repository contains the SAVi model configurations from our ICLR 2022 paper. We refer to these models here as SAVi-S, SAVi-M, and SAVi-L. SAVi-S is trained and evaluated on downscaled 64x64 frames, whereas SAVi-M uses 128x128 frames and a larger CNN backbone. SAVi-L is similar to SAVi-M except that it uses a larger ResNet34 encoder and a larger slot embedding.

This repository also contains the SAVi++ model configurations from our NeurIPS 2022 paper. SAVi++ uses a more powerful encoder than SAVi-L, adding transformer blocks to the ResNet34. SAVi++ also adds data augmentation and training on depth targets. As a result, SAVi++ is better able to handle real-world videos with additional complexity such as camera movement and complex object shapes and textures.

The MOVi datasets released as part of Kubric differ slightly from the ones used in our ICLR 2022 paper and are of slightly higher complexity (e.g., more variation in backgrounds); results are therefore not directly comparable. MOVi-A is approximately comparable to the "MOVi" dataset used in our ICLR 2022 paper, whereas MOVi-C is approximately comparable to "MOVi++". We provide updated results for our released configs on the MOVi datasets (version 1.0.0) below.

Model    MOVi-A       MOVi-B        MOVi-C       MOVi-D       MOVi-E
SAVi-S   92.1 ± 0.1   72.2 ± 0.5    64.7 ± 0.3   33.8 ± 7.7    8.3 ± 0.9
SAVi-M   93.4 ± 1.0   75.1 ± 0.5    67.4 ± 0.5   20.8 ± 2.2   12.2 ± 1.1
SAVi-L   95.1 ± 0.6   64.8 ± 8.9    71.3 ± 1.6   59.7 ± 6.0   34.1 ± 1.2
SAVi++   85.3 ± 9.8   72.5 ± 11.2   79.1 ± 2.1   84.8 ± 1.4   85.1 ± 0.9

All results are reported as FG-ARI (in %) on the validation splits, as mean ± standard error over 5 seeds. All SAVi and SAVi++ models reported above use the bounding boxes of the first video frame as the conditioning signal. For simplicity, we evaluate FG-ARI on all frames of the video (including the first frame), which differs from the setup described in our ICLR 2022 paper.
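
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of FG-ARI (the adjusted Rand index restricted to foreground pixels) and of the mean ± standard error aggregation. It is not the repository's implementation, which may differ in details. The function names, the choice of 0 as the background id, and the handling of degenerate cases are assumptions made here, and the inputs are assumed to be flat per-pixel label lists (for instance, flattened over all frames of a video).

```python
from collections import Counter
from math import comb, sqrt
from statistics import mean, stdev


def fg_ari(true_ids, pred_ids, background_id=0):
    """Adjusted Rand index over pixels whose true label is not background."""
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != background_id]
    n = len(pairs)
    if n < 2:
        return 1.0  # assumed convention: too few foreground pixels to disagree
    joint = Counter(pairs)
    true_sizes = Counter(t for t, _ in pairs)
    pred_sizes = Counter(p for _, p in pairs)
    # Count pixel pairs that share a segment in both, in truth, and in prediction.
    index = sum(comb(c, 2) for c in joint.values())
    true_pairs = sum(comb(c, 2) for c in true_sizes.values())
    pred_pairs = sum(comb(c, 2) for c in pred_sizes.values())
    expected = true_pairs * pred_pairs / comb(n, 2)
    max_index = 0.5 * (true_pairs + pred_pairs)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (index - expected) / (max_index - expected)


def mean_and_stderr(values):
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    print(fg_ari(truth, [5, 5, 7, 7, 9, 9]))  # objects separated correctly
    print(fg_ari(truth, [3, 3, 4, 4, 4, 4]))  # two objects merged into one slot
    # Made-up per-seed scores for illustration only, not results from the table.
    print(mean_and_stderr([1.0, 2.0, 3.0]))
```

Because background pixels are dropped before counting pairs, a model is not penalized for how it segments the background, only for how it groups foreground objects.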

Cite

@inproceedings{kipf2022conditional,
    author = {Kipf, Thomas and Elsayed, Gamaleldin F. and Mahendran, Aravindh
              and Stone, Austin and Sabour, Sara and Heigold, Georg
              and Jonschkowski, Rico and Dosovitskiy, Alexey and Greff, Klaus},
    title = {{Conditional Object-Centric Learning from Video}},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year  = {2022}
}
@inproceedings{elsayed2022savi++,
    author = {Elsayed, Gamaleldin F. and Mahendran, Aravindh
              and van Steenkiste, Sjoerd and Greff, Klaus and Mozer, Michael C.
              and Kipf, Thomas},
    title = {{SAVi++: Towards end-to-end object-centric learning from real-world videos}},
    booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
    year  = {2022}
}

Disclaimer

This is not an official Google product.
