
👀 SEEM: Segment Everything Everywhere All at Once

We introduce SEEM, a model that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types, including visual prompts (points, marks, boxes, scribbles, and image segments) and language prompts (text and audio). It can also work with any combination of prompts and can generalize to custom prompts!

πŸ‡[Read our arXiv Paper] Β  🍎[Try Hugging Face Demo]

🚀 Updates

🔥 Related projects:

  • X-Decoder : Generic decoder that can do multiple tasks with one model only; we built SEEM based on X-Decoder.
  • FocalNet : Focal Modulation Networks; we used FocalNet as the vision backbone.
  • UniCL : Unified Contrastive Learning; we used this technique for image-text contrastive learning.

🔥 Other projects you may find interesting:

  • OpenSeed : Strong open-set segmentation methods.
  • Grounding SAM : Combining Grounding DINO and Segment Anything.
  • LLaVA : Large Language and Vision Assistant.

💡 Highlights

Inspired by the appealing universal interface of LLMs, we advocate a universal, interactive multi-modal interface for any type of segmentation with ONE SINGLE MODEL. We emphasize four important features of SEEM below.

  1. Versatility: works with various types of prompts, for example, clicks, boxes, polygons, scribbles, texts, and referring images;
  2. Compositionality: deals with any composition of prompts;
  3. Interactivity: interacts with the user over multiple rounds, thanks to SEEM's memory prompt that stores the session history;
  4. Semantic awareness: gives a semantic label to any predicted mask.
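The composition and memory ideas above can be sketched in a few lines of Python. Everything here is illustrative pseudostructure — the class and method names are hypothetical, not SEEM's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical prompt types, named for illustration only.
@dataclass
class PointPrompt:
    x: float
    y: float

@dataclass
class TextPrompt:
    text: str

Prompt = Union[PointPrompt, TextPrompt]

@dataclass
class Session:
    """Multi-round interaction: earlier prompts are kept as 'memory'."""
    memory: List[Prompt] = field(default_factory=list)

    def segment(self, *prompts: Prompt) -> List[Prompt]:
        # Compose the new prompts with the session history before decoding.
        composed = self.memory + list(prompts)
        self.memory = composed   # store for the next round
        return composed          # a real model would decode masks here

session = Session()
session.segment(PointPrompt(120, 80))          # round 1: a click
composed = session.segment(TextPrompt("zebra"))  # round 2: add text
assert len(composed) == 2  # the click from round 1 is remembered
```

The point is that any mix of prompt types flows through the same interface, and the session history is just another prompt fed back in.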

SEEM design. A brief introduction to all the generic and interactive segmentation tasks we can do.

🦄 How to use the demo

  • Try our default examples first;
  • Upload an image;
  • Select at least one type of prompt of your choice (if you want to use a referred region of another image, please check "Example" and upload another image in the referring image panel);
  • Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
  • By default, our model supports the vocabulary of the 80 COCO categories; other objects will be classified as 'others' or misclassified. If you want to segment using open-vocabulary labels, include the text label via the 'text' button after drawing scribbles.
  • Click "Submit" and wait for a few seconds.
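As a toy illustration of the rule above — every selected prompt type must come with an actual prompt — a validator might look like this (function and field names are hypothetical, not the demo's real code):

```python
# Toy check mirroring the demo rule: every selected prompt type must
# come with an actual payload, or the request is rejected.
def validate_request(selected_types, provided):
    """selected_types: prompt-type names the user ticked.
    provided: dict mapping prompt-type name -> payload (or None)."""
    missing = [t for t in selected_types if not provided.get(t)]
    if missing:
        raise ValueError(f"No prompt given for selected type(s): {missing}")
    return True

assert validate_request(["stroke", "text"],
                        {"stroke": [(10, 20), (12, 22)], "text": "zebra"})
try:
    validate_request(["referring image"], {"referring image": None})
except ValueError as e:
    print(e)  # -> No prompt given for selected type(s): ['referring image']
```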

🌋 An interesting example

An example with Transformers. The referred image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images, no matter which form it is in. Thanks to Hongyang Li for this fun example.

assets/transformers_gh.png

πŸ•οΈ Click, scribble to mask

With a simple click or stroke from the user, we can generate masks and the corresponding category labels for them.
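A minimal sketch of what "click to mask plus label" means, using a hand-made label map in place of real model output:

```python
# Toy illustration: a click picks out the segment under the cursor and its
# semantic label. The label map below stands in for real model predictions.
label_map = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
]
categories = {0: "background", 1: "tree", 2: "dog"}

def click_to_mask(label_map, categories, row, col):
    seg_id = label_map[row][col]
    mask = [[int(v == seg_id) for v in line] for line in label_map]
    return mask, categories[seg_id]

mask, label = click_to_mask(label_map, categories, row=1, col=2)
assert label == "dog"          # semantic awareness: the mask comes labeled
assert mask[1][1] == 1 and mask[0][0] == 0
```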

SEEM design

πŸ”οΈ Text to mask

SEEM can generate a mask from the user's text input, providing multi-modal interaction with humans.
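Conceptually, text-to-mask grounding can be thought of as scoring candidate mask embeddings against a text embedding and keeping the best match. The sketch below uses made-up embeddings; a real model learns them with its encoders:

```python
import math

# Toy grounding: pick the mask whose embedding is most similar to the text.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

mask_embeddings = {"mask_A": [0.9, 0.1, 0.0], "mask_B": [0.1, 0.9, 0.2]}
text_embedding = [0.0, 1.0, 0.1]  # pretend this encodes "the black cat"

best = max(mask_embeddings,
           key=lambda m: cosine(mask_embeddings[m], text_embedding))
assert best == "mask_B"
```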

example

🕌 Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment the objects with similar semantics on the target images. example

SEEM understands spatial relationships very well. Look at the three zebras! The segmented zebras have positions similar to those of the referred zebras. For example, when the leftmost zebra is referred to on the upper row, the leftmost zebra on the bottom row is segmented. example
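The position-aware matching described above can be illustrated by picking the target mask whose normalized centroid is closest to the referred one. The coordinates below are invented for the zebra example:

```python
# Toy version of the spatial matching: match the referred zebra to the
# target zebra with the closest normalized (y, x) centroid.
referred = (0.5, 0.15)  # centroid of the referred (leftmost) zebra
target_candidates = {
    "left":   (0.6, 0.10),
    "middle": (0.55, 0.50),
    "right":  (0.5, 0.90),
}

def closest(ref, candidates):
    return min(candidates,
               key=lambda k: (candidates[k][0] - ref[0]) ** 2
                           + (candidates[k][1] - ref[1]) ** 2)

assert closest(referred, target_candidates) == "left"
```

A real model matches learned visual features rather than raw centroids, but the positional preference it exhibits behaves like this.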

🌼 Referring image to video mask

No training on video data is needed; SEEM works perfectly for segmenting videos with whatever queries you specify! example

🌻 Audio to mask

We use Whisper to turn audio into a text prompt to segment the object. Try it in our demo!
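The audio path is just a pipeline: speech recognition first, then the transcript is used as a text prompt. The sketch below stubs out both stages (the real demo uses Whisper for the first stage, e.g. `whisper.load_model(...).transcribe(path)["text"]`; the segmentation stub is entirely hypothetical):

```python
# Pipeline sketch: audio -> text (speech recognition) -> text-prompted mask.
def transcribe(audio_path: str) -> str:
    # Stub: pretend the recording says "segment the river".
    # A real system would run Whisper here.
    return "segment the river"

def text_to_mask(text: str):
    # Stub segmentation: pull out the noun we would ground, plus a dummy mask.
    target = text.replace("segment the ", "")
    return {"label": target, "mask": [[1]]}

result = text_to_mask(transcribe("clip.wav"))
assert result["label"] == "river"
```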

assets/audio.png

🌳 Examples of different styles

An example of segmenting a meme.

assets/emoj.png

An example of segmenting trees in cartoon style.

assets/trees_text.png

An example of segmenting a Minecraft image.

assets/minecraft.png
An example of using a referring image on a popular teddy bear.

example

Model

SEEM design

Comparison with SAM

In the following figure, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set, and interactive segmentation). Open-set segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, and misses high-semantic tasks since it does not output semantic labels itself. The reasons are twofold. First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. As a result, SEEM can support more general usage and has the potential to extend to custom prompts. Second, SEEM works very well on text-to-mask (grounding segmentation) and outputs semantic-aware predictions.
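The "joint representation space" idea can be sketched with per-modality projections into one shared query space. Dimensions and the encoders below are illustrative, not SEEM's actual architecture:

```python
import numpy as np

# Each prompt modality has its own small encoder, but all project into the
# same joint space so prompts can be mixed freely in one query set.
rng = np.random.default_rng(0)
JOINT_DIM = 8

proj = {
    "point": rng.normal(size=(2, JOINT_DIM)),   # (x, y)
    "box":   rng.normal(size=(4, JOINT_DIM)),   # (x0, y0, x1, y1)
    "text":  rng.normal(size=(16, JOINT_DIM)),  # pretend 16-d text features
}

def encode(modality, features):
    # Project modality-specific features into the joint space.
    return np.asarray(features) @ proj[modality]  # -> (JOINT_DIM,)

prompts = [
    encode("point", [0.4, 0.7]),
    encode("text", rng.normal(size=16)),
]
queries = np.stack(prompts)             # one query set, mixed modalities
assert queries.shape == (2, JOINT_DIM)  # ready for a single mask decoder
```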

assets/compare.jpg

📑 Catalog

  • SEEM + Whisper Demo
  • SEEM + Whisper + Stable Diffusion Demo
  • Inference and installation code
  • Hugging Face Demo

💘 Acknowledgements

  • We appreciate Hugging Face for the GPU support on the demo!
