tcc's Introduction

Foundation models for vision

Text

Foundation models for vision, analogous to Large Language Models (LLMs) in natural language, are large transformer-based models trained on vast amounts of data. Like their language counterparts, they are designed to be broadly applicable and to be steered by user prompts rather than retrained for each task.
One example of such a vision model is CLIP, an image classification model driven by text prompts. CLIP encodes both text and images as vectors using transformer encoders and classifies an image by correlating its embedding with the embeddings of candidate text prompts. Its capabilities extend beyond image classification: generative models such as DALL-E use CLIP's joint text-image representation. CLIP was trained on a massive dataset of 400 million image-text pairs collected from the web.
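The zero-shot classification loop CLIP enables can be sketched with a publicly released checkpoint. Below is a minimal sketch using the Hugging Face transformers bindings; the image path and candidate labels are placeholders:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a public CLIP checkpoint (ViT-B/32 image encoder).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate classes are written as text prompts.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder image path

# Encode the prompts and the image, then correlate the embeddings.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```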
Building on the prompt-based classification that CLIP introduced, a recent work proposes spatial prompts: a red circle is drawn around the desired part of the image, and the model classifies the enclosed content. The underlying architecture follows the same concept as CLIP, encoding information with transformers, but the input now carries not only the image's pixels but also the prompted area.
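Since the prompt lives in pixel space, it can be reproduced by drawing on the image before encoding it. A minimal sketch with PIL, reusing the CLIP setup above (the region coordinates are hypothetical):

```python
from PIL import Image, ImageDraw

# Draw a red circle around the region of interest, then classify as before.
image = Image.open("example.jpg").convert("RGB")  # placeholder image path
draw = ImageDraw.Draw(image)

# Hypothetical (left, top, right, bottom) box bounding the region to mark.
draw.ellipse((80, 60, 220, 200), outline="red", width=4)

# `image` now carries the spatial prompt and is fed to the same CLIP
# pipeline shown above; the circle biases the score toward its content.
```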
Taking the idea of prompts even further, Segment Anything (SAM) emerges as the largest foundation model for vision to date. SAM specializes in segmenting the instances that correspond to a given prompt. To train SAM, its authors at FAIR developed a novel dataset comprising 11 million images and 1 billion masks. As in [reference to paper], prompts serve as references for SAM's instance segmentation, but they are no longer limited to a drawn area: SAM accepts spatial prompts in the form of points, bounding boxes, and rough segmentation masks. (Text prompts were explored by the authors but have not been released as part of SAM's current capabilities.) Out of the box, SAM works as a zero-shot instance segmentation model, and it can also be fine-tuned for specific datasets, making it a versatile tool for vision tasks.
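In practice, prompting SAM with a point looks like the following sketch using Meta's segment-anything package (the checkpoint file, image path, and click coordinates are assumptions):

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM checkpoint (file name is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground point prompt (label 1 = foreground, 0 = background).
point = np.array([[320, 240]])  # hypothetical click coordinates
label = np.array([1])

# multimask_output=True returns three candidate masks with quality scores.
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best_mask = masks[np.argmax(scores)]  # boolean mask of the chosen instance
```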

References


Tools built on top of SAM

Text

Upon reviewing recently published papers, it is evident that a phenomenon similar to what occurred with Large Language Models (LLMs) is now happening to foundation models for vision, particularly Segment Anything (SAM). A growing body of research builds on this versatile architecture, focusing on two main areas: evaluating SAM's performance in specific domains such as medical imaging and object tracking, and extending or modifying the model as an architectural reference for new ones. One such extension [reference "perSAM"] introduces a novel approach that turns SAM into a one-shot learning model. FastSAM, by contrast, builds an equivalent model on Convolutional Neural Networks (CNNs) instead of Transformers, reducing the parameter count and improving inference time for real-time applications. Another paper, //paper on tracking//, extends SAM's capabilities to object tracking across sequential images. Additionally, LangSAM adds text prompts to SAM, offering this functionality to the public before the original authors released theirs. Given these results and the growing number of similar efforts, SAM can clearly serve as a starting point for further segmentation research.
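As an illustration of how thin some of these wrappers are, LangSAM reduces text-prompted segmentation to a single call. A sketch following the lang-segment-anything project's README (the image path and prompt are placeholders, and the return signature may differ between versions):

```python
from PIL import Image
from lang_sam import LangSAM  # from the lang-segment-anything project

# LangSAM chains a text-grounded detector (GroundingDINO) with SAM:
# the text prompt yields boxes, and the boxes prompt SAM for masks.
model = LangSAM()
image = Image.open("example.jpg").convert("RGB")  # placeholder image path

masks, boxes, phrases, logits = model.predict(image, "cell")
```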

References


Instance Segmentation

  • U-Net
  • Mask R-CNN -> Cascade Mask R-CNN
  • Backbones (CNN -> Transformers)
  • Transformers -> ViT -> Swin

Datasets


Live Cell Analysis

  • Why? (Applications)
  • Pre-deep-learning methods and why deep learning is used
  • Get some references from Nabeel's master thesis
