Foundation models for vision, like Large Language Models (LLMs) in natural language processing, are large transformer-based models trained on vast amounts of data. Like their language counterparts, they are designed to be generally applicable and to accept prompts from users at inference time.
One example of such a vision model is CLIP, an image classification model that leverages text prompts. CLIP encodes both text and images as vectors using transformer encoders, and classifies an image by measuring the similarity between its embedding and the embeddings of candidate text prompts. CLIP's capabilities extend beyond image classification: generative models such as DALL-E use it to connect text and image representations. CLIP was trained on roughly 400 million image-text pairs collected from the web.
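The similarity step can be sketched with plain numpy. This is a toy illustration, not CLIP itself: the embeddings below are made-up vectors standing in for the outputs of the two transformer encoders.

```python
import numpy as np

def classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    # Normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                            # one similarity score per label
    probs = np.exp(sims) / np.exp(sims).sum()   # softmax over the labels
    return labels[int(np.argmax(probs))], probs

labels = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.1], [0.1, 1.0]])  # hypothetical text vectors
image_emb = np.array([0.9, 0.2])                # hypothetical image vector

pred, probs = classify(image_emb, text_embs, labels)
```

In the real model both encoders are trained jointly so that matching image-text pairs end up with high cosine similarity; the classification logic on top is essentially this comparison.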
Building on CLIP's use of prompts for image classification, a recent work introduces spatial prompts: a red circle is drawn directly onto the image around the region of interest, and the image is then classified based on the enclosed content. The underlying architecture follows the same concept as CLIP, encoding information with transformers, but the input now carries not just the image's pixels but also the prompted area.
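What makes this prompt "spatial" is that it lives in pixel space: the image itself is edited before being passed to the model. A minimal sketch of drawing such a ring on an RGB array (coordinates and sizes made up for illustration):

```python
import numpy as np

def draw_red_circle(image, center, radius, thickness=2):
    """Draw a red ring onto an H x W x 3 uint8 RGB image, in place."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.sqrt((ys - center[0]) ** 2 + (xs - center[1]) ** 2)
    ring = np.abs(dist - radius) <= thickness / 2
    image[ring] = (255, 0, 0)  # pure red in RGB
    return image

img = np.zeros((64, 64, 3), dtype=np.uint8)           # dummy black image
img = draw_red_circle(img, center=(32, 32), radius=10)
```

The marked image is then fed to an unmodified CLIP; the surprising finding of the paper is that CLIP's attention shifts toward the circled region without any retraining.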
Taking the idea of prompts even further, Segment Anything (SAM) emerges as the largest foundation model for vision to date. SAM segments the instances that correspond to a given prompt. To train it, its authors at FAIR built a novel dataset comprising 11 million images and over 1 billion masks. As in the red-circle work above, prompts serve as references for SAM's instance segmentation, but they are no longer limited to a drawn area: SAM accepts points, bounding boxes, and segmentation masks. (Text prompts were explored by the authors but have not been released as part of SAM's current capabilities.) SAM can be used out of the box as a zero-shot instance segmentation model, or fine-tuned for specific datasets, making it a versatile tool for vision tasks.
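The promptable-segmentation idea can be illustrated with a toy sketch. This is not SAM's actual architecture, just the interface it exposes: given candidate instance masks, a point prompt selects the instance containing the clicked pixel, and a box prompt selects the instance overlapping the box the most. All masks and prompts below are made up.

```python
import numpy as np

def segment_from_point(masks, point):
    """Return the first candidate mask containing the prompt point (row, col)."""
    for mask in masks:
        if mask[point]:
            return mask
    return None

def segment_from_box(masks, box):
    """Return the mask with the largest overlap with the (r0, c0, r1, c1) box."""
    r0, c0, r1, c1 = box
    overlaps = [mask[r0:r1, c0:c1].sum() for mask in masks]
    return masks[int(np.argmax(overlaps))]

# Two square "instances" on a 32 x 32 grid.
a = np.zeros((32, 32), dtype=bool); a[2:10, 2:10] = True
b = np.zeros((32, 32), dtype=bool); b[20:30, 20:30] = True

picked = segment_from_point([a, b], point=(25, 25))
```

In SAM itself the mapping from prompt to mask is learned end to end by a prompt encoder and mask decoder rather than looked up, but the input/output contract is the same: prompt in, instance mask out.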
- (CLIP) Learning Transferable Visual Models From Natural Language Supervision
- What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Segment Anything
Reviewing recently published papers, it is evident that a phenomenon similar to what happened with Large Language Models (LLMs) is now happening to foundation models for vision, particularly around Segment Anything (SAM). A growing body of new research builds on this versatile architecture. These studies focus on two main areas: evaluating SAM's performance in specific domains such as medical imaging and object tracking, and extending or modifying the model as an architectural reference for new models. One such extension, PerSAM, introduces a novel approach that turns SAM into a one-shot learning model. FastSAM, on the other hand, builds a model based on Convolutional Neural Networks (CNNs) instead of Transformers, reducing the parameter count and improving inference time for real-time applications. Other works, such as Segment Anything Meets Point Tracking, extend SAM's capabilities to object tracking across sequential images. Additionally, LangSAM adds text prompts to SAM, offering this functionality to the public before the original authors did. Given these results and the increasing number of similar research efforts, it is apparent that SAM can serve as a starting point for further segmentation research.
- Fast Segment Anything
- Personalize Segment Anything Model with One Shot
- Segment Anything Meets Point Tracking
- Segment and Track Anything
- Lang Segment Anything
- MedSAM
- U-Net
- Mask RCNN -> Cascade Mask RCNN
- Backbones (CNN -> Transformers)
- Transformers -> ViT -> Swin
- LIVECell
- PanNuke
- EVICAN
- A431 (?)
- BBBC038v1 from Broad Bioimage Benchmark Collection //Used as reference on SAM
- Why? (Applications)
- Pre deep learning and why using deep learning
- Get some references from Nabeel's master thesis