Demo combining Grounding DINO and Segment Anything! Right now, this is just a simple small project. We will continue to improve it and create more interesting demos.
- Segment Anything is a strong segmentation model. But it needs prompts (like boxes/points) to generate masks
- Grounding DINO is a strong zero-shot detector capable of generating high-quality boxes and labels from free-form text
- Combining the two models makes it possible to detect and segment everything with text inputs
- The combination of BLIP + GroundingDINO + SAM for automatic labeling
- The combination of GroundingDINO + SAM + Stable-diffusion for data-factory, generating new data
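The data flow behind the first three points can be sketched in a few lines: a text prompt goes to a detector that returns labeled boxes, and each box is then used as the prompt for a segmenter that returns a mask. The sketch below is only an illustration under stated assumptions. `toy_detector`, `toy_segmenter`, `TOY_VOCAB`, and `grounded_segment` are hypothetical stand-ins written for this example, not the Grounding DINO or Segment Anything APIs, and the "image" is a small grid of object ids rather than real pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates
Image = List[List[int]]          # toy image: each pixel holds an object id, 0 = background
Mask = List[List[bool]]


@dataclass
class Detection:
    label: str
    box: Box


@dataclass
class Segment:
    label: str
    box: Box
    mask: Mask


# Hypothetical mapping from a text phrase to the object id the toy detector looks for.
TOY_VOCAB = {"cat": 1, "dog": 2}


def toy_detector(image: Image, text_prompt: str) -> List[Detection]:
    """Stand-in for a text-prompted detector: one box per known phrase found in the image."""
    detections = []
    # Phrases in the toy prompt are separated by periods, e.g. "cat . dog".
    for phrase in (p.strip() for p in text_prompt.split(".")):
        target = TOY_VOCAB.get(phrase)
        if target is None:
            continue
        coords = [(x, y) for y, row in enumerate(image)
                  for x, value in enumerate(row) if value == target]
        if not coords:
            continue
        xs = [x for x, _ in coords]
        ys = [y for _, y in coords]
        detections.append(Detection(phrase, (min(xs), min(ys), max(xs), max(ys))))
    return detections


def toy_segmenter(image: Image, box: Box) -> Mask:
    """Stand-in for a box-prompted segmenter: marks foreground pixels inside the box."""
    x0, y0, x1, y1 = box
    return [
        [x0 <= x <= x1 and y0 <= y <= y1 and value != 0 for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]


def grounded_segment(
    image: Image,
    text_prompt: str,
    detector: Callable[[Image, str], List[Detection]] = toy_detector,
    segmenter: Callable[[Image, Box], Mask] = toy_segmenter,
) -> List[Segment]:
    """Text -> labeled boxes (detector) -> one mask per box (segmenter)."""
    return [
        Segment(d.label, d.box, segmenter(image, d.box))
        for d in detector(image, text_prompt)
    ]


image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 2],
    [0, 0, 0, 0, 2],
]
segments = grounded_segment(image, "cat . dog")
for seg in segments:
    # Print the label, the box used as the prompt, and the mask area in pixels.
    print(seg.label, seg.box, sum(map(sum, seg.mask)))
```

Passing the detector and segmenter as arguments keeps the two stages independent, which mirrors why the models combine well: the segmenter only needs a box, and it does not matter where that box came from.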
Grounded-SAM + Mask Segregation + Save to database
![Image description](https://private-user-images.githubusercontent.com/20881728/326605239-1b40f242-c644-4a5f-8a15-0d42de129656.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjExODgyNDEsIm5iZiI6MTcyMTE4Nzk0MSwicGF0aCI6Ii8yMDg4MTcyOC8zMjY2MDUyMzktMWI0MGYyNDItYzY0NC00YTVmLThhMTUtMGQ0MmRlMTI5NjU2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE3VDAzNDU0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTVhNWVjZDI5OTA1N2MyMjlkYTdmZTUwY2ZhY2QwMTllOGM2ZDQwMTNmYzkyMDljM2MyNTdmNWViN2UyNTVjMWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.cEuoEnABq4yBTmvWPACS_KOFXiT1CsaPZkJSG1MTBjY)
![Image description](https://private-user-images.githubusercontent.com/20881728/326604867-01814f02-3183-4030-acfa-1dcac14f784b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjExODgyNDEsIm5iZiI6MTcyMTE4Nzk0MSwicGF0aCI6Ii8yMDg4MTcyOC8zMjY2MDQ4NjctMDE4MTRmMDItMzE4My00MDMwLWFjZmEtMWRjYWMxNGY3ODRiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE3VDAzNDU0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTExNDllYzk4OGVjODk0MWRlYWYzMmE5NzYzNjExMzBiZTgzOTVkOGUxMmQxNmEzMmVjNWE2N2RmYTMzMDlmMzQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.NvLNUBoau2b3oHDa2mZC1Fmm60fp-TW8OvNOAZbNO_Y)
![Image description](https://private-user-images.githubusercontent.com/20881728/326604893-7e72bc50-395d-4db3-be92-f47279b60172.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjExODgyNDEsIm5iZiI6MTcyMTE4Nzk0MSwicGF0aCI6Ii8yMDg4MTcyOC8zMjY2MDQ4OTMtN2U3MmJjNTAtMzk1ZC00ZGIzLWJlOTItZjQ3Mjc5YjYwMTcyLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzE3VDAzNDU0MVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWRjOTMzM2MzM2ZkN2M2M2M0YjY5YzdmNjlmMzExOGQwNmY4OTJmZDc3OTAyYzJmZmI2MTc1NTZhMTYzN2YyYzMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DU0kUuQlyHb3EFWojAJA_uTwfO59uWLbQwLq29RqFi4)
List of Efficient SAM variants:
Title | Intro | Description | Links |
---|---|---|---|
FastSAM | | The Fast Segment Anything Model (FastSAM) is a CNN Segment Anything Model trained on only 2% of the SA-1B dataset published by the SAM authors. FastSAM achieves performance comparable to SAM at 50× higher run-time speed. | [Github] [Demo] |
MobileSAM | | MobileSAM performs on par with the original SAM (at least visually) and keeps exactly the same pipeline as the original SAM except for a change in the image encoder. Specifically, we replace the original heavyweight ViT-H encoder (632M) with a much smaller Tiny-ViT (5M). On a single GPU, MobileSAM runs in around 12 ms per image: 8 ms on the image encoder and 4 ms on the mask decoder. | [Github] |
Light-HQSAM | | Light HQ-SAM is based on the Tiny-ViT image encoder provided by MobileSAM. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with ViT features for improved mask details. Refer to Light HQ-SAM vs. MobileSAM for more details. | [Github] |
Efficient-SAM | | Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. However, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained light-weight image encoders and a mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. Refer to EfficientSAM arXiv for more details. | [Github] |
Edge-SAM | | EdgeSAM involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. Refer to Edge-SAM arXiv for more details. | [Github] |
RepViT-SAM | | Recently, RepViT achieved a state-of-the-art performance and latency trade-off on mobile devices by incorporating efficient architectural designs of ViTs into CNNs. Here, to segment anything in real time on mobile devices, following MobileSAM, we replace the heavyweight image encoder in SAM with the RepViT model, ending up with the RepViT-SAM model. Extensive experiments show that RepViT-SAM has significantly better zero-shot transfer capability than MobileSAM, along with nearly 10× faster inference speed. Refer to RepViT-SAM arXiv for more details. | [Github] |
Configuration reference at https://huggingface.co/docs/hub/spaces-config-reference