
masa's Introduction


Matching Anything By Segmenting Anything [CVPR24 Highlight]

[ Project Page ] [ ArXiv ]

Computer Vision Lab, ETH Zurich


News and Updates

  • 2024.06: MASA code is released!
  • 2024.04: MASA is selected as a CVPR Highlight!

Overview

This is the repository for MASA, a universal instance appearance model for matching any object in any domain. MASA can be added on top of any detection or segmentation model to help it track any objects it has detected.


Introduction

The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT). Current methods predominantly rely on labeled domain-specific video datasets, which limits the cross-domain generalization of learned similarity embeddings. We propose MASA, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabeled static images, achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences, in zero-shot association.

Results on Open-vocabulary MOT Benchmark

| Method | Base TETA | Base AssocA | Novel TETA | Novel AssocA | Model |
|---|---|---|---|---|---|
| OVTrack (CVPR23) | 35.5 | 36.9 | 27.8 | 33.6 | - |
| MASA-R50 🔥 | 46.5 | 43.0 | 41.1 | 42.7 | HF🤗 |
| MASA-Sam-vitB | 47.2 | 44.5 | 41.4 | 42.3 | HF🤗 |
| MASA-Sam-vitH | 47.5 | 45.1 | 40.5 | 40.5 | HF🤗 |
| MASA-Detic | 47.7 | 44.1 | 41.5 | 41.6 | HF🤗 |
| MASA-GroundingDINO 🔥 | 47.3 | 44.7 | 41.9 | 44.0 | HF🤗 |
  • We use Detic-SwinB as the open-vocabulary detector to provide detections for all our variants.
  • MASA-R50: MASA with a ResNet-50 backbone. It is a fast, standalone model that does not reuse backbone features from other detection or segmentation foundation models, so it must be paired with a separate detector. It is trained in the same way as the other MASA variants.

Model Zoo

Check out our model zoo for more detailed benchmark performance for different models.

Benchmark Testing

If you want to test our tracker on standard benchmarks, please refer to benchmark_test.md.

More results

See more results on our project page!

Installation

Please refer to INSTALL.md

Demo Run

Preparation

  1. First, create a folder named saved_models in the root directory of the project. Then, download the following models and put them in the saved_models folder.

    a). Download the MASA-GroundingDINO weights and save them as saved_models/masa_models/gdino_masa.pth.

  2. (Optional) Second, download the demo videos and put them in the demo folder. We provide two short videos for testing (minions_rush_out.mp4 and giraffe_short.mp4). You can download more demo videos here.

  3. Finally, create the demo_outputs folder in the root directory of the project to save the output videos.
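
After these steps, the relevant part of the project tree should look roughly like the sketch below (only the files used by the demos are shown; the SAM weights are downloaded later and are only needed for the mask demo):

<project root>/
    saved_models/
        masa_models/
            gdino_masa.pth
        pretrain_weights/
            sam_vit_h_4b8939.pth    # added later, only needed for Demo 3 (with Mask)
    demo/
        minions_rush_out.mp4
        giraffe_short.mp4
    demo_outputs/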

Demo 1:


python demo/video_demo_with_text.py demo/minions_rush_out.mp4 --out demo_outputs/minions_rush_out_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "yellow_minions" --score-thr 0.2 --unified --show_fps
  • --texts: the object classes you want to track. If there are multiple classes, separate them like this: "giraffe . lion . zebra" (see the example after this list). Please note that the --texts option is currently only available for open-vocabulary detectors.
  • --out: the output video path.
  • --score-thr: the confidence threshold for visualizing detected objects.
  • --detector_type: the detector type. We support mmdet and yolo-world (soon).
  • --unified: whether to use the unified model.
  • --no-post: disable post-processing. Post-processing is enabled by default; adding this flag turns it off. The post-processing uses MASA tracking to reduce the jitter caused by the detector.
  • --show_fps: whether to show the FPS.
  • --sam_mask: whether to visualize the mask results generated by SAM.
  • --fp16: whether to use fp16 mode.
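
For example, to track several classes at once on the provided giraffe_short.mp4 demo video, the classes are passed as a single string separated by " . " (this is an illustrative combination of the flags above; adjust the score threshold for your own video):

python demo/video_demo_with_text.py demo/giraffe_short.mp4 --out demo_outputs/giraffe_short_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "giraffe . lion . zebra" --score-thr 0.2 --unified --show_fps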

The hyperparameters of the tracker can be found in the corresponding config files, such as configs/masa-gdino/masa_gdino_swinb_inference.py. The current values are tuned for the demo video; you can adjust them for your own videos and needs.
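
If you are unsure which hyperparameters are available, one quick way to inspect a config is via mmengine (installed as part of the mmdetection dependencies). The snippet below is a minimal sketch that only reads and dumps the file; the dumped copy is a hypothetical filename you can edit and pass via --masa_config:

from mmengine.config import Config

# Load the inference config used in Demo 1 and print its resolved contents,
# including the tracker settings, so you can see which values can be tuned.
cfg = Config.fromfile('configs/masa-gdino/masa_gdino_swinb_inference.py')
print(cfg.pretty_text)

# Optionally save a private copy to edit and pass to the demo script.
cfg.dump('my_masa_gdino_inference.py')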

Demo 2:


Download sora_fish_10s.mp4 and put it in the demo folder.

python demo/video_demo_with_text.py demo/sora_fish_10s.mp4 --out demo_outputs/sora_fish_10s_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "fish" --score-thr 0.1 --unified --show_fps

Demo 3 (with Mask):


a). Download the SAM-H weights and save them as saved_models/pretrain_weights/sam_vit_h_4b8939.pth.

b). Download carton_kangaroo_dance.mp4 and put it in the demo folder.

python demo/video_demo_with_text.py demo/carton_kangaroo_dance.mp4 --out demo_outputs/carton_kangaroo_dance_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "kangaroo" --score-thr 0.4 --unified --show_fps --sam_mask

Plug-and-Play MASA Tracker

You can directly use any detector along with our different MASA variants to track any object.

Demo with YOLOX detector:

Here is an example of how to use the MASA adapter with the YOLOX detector pretrained on COCO.

Download the YOLOX COCO detector weights from here and save them as saved_models/pretrain_weights/yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth.

Download the MASA-R50 or MASA-GroundingDINO weights and put them in saved_models/masa_models/.

Demo 1:


Run the demo with the following command (change the config and checkpoint paths accordingly if you use different detectors or MASA models):

python demo/video_demo_with_text.py demo/giraffe_short.mp4 --out demo_outputs/giraffe_short_outputs.mp4 --det_config projects/mmdet_configs/yolox/yolox_x_8xb8-300e_coco.py --det_checkpoint saved_models/pretrain_weights/yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth --masa_config configs/masa-one/masa_r50_plug_and_play.py --masa_checkpoint saved_models/masa_models/masa_r50.pth --score-thr 0.3 --show_fps

Demo with CO-DETR detector:

Here are examples of how to use the MASA adapter with the CO-DETR detector pretrained on COCO.

Download the CO-DETR-R50 COCO detector weights from here and save them as saved_models/pretrain_weights/co_dino_5scale_lsj_r50_3x_coco-fe5a6829.pth.

Demo 1:


Download driving_10s.mp4 and put it in the demo folder.

python demo/video_demo_with_text.py demo/driving_10s.mp4 --out demo_outputs/driving_10s_outputs.mp4 --det_config projects/CO-DETR/configs/codino/co_dino_5scale_r50_lsj_8xb2_3x_coco.py --det_checkpoint saved_models/pretrain_weights/co_dino_5scale_lsj_r50_3x_coco-fe5a6829.pth --masa_config configs/masa-one/masa_r50_plug_and_play.py --masa_checkpoint saved_models/masa_models/masa_r50.pth --score-thr 0.3 --show_fps

Demo 2:


Download zebra-drone.mp4 and put it in the demo folder.

python demo/video_demo_with_text.py demo/zebra-drone.mp4 --out demo_outputs/zebra-drone_outputs.mp4 --det_config projects/CO-DETR/configs/codino/co_dino_5scale_r50_lsj_8xb2_3x_coco.py --det_checkpoint saved_models/pretrain_weights/co_dino_5scale_lsj_r50_3x_coco-fe5a6829.pth --masa_config configs/masa-one/masa_r50_plug_and_play.py --masa_checkpoint saved_models/masa_models/masa_r50.pth --score-thr 0.2 --show_fps
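
If you want to process several clips with the same plug-and-play setup, a small driver script can simply invoke the demo once per video. The sketch below is illustrative only (the video list and output names are placeholders) and reuses the CO-DETR + MASA-R50 arguments from the command above:

import subprocess
from pathlib import Path

# Hypothetical list of input clips; replace with your own videos in demo/.
videos = ["demo/driving_10s.mp4", "demo/zebra-drone.mp4"]

for video in videos:
    # Derive an output name such as demo_outputs/driving_10s_outputs.mp4.
    out = Path("demo_outputs") / (Path(video).stem + "_outputs.mp4")
    subprocess.run([
        "python", "demo/video_demo_with_text.py", video,
        "--out", str(out),
        "--det_config", "projects/CO-DETR/configs/codino/co_dino_5scale_r50_lsj_8xb2_3x_coco.py",
        "--det_checkpoint", "saved_models/pretrain_weights/co_dino_5scale_lsj_r50_3x_coco-fe5a6829.pth",
        "--masa_config", "configs/masa-one/masa_r50_plug_and_play.py",
        "--masa_checkpoint", "saved_models/masa_models/masa_r50.pth",
        "--score-thr", "0.2",
        "--show_fps",
    ], check=True)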

Roadmap:

Here are some of the things we are working on; please let us know if you have any suggestions or requests:

  • [ ] Release the unified model with the YOLO-World detector for fast open-vocabulary tracking.
  • [ ] Release the training code for turning your own detector into a strong tracker using unlabeled images from your domain.
  • [x] Release the plug-and-play MASA model, compatible with any detection and segmentation models.
  • [x] Release the benchmark testing on TAO and BDD100K.
  • [x] Release the pre-trained unified models from the paper and the inference demo code.

Limitations:

MASA is a universal instance appearance model that can be added on top of any detection and segmentation models to help them track the objects they have detected. However, there are still some limitations:

  • MASA cannot track objects that the detector has not detected.
  • MASA cannot fix inconsistent detections from the detector. If the detector produces inconsistent detections across video frames, the results will flicker.
  • MASA is trained purely on unlabeled static images and may not work well in scenarios with heavy occlusion and noisy detections. Directly using RoI Align on noisy or occluded objects yields suboptimal features for occlusion handling. We are working on improving tracking performance in such scenarios.

Contact

For questions, please contact Siyuan Li.

Official Citation

@article{masa,
  author    = {Li, Siyuan and Ke, Lei and Danelljan, Martin and Piccinelli, Luigi and Segu, Mattia and Van Gool, Luc and Yu, Fisher},
  title     = {Matching Anything By Segmenting Anything},
  journal   = {CVPR},
  year      = {2024},
}

Acknowledgments

The authors would like to thank Bin Yan for his help and discussions. Our code is built on mmdetection, OVTrack, TETA, and YOLO-World. If you find our work useful, please also consider checking out their work.


masa's Issues

colab or HF demo

Hi 👋🏻

Awesome project! Is there any chance you could create a demo notebook or HF space?

Hugging Face Hub integration

Hi @siyuanliii,

Thanks for this great work! I am excited for the code release!

I am reaching out about integrating with the 🤗 hub from day 0, so you can track download numbers, have a nice model card, and anyone can automatically load the model using from_pretrained (and push it using push_to_hub, similar to models in the Transformers library). It leverages the PyTorchModelHubMixin class, which allows the model to inherit these methods.

Usage can be as follows:

from masa.models import MasaModel

model = MasaModel.from_pretrained("<your-username-on-hf-hub>/<model-name>")

This means people don't need to manually download a checkpoint first in their local environment, it just loads it automatically from the hub. The safetensors format is used to ensure safe serialization of the weights rather than pickle.

Integration is usually really simple; you just need to make your PyTorch model class inherit from nn.Module together with the PyTorchModelHubMixin class. For example:

import torch
import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin

class MasaModel(
        nn.Module,
+        PyTorchModelHubMixin,
+        library_name="masa",
+        repo_url="https://github.com/siyuanliii/masa",
        # ^ Optional metadata to generate model card
    ):
    def __init__(self, ...):
        ...

Then, you will be able to push the pretrained model to the hub to make it discoverable:

from masa.models import MasaModel

model = MasaModel()
model.save_pretrained("<your-username-on-hf-hub>/<model-name>", push_to_hub=True)

Would you consider this integration? Please, let me know if you have any questions or concerns!

Thank you,
Pavel Iakubovskii
ML Engineer @ HF 🤗

About the universality of MASA

Hello,
Your README says MASA can be added on top of any detection and segmentation models to help them track any objects they have detected.

How can MASA be used with a YOLO detector?

Does it use only the location of the box, or also the image content of the box area?

Best wishes,
tongchangD

The Model Download

Access to the HF model download is not yet public. Do you have any plans for when it will be made public? I'd love to try the demo!

Asking for training details.

Hello, first of all, thank you for your nice work. I have a question about training details that are not shown in the paper: can you share which GPUs you used for training and what the GPU memory usage was like?

How to use Detic instead of GroundingDINO as the detector?

Hi @siyuanliii , I want to perform open-vocabulary detection and tracking using Detic. However, I am not sure how to change the detector from GroundingDINO to Detic. Can you provide some instructions on how to do it?

I downloaded the Detic model from here to saved_models/masa_models directory. But how do I create a config for Detic, etc.? Would appreciate your help with this!

Also, if I want to use OWLv2 as the open-vocabulary 2D model, how would I do that?

Thanks!

question on pipeline for training

In the paper, you utilize MASA with SAM and with detector models like Grounding DINO. I did not understand the inference pipeline with detectors and how to apply it to other domains.

Is the MASA module first trained with SAM, with the detector head then removed, so that MASA is used with Grounding DINO as a feature extractor? If MASA is to be applied to other domains without exhaustive SAM segmentations, can it be used with other segmentation modules? How much data does MASA require to develop good features?

Model download shows unauthorized!

Thanks for this great work! I am excited about the code release!

When I try to download any of the models, it shows unauthorized:
"Access to model dereksiyuanli/masa is restricted. You must be authenticated to access it." How should this be solved? Thank you!

resource package only for unix

I am encountering an error during installation where "resource" is missing. Using pip install python-resources does not resolve the issue. It appears resource is a Unix-specific package and won't work on Windows.

Is it okay to remove the resource_limit function? How else do you recommend I set this file limit?

Would you consider supporting the YOLOv8-p2 model?

Thanks for your excellent work!

I noticed that the MASA config does not include the latest YOLO networks (YOLOv8-v10).

May I ask if you are considering supporting the YOLOv8-p2 model? Or can you support the standard YOLOv8 model?

Running the code with GPU

Dear authors,

First of all, I appreciate your work, it's awesome.

I am running MASA with Anaconda on Windows. The code uses almost all 32 GB of RAM on my computer, and it took around 45 minutes to analyse a 90-second video.

I would like to use my computer's 8 GB GPU to accelerate the computation.

Could you please help me with how to do this?

Sincerely

How to specify multiple objects with "--texts"

[solved] I restarted the server and it's fine.

Hello, how do I specify multiple objects with "--texts"? I tried to enter the command according to the format in the README file, but it gives an error.

eg:
input -> --texts "cabinet . hand"
error --> video_demo_with_text.py: error: unrecognized arguments: . hand

AttributeError: 'DetDataSample' object has no attribute 'text'

python demo/video_demo_with_text.py E:\masa-main\short.mp4 --out E:\masa-main\shoutput.mp4 --det_config projects/mmdet_configs/yolox/yolox_x_8xb8-300e_coco.py --det_checkpoint E:\masa-main\saved_models\pretrain_weights\yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint E:\masa-main\saved_models\gdino_masa.pth --texts "girl" --score-thr 0.3
Loads checkpoint by local backend from path: E:\masa-main\saved_models\pretrain_weights\yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth
Loads checkpoint by local backend from path: E:\masa-main\saved_models\gdino_masa.pth
E:\masa-main\masa\apis\masa_inference.py:97: UserWarning: dataset_meta or class names are not saved in the checkpoint's meta data, use COCO classes by default.
warnings.warn(
E:\masa-main\masa\apis\masa_inference.py:108: UserWarning: palette does not exist, random is used by default. You can also set the palette to customize.
warnings.warn(
e:\masa-main\.conda\Lib\site-packages\mmengine\visualization\visualizer.py:196: UserWarning: Failed to add <class 'mmengine.visualization.vis_backend.LocalVisBackend'>, please provide the save_dir argument.
warnings.warn(f'Failed to add {vis_backend.__class__}, '
[                              ] 0/799, elapsed: 0s, ETA:
e:\masa-main\.conda\Lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\TensorShape.cpp:3527.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "E:\masa-main\demo\video_demo_with_text.py", line 258, in <module>
    main()
  File "E:\masa-main\demo\video_demo_with_text.py", line 182, in main
    track_result = inference_masa(masa_model, frame, frame_id=frame_idx,
  File "E:\masa-main\masa\apis\masa_inference.py", line 259, in inference_masa
    result = model.test_step(data)[0]
  File "e:\masa-main\.conda\Lib\site-packages\mmengine\model\base_model\base_model.py", line 145, in test_step
    return self._run_forward(data, mode='predict')  # type: ignore
  File "e:\masa-main\.conda\Lib\site-packages\mmengine\model\base_model\base_model.py", line 361, in _run_forward
    results = self(**data, mode=mode)
  File "e:\masa-main\.conda\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "e:\masa-main\.conda\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\masa-main\.conda\Lib\site-packages\mmdet\models\mot\base.py", line 110, in forward
    return self.predict(inputs, data_samples, **kwargs)
  File "E:\masa-main\masa\models\mot\masa.py", line 347, in predict
    img_data_sample = self.detector.predict(
  File "E:\masa-main\masa\models\detectors\gdino_masa.py", line 103, in predict
    text_prompts.append(data_samples.text)
AttributeError: 'DetDataSample' object has no attribute 'text'


Test model on Google Colab

Hi, Thanks for your amazing work.
I wanted to test the model on Colab but I got the following error.
Please let me know how I can fix it.
You can find the notebook here.

!python demo/video_demo_with_text.py demo/minions_rush_out.mp4 --out demo_outputs/minions_rush_out_outputs.mp4 --masa_config configs/masa-gdino/masa_gdino_swinb_inference.py --masa_checkpoint saved_models/masa_models/gdino_masa.pth --texts "yellow_minions" --score-thr 0.2 --unified --show_fps

Traceback (most recent call last):
  File "/content/masa/demo/video_demo_with_text.py", line 19, in <module>
    from mmdet.apis import init_detector
  File "/usr/local/lib/python3.10/dist-packages/mmdet/apis/__init__.py", line 2, in <module>
    from .det_inferencer import DetInferencer
  File "/usr/local/lib/python3.10/dist-packages/mmdet/apis/det_inferencer.py", line 22, in <module>
    from mmdet.evaluation import INSTANCE_OFFSET
  File "/usr/local/lib/python3.10/dist-packages/mmdet/evaluation/__init__.py", line 4, in <module>
    from .metrics import *  # noqa: F401,F403
  File "/usr/local/lib/python3.10/dist-packages/mmdet/evaluation/metrics/__init__.py", line 5, in <module>
    from .coco_metric import CocoMetric
  File "/usr/local/lib/python3.10/dist-packages/mmdet/evaluation/metrics/coco_metric.py", line 16, in <module>
    from mmdet.datasets.api_wrappers import COCO, COCOeval, COCOevalMP
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/__init__.py", line 31, in <module>
    from .utils import get_loading_pipeline
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/utils.py", line 5, in <module>
    from mmdet.datasets.transforms import LoadAnnotations, LoadPanopticAnnotations
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/transforms/__init__.py", line 6, in <module>
    from .formatting import (ImageToTensor, PackDetInputs, PackReIDInputs,
  File "/usr/local/lib/python3.10/dist-packages/mmdet/datasets/transforms/formatting.py", line 11, in <module>
    from mmdet.structures.bbox import BaseBoxes
  File "/usr/local/lib/python3.10/dist-packages/mmdet/structures/bbox/__init__.py", line 2, in <module>
    from .base_boxes import BaseBoxes
  File "/usr/local/lib/python3.10/dist-packages/mmdet/structures/bbox/base_boxes.py", line 9, in <module>
    from mmdet.structures.mask.structures import BitmapMasks, PolygonMasks
  File "/usr/local/lib/python3.10/dist-packages/mmdet/structures/mask/__init__.py", line 3, in <module>
    from .structures import (BaseInstanceMasks, BitmapMasks, PolygonMasks,
  File "/usr/local/lib/python3.10/dist-packages/mmdet/structures/mask/structures.py", line 12, in <module>
    from mmcv.ops.roi_align import roi_align
  File "/usr/local/lib/python3.10/dist-packages/mmcv/ops/__init__.py", line 3, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/usr/local/lib/python3.10/dist-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/usr/local/lib/python3.10/dist-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory

Prompt Engineering Techniques

Can you tell me what kind of prompt words should generally be supplied to the --texts parameter to recognize an object accurately? Directly entering the name of the female lead of a movie would definitely not be recognized. Besides "girl" and "woman", can appearance descriptions be used for detection?

About processing time

Hello, thanks for MASA. I want to use it in my SLAM system (mainly for dynamic object tracking), but I am running into some trouble.

When I run Demo 1, it takes about 260 s to process a 4-second video. Is that expected?

nvidia-smi shows that Python is using 2.6 GB of memory, so I think CUDA is working, but I would still like to know whether I am doing things right.

It is a little slow, although it works very well. Is there any way to make it run faster?
