
regionclip's Introduction

RegionCLIP: Region-based Language-Image Pretraining

This is the official PyTorch implementation of RegionCLIP (CVPR 2022).

Paper | Demo on Hugging Face | Slides

RegionCLIP: Region-based Language-Image Pretraining (CVPR 2022)
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao

Overview

We propose RegionCLIP, which significantly extends CLIP to learn region-level visual representations. RegionCLIP enables fine-grained alignment between image regions and textual concepts, and thus supports region-based reasoning tasks including zero-shot object detection and open-vocabulary object detection.

  • Pretraining: We leverage a CLIP model to match image regions with template captions, and then pretrain our model to align these region-text pairs.
  • Zero-shot inference: Once pretrained, the learned region representations support zero-shot inference for object detection.
  • Transfer learning: The learned RegionCLIP model can be further fine-tuned with additional object detection annotations, allowing our model to be used for fully supervised or open-vocabulary object detection.
  • Results: Our method demonstrates state-of-the-art results for zero-shot object detection and open-vocabulary object detection.

Updates

  • 💥 [10/05/2022] RegionCLIP now supports not only ResNet but also many vision transformers (e.g., ViT, Swin, DaViT, FocalNet) for zero-shot object detection! Please check out the zero-shot branch!
  • [09/23/2022] As requested by researchers, we release the configs and scripts of pre-training. A full tutorial and pre-training data will be released later. Stay tuned!
  • [09/18/2022] Organizing ECCV Workshop Computer Vision in the Wild (CVinW), where two challenges are hosted to evaluate the zero-shot, few-shot and full-shot performance of pre-trained vision models in downstream tasks.
  • [07/11/2022] We included the scripts for concept feature extraction. It can be used for your own customized concept pool!
  • [07/07/2022] We included the scripts for region feature extraction. The extracted visual features can be used for various downstream tasks!
  • [06/24/2022] We released a Web demo using Gradio on Hugging Face. It uses our pretrained RegionCLIP for zero-shot inference. Check it out!
  • [06/20/2022] We released models and inference code for our RegionCLIP!

Outline

  1. Installation
  2. Datasets
  3. Model Zoo
  4. Zero-shot Inference
  5. Transfer Learning
  6. Extract Region Features
  7. Extract Concept Features
  8. Citation and Acknowledgement
  9. Contributing

Installation

Check INSTALL.md for installation instructions.

Datasets

Check datasets/README.md for dataset preparation.

Model Zoo

Check MODEL_ZOO.md for our pretrained models.

Zero-shot Inference

After pretraining, RegionCLIP can directly support the challenging zero-shot object detection task without fine-tuning on detection annotations. Given an input image, our pretrained RegionCLIP matches image region features to object concept embeddings, and can thus classify image regions into a wide range of object categories. The image regions are produced by a region localizer (e.g., RPN), and the object class names come from a dictionary specified by the user.
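To make the matching step concrete, the sketch below illustrates the idea (a conceptual example with random tensors, not the repository's API): each RoI-pooled region feature is compared against every concept embedding by cosine similarity, and the best-matching concept gives the region's label.

import torch
import torch.nn.functional as F

# Illustrative shapes only; the RN50 class embeddings are 1024-d ([num_classes, 1024]).
num_regions, num_classes, dim = 4, 1203, 1024
region_feats = torch.randn(num_regions, dim)   # stand-in for RoI-pooled visual features
concept_embs = torch.randn(num_classes, dim)   # stand-in for e.g. lvis_1203_cls_emb.pth

# Cosine similarity between every region and every concept embedding.
similarity = F.normalize(region_feats, dim=-1) @ F.normalize(concept_embs, dim=-1).T
scores, labels = similarity.max(dim=-1)        # per-region best concept and its score
print(labels.tolist(), scores.tolist())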

Visualization on custom images

We provide an example below for zero-shot object detection with pretrained RegionCLIP on custom images and for visualizing the results.

Before detecting objects, please prepare pretrained models, label files, and the custom images. See details below.
  • Check MODEL_ZOO.md to
    • download the pretrained model checkpoint regionclip_pretrained-cc_rn50x4.pth (RegionCLIP with ResNet50x4) to the folder ./pretrained_ckpt/regionclip.
    • download the class embeddings lvis_1203_cls_emb_rn50x4.pth to the folder ./pretrained_ckpt/concept_emb.
  • Check datasets/README.md to download LVIS label file lvis_v1_val.json and put it in the folder ./datasets/lvis/lvis_v1_val.json. The file is used to specify object class names.
  • Put all custom images in the folder ./datasets/custom_images/.
After preparation, run the following script to detect objects.
python3 ./tools/train_net.py \
--eval-only \
--num-gpus 1 \
--config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_custom_img.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50x4.pth \
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb_rn50x4.pth \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
MODEL.CLIP.TEXT_EMB_DIM 640 \
MODEL.RESNETS.DEPTH 200 \
MODEL.ROI_BOX_HEAD.POOLER_RESOLUTION 18 \
The detection results will be stored in the file ./output/inference/lvis_instances_results.json. To visualize them, run the script below.
 python ./tools/visualize_json_results.py \
--input ./output/inference/lvis_instances_results.json \
--output ./output/regions \
--dataset lvis_v1_val_custom_img \
--conf-threshold 0.05 \
--show-unique-boxes \
--max-boxes 25 \
--small-region-px 8100 \

The visualized images will be placed at ./output/regions/.

In this example, the detection results come from our pretrained RegionCLIP with the ResNet50x4 architecture. The regions are proposed by an RPN trained on 866 object categories from the LVIS dataset. For this visualization example, we use the 1203 object class names from the LVIS dataset. We also include an example in visualize_zeroshot_inference.sh with our pretrained RegionCLIP (ResNet50 architecture).
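If you prefer to inspect ./output/inference/lvis_instances_results.json programmatically rather than visualize it, the minimal sketch below loads the file; it only assumes a JSON list of per-detection records (the exact fields may vary, so it just prints what is present).

import json

with open("./output/inference/lvis_instances_results.json") as f:
    detections = json.load(f)

print(f"{len(detections)} detections in total")
if detections:
    # Print the field names of the first record instead of assuming a fixed schema.
    print("fields:", sorted(detections[0].keys()))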

Evaluation for zero-shot inference

We provide an example below for evaluating our pretrained RegionCLIP (ResNet50) using ground-truth boxes on the COCO dataset. This will reproduce our results in Table 4 of the paper.

Before evaluation, please prepare pretrained models and set up the dataset.
  • Check MODEL_ZOO.md to
    • download the pretrained RegionCLIP checkpoint regionclip_pretrained-cc_rn50.pth to the folder ./pretrained_ckpt/regionclip.
    • download the class embeddings coco_65_cls_emb.pth to the folder ./pretrained_ckpt/concept_emb.
  • Check datasets/README.md to set up COCO dataset.
After preparation, run the following script to evaluate the pretrained model in zero-shot inference setting.
python3 ./tools/train_net.py \
--eval-only  \
--num-gpus 1 \
--config-file ./configs/COCO-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_ovd_zsinf.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \
MODEL.CLIP.CROP_REGION_TYPE GT \
MODEL.CLIP.MULTIPLY_RPN_SCORE False \

For more examples, please refer to test_zeroshot_inference.sh. This script covers a wide range of combinations of pretrained models (ResNet50, ResNet50x4), datasets (COCO, LVIS), and region proposal types (ground-truth regions, RPN proposals). Also, please refer to MODEL_ZOO.md for available trained models and datasets/README.md for setting up the COCO and LVIS datasets.

Transfer Learning

Our pretrained RegionCLIP can be further fine-tuned when human annotations of objects are available. In this transfer learning setting, we demonstrate results on open-vocabulary object detection, where the object detector is trained on base categories and evaluated on both base and novel categories.

We show an example for running a trained detector on custom images. Further, we provide training and evaluation scripts for the open-vocabulary object detection benchmarks on the COCO and LVIS datasets (Tables 1 & 2 in the paper).

Visualization on custom images

We provide an example below for running a trained open-vocabulary object detector on custom images and visualizing the results. In this example, the detector is initialized with RegionCLIP (RN50x4), trained on the 866 LVIS base categories, and tasked with detecting all 1203 LVIS categories.

Before detecting objects, please prepare the trained detectors, label files, and the custom images.
  • Check MODEL_ZOO.md to
    • download the trained detector checkpoint regionclip_finetuned-lvis_rn50x4.pth to the folder ./pretrained_ckpt/regionclip.
    • download the trained RPN checkpoint rpn_lvis_866_lsj.pth to the folder ./pretrained_ckpt/rpn.
    • download the class embeddings lvis_1203_cls_emb_rn50x4.pth to the folder ./pretrained_ckpt/concept_emb.
  • Check datasets/README.md to download label file lvis_v1_val.json and put it in the folder ./datasets/lvis/lvis_v1_val.json.
  • Put all custom images in the folder ./datasets/custom_images/.
After preparation, run the following script to detect objects and visualize the results.
# for simplicity, we integrate the script in visualize_transfer_learning.sh
bash visualize_transfer_learning.sh

The visualized images will be placed at ./output/regions/.

Evaluate the trained detectors

We provide an example below for evaluating our open-vocabulary object detector, initialized with RegionCLIP (ResNet50) and trained on the COCO dataset.

Before evaluation, please prepare the trained detector and set up the dataset.
  • Check MODEL_ZOO.md to
    • download the trained detector checkpoint regionclip_finetuned-coco_rn50.pth to the folder ./pretrained_ckpt/regionclip,
    • download the trained RPN checkpoint rpn_coco_48.pth to the folder ./pretrained_ckpt/rpn,
    • download the class embeddings coco_48_base_cls_emb.pth and coco_65_cls_emb.pth to the folder ./pretrained_ckpt/concept_emb.
  • Check datasets/README.md to set up COCO dataset.
After preparation, run the following script to evaluate the trained open-vocabulary detector.
python3 ./tools/train_net.py \
--eval-only  \
--num-gpus 1 \
--config-file ./configs/COCO-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_ovd.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_finetuned-coco_rn50.pth \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml \
MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_coco_48.pth \
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_48_base_cls_emb.pth \
MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \
MODEL.ROI_HEADS.SOFT_NMS_ENABLED True \

For more examples, please refer to test_transfer_learning.sh. This script includes benchmark evaluation for various combinations of trained detectors (ResNet50, ResNet50x4) and datasets (COCO, LVIS). Also, please refer to MODEL_ZOO.md for available trained models and datasets/README.md for setting up the COCO and LVIS datasets.

Train detectors on your own

We provide an example below for training an open-vocabulary object detector on the COCO dataset, with pretrained RegionCLIP (ResNet50) as the initialization.

Before training, please prepare our pretrained RegionCLIP model and set up the dataset.
  • Check MODEL_ZOO.md to
    • download the pretrained RegionCLIP checkpoint regionclip_pretrained-cc_rn50.pth to the folder ./pretrained_ckpt/regionclip,
    • download the trained RPN checkpoint rpn_coco_48.pth to the folder ./pretrained_ckpt/rpn,
    • download the class embeddings coco_48_base_cls_emb.pth and coco_65_cls_emb.pth to the folder ./pretrained_ckpt/concept_emb.
  • Check datasets/README.md to set up COCO dataset.
After preparation, run the following script to train an open-vocabulary detector.
python3 ./tools/train_net.py \
--num-gpus 1 \
--config-file ./configs/COCO-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_ovd.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_C4_1x_ovd_FSD.yaml \
MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_coco_48.pth \
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_48_base_cls_emb.pth \
MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/coco_65_cls_emb.pth \

For more examples, please refer to train_transfer_learning.sh. This script provides training scripts for various combinations of detector backbones (ResNet50, ResNet50x4) and datasets (COCO, LVIS). Also, please refer to MODEL_ZOO.md for available trained models and datasets/README.md for setting up the COCO and LVIS datasets.

Extract Region Features

We provide scripts for extracting region features from our pre-trained RegionCLIP. Given a folder of images, our scripts extract region features (along with other detection results such as box coordinates) and save them as local files.

The following is an example using pretrained RegionCLIP with ResNet50. We extend the scripts from zero-shot inference (section above) with minor changes, such as the input and output folders.

The following is a brief introduction to the settings.

We enable feature extraction for two types of regions:

  • RPN regions: This setting is class-agnostic. The regions are the top-scored RPN proposals.

  • Detection regions: This setting requires an additional input, a concept embedding file (the concepts of interest). The regions are the final detection output boxes (after 2nd-stage NMS). As a reference, the Bottom-Up features (widely used in vision-language tasks) also come from the final detection boxes.

Before running scripts, please prepare pretrained models and your custom images.
  • Check MODEL_ZOO.md to
    • download the pretrained RegionCLIP checkpoint regionclip_pretrained-cc_rn50.pth to the folder ./pretrained_ckpt/regionclip.
    • download the trained RPN checkpoint rpn_lvis_866.pth to the folder ./pretrained_ckpt/rpn.
    • (optional) if you want to extract features of the boxes detected for 1203 LVIS concepts, download the class embeddings lvis_1203_cls_emb.pth to the folder ./pretrained_ckpt/concept_emb.
  • Put all custom images in a folder. It can be specified in the script (check INPUT_DIR below).
After preparation, run the following script to extract region features.

The following script extracts features from RPN regions.

# RN50, features of RPN regions
python3 ./tools/extract_region_features.py \
--config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_zsinf.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
MODEL.CLIP.CROP_REGION_TYPE RPN \
MODEL.CLIP.MULTIPLY_RPN_SCORE True \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_lvis_866.pth \
INPUT_DIR ./datasets/custom_images \
OUTPUT_DIR ./output/region_feats \
MODEL.CLIP.OFFLINE_RPN_POST_NMS_TOPK_TEST 100 \

The following script extracts features from detection regions (after 2nd-stage NMS).

# You can simply run "bash extract_region_features.sh"
python3 ./tools/extract_region_features.py \
--config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_zsinf.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb.pth \
MODEL.CLIP.CROP_REGION_TYPE RPN \
MODEL.CLIP.MULTIPLY_RPN_SCORE True \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
MODEL.CLIP.BB_RPN_WEIGHTS ./pretrained_ckpt/rpn/rpn_lvis_866.pth \
INPUT_DIR ./datasets/custom_images \
OUTPUT_DIR ./output/region_feats \
TEST.DETECTIONS_PER_IMAGE 100 \

The region features of each image will be saved into a .pth file in the folder OUTPUT_DIR. For simplicity, the current script only supports single-GPU inference. As a reference, it takes roughly 0.76 seconds on a single Titan Xp GPU with RegionCLIP-ResNet50 and 1203 LVIS object concepts.
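As a quick sanity check, the sketch below loads the saved .pth files and prints their contents; it only assumes they are torch-serialized objects (the exact structure may vary, so inspect it before relying on specific keys).

import glob
import torch

for path in sorted(glob.glob("./output/region_feats/*.pth")):
    data = torch.load(path, map_location="cpu")
    if isinstance(data, dict):
        # Show each entry's shape (for tensors) or type, without assuming specific keys.
        print(path, {k: tuple(v.shape) if torch.is_tensor(v) else type(v).__name__ for k, v in data.items()})
    else:
        print(path, type(data).__name__)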

The following is a list of key arguments for feature extraction. You can specify them in the script as needed.

  • INPUT_DIR and OUTPUT_DIR: specify a folder of input images and an output folder where region features will be saved, respectively.

  • MODEL.CLIP.BB_RPN_WEIGHTS: specifies which trained RPN to use. You can replace it as needed. For more details, please check MODEL_ZOO.md.

  • MODEL.CLIP.TEXT_EMB_PATH (optional): specifies which object concept embedding file to use. The selection of concepts will affect the per-class NMS (2nd stage) and thus final output boxes.

  • TEST.DETECTIONS_PER_IMAGE: defines the number of final output regions (e.g., the default value is 100 in COCO configs and 300 in LVIS configs).

  • MODEL.CLIP.OFFLINE_RPN_POST_NMS_TOPK_TEST: defines the number of region proposals from RPN (e.g., default is 1000). Lowering this value can significantly reduce inference time and memory cost, but might affect the final detection quality.

  • MODEL.CLIP.OFFLINE_RPN_NMS_THRESH and MODEL.ROI_HEADS.NMS_THRESH_TEST: control the NMS IoU thresholds in RPN (1st stage, default is 0.9) and prediction head (2nd stage, default is 0.5), respectively. If you extract features using RPN regions, you might want to change MODEL.CLIP.OFFLINE_RPN_NMS_THRESH as needed.

Extract Concept Features

Along with the region feature extraction, we also provide scripts for extracting concept features from our pre-trained RegionCLIP. Given a list of concepts, our scripts extract textual embeddings and save them as local files. The following is an example using pretrained RegionCLIP. We extend the scripts from region feature extraction (section above) with minor changes.

Before running scripts, please prepare pretrained models and your custom concepts.
  • Check MODEL_ZOO.md to
    • download the pretrained RegionCLIP checkpoint regionclip_pretrained-cc_rn50.pth to the folder ./pretrained_ckpt/regionclip.
  • Put all concepts in the file concepts.txt, one concept name per line. Place this file in a folder that can be specified in the script (check INPUT_DIR below); a minimal example of preparing such a file is sketched right after this list.
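The sketch below writes a hypothetical concepts.txt; the concept names are illustrative, and the ./datasets/custom_concepts folder simply matches the INPUT_DIR used in the commands further down.

import os

# Hypothetical concept names; replace them with your own.
concepts = ["cat", "dog", "traffic light"]

os.makedirs("./datasets/custom_concepts", exist_ok=True)
with open("./datasets/custom_concepts/concepts.txt", "w") as f:
    f.write("\n".join(concepts) + "\n")   # one concept name per line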
After preparation, run the following script to extract concept features.

The following script extracts concept embeddings with the ResNet50 model.

# RN50 concept embeddings
python3 ./tools/extract_concept_features.py \
--config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_zsinf.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
INPUT_DIR ./datasets/custom_concepts \
OUTPUT_DIR ./output/concept_feats \
MODEL.CLIP.GET_CONCEPT_EMB True \

And for ResNet50x4, use the following command:

# RN50x4 concept embeddings
python3 ./tools/extract_concept_features.py \
--config-file ./configs/LVISv1-InstanceSegmentation/CLIP_fast_rcnn_R_50_C4_zsinf.yaml \
MODEL.WEIGHTS ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50x4.pth \
MODEL.CLIP.TEXT_EMB_DIM 640 \
MODEL.RESNETS.DEPTH 200 \
MODEL.CLIP.OFFLINE_RPN_CONFIG ./configs/LVISv1-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \
INPUT_DIR ./datasets/custom_concepts \
OUTPUT_DIR ./output/concept_feats \
MODEL.CLIP.GET_CONCEPT_EMB True \

The language embeddings of all concepts will be saved into a .pth file in the folder OUTPUT_DIR. These language embeddings have not been normalized, for consistency with the concept embeddings provided in MODEL_ZOO.md.
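If your downstream use expects unit-norm embeddings, the sketch below L2-normalizes the saved files; it assumes each .pth file holds a single [num_concepts, emb_dim] tensor, so double-check the actual contents first.

import glob
import torch
import torch.nn.functional as F

for path in glob.glob("./output/concept_feats/*.pth"):
    emb = torch.load(path, map_location="cpu")   # assumed shape: [num_concepts, emb_dim]
    emb = F.normalize(emb, dim=-1)               # unit L2 norm per concept
    torch.save(emb, path.replace(".pth", "_normed.pth"))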

The following is a list of key arguments for feature extraction. You can specify them in the script as needed.

  • INPUT_DIR and OUTPUT_DIR: specify a folder of input concepts and an output folder where the concept embeddings will be saved, respectively.

Citation and Acknowledgement

Citation

If you find this repo useful, please consider citing our paper:

@inproceedings{zhong2022regionclip,
  title={Regionclip: Region-based language-image pretraining},
  author={Zhong, Yiwu and Yang, Jianwei and Zhang, Pengchuan and Li, Chunyuan and Codella, Noel and Li, Liunian Harold and Zhou, Luowei and Dai, Xiyang and Yuan, Lu and Li, Yin and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={16793--16803},
  year={2022}
}

Acknowledgement

This repository was built on top of Detectron2, CLIP, OVR-CNN, and maskrcnn-benchmark. We thank the effort from our community.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


regionclip's Issues

Customized concepts embedding for ov detection

Hi, thanks for your great research! There is still something I'm confused about.

I want to use this work to do OV detection on my customized images. I followed the steps in the README, prepared concepts.txt, and turned it into a .pth file. After that, I simply changed the emb_path from
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb.pth
MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/lvis_1203_cls_emb.pth
into
MODEL.CLIP.TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/concept_embeds.pth
MODEL.CLIP.OPENSET_TEST_TEXT_EMB_PATH ./pretrained_ckpt/concept_emb/concept_embeds.pth
However, I got boxes with unexpected text.

For example, my concepts.txt includes the following objects: cat, panda, potato, bear, car, person, toy, sail
But the boxes are recognized as unexpected objects: almond, aerosol_cal, alligator, air_conditioner, alarm_clock, ambulance, alcohol, airplane

I saw some similar issues, and one of them mentioned that we need a json file to specify object class names. Is this true?

Question about transfer learning

Impressive work! I feel confused about transfer learning. What does "when human annotations of objects are available" mean? Labels or bounding boxes?
When we evaluate zero-shot inference, we set "MODEL.CLIP.CROP_REGION_TYPE GT". Why do we use GT before transfer learning? In which situations should we set "MODEL.CLIP.CROP_REGION_TYPE GT"?
Also, when we train detectors on our own via train_transfer_learning.sh, there is nothing about "human annotations"?

Thank you for your kind help.

extract region features

Hello, when running extract_region_features.sh, there is a strange bug in extract_region_features.py:
At line 113 of extract_region_features.py, predictions = model.roi_heads.box_predictor(att_feats) generates 100 prediction results. However, when line 114, pred_instances, keep_indices = model.roi_heads.box_predictor.inference(predictions, proposals) # apply per-class NMS, is executed, the total number of output results changes from 100 to 300. Instead of achieving the effect of NMS, many completely overlapping boxes are generated.

Partially generated box results: [screenshot]

Zero-shot Inference with custom categories

Hi, I wanted to ask if it's possible to use the code from the section "Visualization on custom images" to extract bbox and categories using a custom vocabulary. Should I just create my custom concept embeddings (./pretrained_ckpt/concept_emb)?

transfer_learning with custom datasets

Referring to the Transfer Learning section in the README.md: if I train on a custom dataset (for example, with 5 classes),
there is a shape mismatch in detectron2/modeling/roi_heads/fast_rcnn.py:

                if clip_cls_emb[1] is not None: # it could be None during region feature extraction
                    pre_computed_w = torch.load(clip_cls_emb[1])  # [num_classes, 1024] for RN50
                    self.cls_score.weight.copy_(pre_computed_w)

So, how do I generate a custom MODEL.CLIP.TEXT_EMB_PATH file with the pretrained models?

prepare pretrain datasets

Great work!
I've been trying to reproduce this work.
If I want to pretrain a model on the COCO Caption dataset, how do I prepare the COCO Caption dataset format? What are image_tsv_file and text_tsv_file? How do I convert the COCO Caption dataset into them?

MODEL.CLIP.NO_BOX_DELTA

Could I get an explanation of this setting, how it is used, and its effect on inference? How come it's turned on for visualization but not for the test/train/inference settings?

The training pipeline and training time of this work


  1. I understand that the paper uses the ResNet and text transformer of CLIP for initialization, then uses the region-text pairs extracted from CC3M for pre-training, and then splices the pre-trained model and the RPN together, with the RPN trained on the base classes, right?

  2. CC3M has 3 million images. How many GPUs were used for pre-training, and how much training time did it take? This is not mentioned in the paper.

Thanks.

Region classification on LVIS after fine-tuning on CC

Hi, thanks for your inspiring work! I am trying to reproduce the region classification results from your paper. Figure 1 (b) shows that the vanilla CLIP-R50 only achieves 19.1 accuracy on LVIS GT regions. I wonder what the accuracy of your fine-tuned RegionCLIP-R50 is on LVIS regions. Could you share some raw results for this part? Thanks!

query about the mask annotation in LVIS training

Hi, RegionCLIP is great work and I am interested in how it works.
I notice that you used mask annotations when training the detector, but I didn't find the reason why. Does it help improve the generalization of the detector? I didn't find any ablation study on the mask annotations.
I look forward to your help!
Thanks!

What is RN50*4?

Hi, nice work!
I am confused about the backbone RN50x4 used in the paper. Is it an RN50 network with a 4-stage FPN? And how does it differ from the RN50-FPN in ViLD?
Thank you for your kind help.

Configs for the Hugging Face demo please.

Great work!
I've been trying to reproduce the Hugging Face demo of RegionCLIP (given a text description, it returns a bounding box).
I tried the default config (given in extract_region_features.sh) and other combinations of pre-trained models, but still cannot reach the accuracy of the demo.
When I ran the app.py downloaded from Hugging Face, I found that the text and image cannot be fed into the model together.
So may I ask which config you used in the Hugging Face demo?

Question regarding zero-shot inference settings

Questions regarding the settings shown below for zero-shot inference. Could I get an explanation for each of them? I notice that these settings aren't included in test_transfer_learning.sh, but they are included in test_zeroshot_inference.sh and both visualize*.sh scripts, which doesn't make sense to me. When these settings are turned on, I also get lower AP scores when evaluating with test_transfer_learning.sh.

MODEL:
  ROI_HEADS:
    NMS_THRESH_TEST: 0.5
  CLIP:
    NO_BOX_DELTA: True
    OFFLINE_RPN_NMS_THRESH: 0.9

NaN Loss

Hello,

While training the detector on my own, the training was interrupted due to a NaN loss. The following is the traceback I got:

[07/12 01:10:03 d2.utils.events]: eta: 1:41:36 iter: 4299 total_loss: 0.1065 loss_cls: 0.02696 loss_box_reg: 0.07967 time: 1.2804 data_time: 0.0020 lr: 0.002 max_mem: 15826M
ERROR [07/12 01:10:18 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/ahmad/RegionCLIP/detectron2/engine/defaults.py", line 506, in run_step
self._trainer.run_step()
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 287, in run_step
self._write_metrics(loss_dict, data_time)
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 329, in _write_metrics
raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=4311!
loss_dict = {'loss_cls': nan, 'loss_box_reg': nan}
[07/12 01:10:18 d2.engine.hooks]: Overall training speed: 4309 iterations in 1:31:58 (1.2806 s / it)
[07/12 01:10:18 d2.engine.hooks]: Total training time: 1:37:23 (0:05:25 on hooks)
[07/12 01:10:18 d2.utils.events]: eta: 1:41:16 iter: 4311 total_loss: 0.1009 loss_cls: 0.02645 loss_box_reg: 0.07499 time: 1.2803 data_time: 0.0020 lr: 0.002 max_mem: 15826M
Traceback (most recent call last):
File "/home/ahmad/RegionCLIP/./tools/train_net.py", line 170, in
launch(
File "/home/ahmad/RegionCLIP/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "/home/ahmad/RegionCLIP/./tools/train_net.py", line 164, in main
return trainer.train()
File "/home/ahmad/RegionCLIP/detectron2/engine/defaults.py", line 496, in train
super().train(self.start_iter, self.max_iter)
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/ahmad/RegionCLIP/detectron2/engine/defaults.py", line 506, in run_step
self._trainer.run_step()
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 287, in run_step
self._write_metrics(loss_dict, data_time)
File "/home/ahmad/RegionCLIP/detectron2/engine/train_loop.py", line 329, in _write_metrics
raise FloatingPointError(
FloatingPointError: Loss became infinite or NaN at iteration=4311!
loss_dict = {'loss_cls': nan, 'loss_box_reg': nan}

New datasets

Hi,
Great paper and repository!
I was wondering how I can use the pre-trained model to fine-tune and evaluate on other datasets such as Pascal VOC?

Memory leak during ovod finetuning

Nice work!
I am working on the OVOD problem right now.
I am using your code to reproduce your OVOD fine-tuning results from the RegionCLIP pretrained weights. In order to get quicker feedback during testing, I modified TEST.EVAL_PERIOD to 5000 (the original is 25000), but I found that the whole training process is interrupted by an unexpected error around iteration 60k, which is after about 12 evaluations.

Traceback (most recent call last):
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 312, in reduce_storage
metadata = storage.share_filename()
RuntimeError: error executing torch_shm_manager at "/home/wulei04/miniconda3/envs/detic/lib/python3.8/site-packages/torch/bin/torch_shm_manager" at ../torch/lib/libshm/core.cpp:99
Traceback (most recent call last):
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/wulei04/miniconda3/envs/detic/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 312, in reduce_storage
metadata = storage.share_filename()
RuntimeError: error executing torch_shm_manager at "/home/wulei04/miniconda3/envs/detic/lib/python3.8/site-packages/torch/bin/torch_shm_manager" at ../torch/lib/libshm/core.cpp:99

I suspected that it was caused by a CPU memory leak, so I logged the free CPU memory every 5 seconds and plotted it; the plot shows that the free CPU memory continuously decreases.
I made only small changes to the training process, and no modifications to data loading or evaluation. I am wondering how to fix this, and which part may be the root cause of the memory leak?

Question about the teacher model in pre-train stage.

Dear authors, thanks for your excellent research and code! I still have some questions about the pre-training stage.

For the teacher model you mention in the pre-training stage, the backbone is the CLIP visual encoder and the RPN is pre-trained on the base classes; can it produce a reliable score with this backbone (CLIP weights) + RPN (other weights)?

I'm wondering about this because the final layer of the CLIP visual encoder is an attention pooling layer, and from my experiments I found that the attention pooling damages the localization ability of the ResNet model. How do you avoid this problem? Thanks! 🙏

Pre-training data and usage of RegionCLIP

@jwyang Hi, thanks for your great work and well organized repo! May I ask two questions?

  1. The pre-training dataset of RegionCLIP, as mentioned in the paper:

During pretraining, the default student model and teacher model were ResNet50 [22] from pretrained CLIP.

Does it mean that RegionCLIP actually uses CLIP400M + CC3M data? The text encoder of RegionCLIP is the same as CLIP's. I am also wondering whether the visual encoder is initialized with the CLIP pre-trained model or trained from scratch.

  2. The usage of the CC pre-trained RegionCLIP: are the regionclip_pretrained-cc_{rn50, rn50x4}.pth checkpoints in the same format as the original CLIP? If yes, it would be very easy for me to try your model on my task.

Thanks!

Setting for transfer learning

Hi, I have a question.
In train_transfer_learning.sh, I used RN50x4 and COCO. But I have 4 GPUs, so I changed num-gpus to 4 and changed IMS_PER_BATCH to 4. With these settings, the score reported in the paper does not come out. In this case, what should I change to get the score reported in the paper?

num_gpus setting for transfer learning

Hi yiwu, thanks for your great work and implementation. I have a question about transfer learning: in the train_transfer_learning.sh file, num_gpus is set to 1, so does that mean you used one GPU for training? It seems the batch size in the config is 16, which is a typical 8-GPU setting.

Colab Notebook

Hey, thanks for the impressive work. Would it be possible to add a Colab Tutorial Notebook using the pretrained models?

RPN for COCO OVD setting

Dear author,

Thanks for your great work. May I ask whether the RPN used to pre-train RegionCLIP is trained on the LVIS dataset for both the COCO and LVIS OVD settings? If so, is it fair for the COCO OVD setting? There are some overlaps between the base classes of LVIS and the novel classes of COCO, so some box annotations of novel classes would be used to train the RPN.

Add a linear probe

For some experimental reasons, I want to test the performance with a linear probe and step away from the text embeddings. The code is quite abstract, so I would like some help, or a pointer to where I should change the code to make the classification head work in a traditional way. I have a custom dataset and would like to do fine-tuning; in other words, I want to make it work as a traditional object detector.

Model output for "Train detectors on your own"

It seems that when training detectors on your own, the output model weights aren't customizable, nor is the prediction instances JSON file. The model output at ./output/model_final.pth doesn't match the pretrained models at e.g. ./pretrained_ckpt/regionclip/regionclip_pretrained-cc_rn50.pth.
Is this intended? If not, how do I load the fine-tuned model? Thanks!

Using your pretrained models for custom objects

Hi guys, great work (!), but I'm not sure how to try your pretrained models (and maybe cite you) for other tasks that require zero-shot detection on custom objects (and custom images).

First, I followed your installation instructions and they didn't work (you might want to check...). I eventually succeeded in installing it in Colab (avoiding the CUDA/torch incompatibilities), with the right detectron2.
Then I created concept embeddings for the objects I needed, but I didn't understand how to use them with your pretrained model. Can you explain/write the command?

Detic published a great, simple Colab notebook for trying out their model (I'm not related to them, just impressed). I'm sure that if you write a similar notebook, it will become much more attractive for others to use your code/models :-)
Thanks!

Incorrect concept embeddings

There seems to be a bug with generating concept embeddings in extract_concept_features.py.

The generated embeddings seem to correspond to the first letter of each concept, which can produce duplicates and isn't what we want. I propose the correction below.

    concept_feats = []
    with open(concept_file, 'r') as f:
        concepts = []
        for line in f:
            concept = line.strip()
            concepts.append(concept)
        with torch.no_grad():
            token_embeddings_concepts = pre_tokenize(concepts).to(model.device)
            for token_embeddings in token_embeddings_concepts:
                text_features = model.lang_encoder.encode_text(token_embeddings)
                # average over all templates
                text_features = text_features.mean(0, keepdim=True)
                concept_feats.append(text_features)

Originally posted by @Jawing in #1 (comment)

concepts.txt files for the RegionCLIP

Hi, thank you for the nice and inspiring work!
You provide pre-training code for RegionCLIP, but I cannot find the concepts.txt files.
Is there any plan to provide the concepts.txt files filtered from COCO Cap. or CC3M that were used in your work?
Thanks :)

pretraining process

Hello, your research is great! I am very interested in the code for the pretraining process. Is this part of the code open source?

Question about custom concept.txt

Hi yiwu, thank you for your interesting work. I have a question about a custom concepts.txt. What is the meaning of the numbers in coco_nouns_4764.txt, which is in the concept_emb folder?

UniCL model checkpoint and new zeroshot branch models for finetuning and evaluation

First issue: I've been trying to load the model checkpoint from UniCL swin_base, but none of the parameters from the model were loaded into the clip_swin.py model, and the same happens for the ones from klite.

Second issue: the new yaml files (specifically for zero-shot inference) resulted in errors when run with the test_zeroshot_inference scripts.

  File "/home/wanjiz/RegionCLIP/tools/train_net.py", line 172, in <module>
    launch(
  File "/home/wanjiz/RegionCLIP/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "/home/wanjiz/RegionCLIP/tools/train_net.py", line 146, in main
    res = Trainer.test(cfg, model)
  File "/home/wanjiz/RegionCLIP/detectron2/engine/defaults.py", line 627, in test
    results_i = inference_on_dataset(model, data_loader, evaluator, dataset_name)
  File "/home/wanjiz/RegionCLIP/detectron2/evaluation/evaluator.py", line 164, in inference_on_dataset
    outputs = model(inputs)
  File "/data/venv/regionclip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wanjiz/RegionCLIP/detectron2/modeling/meta_arch/clip_rcnn.py", line 235, in forward
    return self.inference(batched_inputs)
  File "/home/wanjiz/RegionCLIP/detectron2/modeling/meta_arch/clip_rcnn.py", line 370, in inference
    norm=self.backbone.norm if not self.backbone.output_is_normalized else None)
  File "/data/venv/regionclip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'SwinTransformer' object has no attribute 'output_is_normalized'

If possible, how would I alter the script to perform training/fine-tuning with these new models? There seem to be errors in clip_lang_encoder.py and fast_rcnn.py when I try training.

How to generate the concepts for Cap/CC3M?

Thanks for the great work!
In the paper, "Object concepts whose frequency are lower than 100 are discarded, leading to 4764/6790 concepts on COCO Cap/CC3M." The number of concepts I generated is quite different from the number of papers. Can you provide the generation script?

Microsoft Azure

Hello authors, and thank you for the amazing work you have done. Would the model be available via Microsoft Azure's computer vision models? Thanks

About RPN

Hi, thanks for your nice idea and the code of RegionCLIP. I am now using RegionCLIP on my own dataset (screenshots of cell phones), but the offline proposals are not good enough for detection due to the dataset distribution gap. I would like to know how the RPN was pretrained in your paper.

Visualization settings with MODEL.ROI_HEADS.SOFT_NMS_ENABLED

Why is MODEL.ROI_HEADS.SOFT_NMS_ENABLED True set in test_transfer_learning.sh but not in visualize_transfer_learning.sh? And when it is enabled in visualize_transfer_learning.sh, the generated images sometimes show accuracies greater than 100% for detected bounding-box classes. Why is that?

pretraining code for RegionCLIP

Hi, I have a question about the code for pretraining RegionCLIP.
You only provide pretrained checkpoints in this repo, and there seems to be no code or config for pretraining RegionCLIP.
Maybe I didn't find it, or are you planning to release it later?
Thanks!

Image Augmentation with bounding boxes

I'd like to understand how to add image augmentation with bounding-box coordinates, such as copy-paste and Albumentations bbox augmentations, to the training/fine-tuning setup.

It seems that transformations can be set in detectron2/data/dataset_mapper.py and via functions in the utils folder, like utils.transform_instance_annotations and utils.build_augmentation(cfg, is_train).
How does the current INPUT.RANDOM_FLIP transformation transform the bounding boxes of the original image?
I'm not sure how to properly add the recomputed_boxes for the copy-paste method.

Any tips?

Multi gpu training

I seem to be getting the error below when setting --num-gpus 2 in train_transfer_learning.sh with CUDA_VISIBLE_DEVICES=0,1

ERROR [07/12 12:43:41 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/wanjiz/RegionCLIP/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/wanjiz/RegionCLIP/detectron2/engine/defaults.py", line 506, in run_step
    self._trainer.run_step()
  File "/home/wanjiz/RegionCLIP/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 ...
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
[07/12 12:43:41 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
[07/12 12:43:41 d2.utils.events]:  iter: 1  total_loss: 3.286  loss_cls: 3.053  loss_box_reg: 0.233  data_time: 3.6441  lr: 2e-07  max_mem: 12165M
Traceback (most recent call last):
  File "/home/wanjiz/RegionCLIP/./tools/train_net.py", line 170, in <module>
    launch(
  File "/home/wanjiz/RegionCLIP/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/wanjiz/RegionCLIP/detectron2/engine/launch.py", line 125, in _distributed_worker
    main_func(*args)
  File "/home/wanjiz/RegionCLIP/tools/train_net.py", line 164, in main
    return trainer.train()
  File "/home/wanjiz/RegionCLIP/detectron2/engine/defaults.py", line 496, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/wanjiz/RegionCLIP/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/home/wanjiz/RegionCLIP/detectron2/engine/defaults.py", line 506, in run_step
    self._trainer.run_step()
  File "/home/wanjiz/RegionCLIP/detectron2/engine/train_loop.py", line 273, in run_step
    loss_dict = self.model(data)
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ahmad/regionclip/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 ...
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
