shiramir / dino-vit-features

Official implementation for the paper "Deep ViT Features as Dense Visual Descriptors".

Home Page: https://dino-vit-features.github.io

License: MIT License

Python 87.65% Jupyter Notebook 12.35%
deep-learning computer-vision vision-transformers co-segmentation part-segmentation semantic-correspondence dino pytorch

dino-vit-features's Introduction

dino-vit-features

[paper] [project page]

Official implementation of the paper "Deep ViT Features as Dense Visual Descriptors".

teaser

We demonstrate the effectiveness of deep features extracted from a self-supervised, pre-trained ViT model (DINO-ViT) as dense patch descriptors via real-world vision tasks: (a-b) co-segmentation & part co-segmentation: given a set of input images (e.g., 4 input images), we automatically co-segment semantically common foreground objects (e.g., animals), and then further partition them into common parts; (c-d) point correspondence: given a pair of input images, we automatically extract a sparse set of corresponding points. We tackle these tasks by applying only lightweight, simple methodologies such as clustering or binning, to deep ViT features.

Setup

Our code is developed in PyTorch and requires the following modules: tqdm, faiss, timm, matplotlib, pydensecrf, opencv, scikit-learn. We use python=3.9, but our code should run on any version above 3.6. We recommend running our code on a CUDA-supported GPU for faster performance. We recommend setting up the running environment via Anaconda by running the following commands:

$ conda env create -f env/dino-vit-feats-env.yml
$ conda activate dino-vit-feats-env

Otherwise, run the following commands in your conda environment:

$ conda install pytorch torchvision torchaudio cudatoolkit=11 -c pytorch
$ conda install tqdm
$ conda install -c conda-forge faiss
$ conda install -c conda-forge timm 
$ conda install matplotlib
$ pip install opencv-python
$ pip install git+https://github.com/lucasb-eyer/pydensecrf.git
$ conda install -c anaconda scikit-learn

ViT Extractor

Code Snippet

from extractor import ViTExtractor
extractor = ViTExtractor()
# imgs should be ImageNet-normalized tensors of shape BxCxHxW
descriptors = extractor.extract_descriptors(imgs) 

We provide a wrapper class for a ViT model to extract dense visual descriptors in extractor.py. You can extract descriptors to .pt files using the following command:

python extractor.py --image_path <image_path> --output_path <output_path>

You can specify the pretrained model using the --model flag with the following options:

  • dino_vits8, dino_vits16, dino_vitb8, dino_vitb16 from the DINO repo.
  • vit_small_patch8_224, vit_small_patch16_224, vit_base_patch8_224, vit_base_patch16_224 from the timm repo.

You can specify the stride of the patch-extracting layer to increase resolution using the --stride flag.
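For reference, here is a minimal Python sketch of using the extractor on a single image with a non-default model and stride. The model_type, stride, and device constructor arguments and the preprocessing shown are assumptions based on extractor.py and standard torchvision ImageNet normalization; check extractor.py for the exact signatures.

# Sketch only: extract dense descriptors for one image with a chosen model/stride.
import torch
from PIL import Image
from torchvision import transforms
from extractor import ViTExtractor

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Assumed constructor arguments; see extractor.py for the exact signature.
extractor = ViTExtractor(model_type='dino_vits8', stride=4, device=device)

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
img = preprocess(Image.open('image.png').convert('RGB')).unsqueeze(0).to(device)  # 1xCxHxW

with torch.no_grad():
    descriptors = extractor.extract_descriptors(img)  # dense per-patch descriptors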

Part Co-segmentation Open In Colab

We provide a notebook for running on a single example in part_cosegmentation.ipynb.

To run on several image sets, arrange each set in a directory, inside a data root directory:

<sets_root_name>
|
|_ <set1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|   
|_ <set2_name>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...

The following command will produce results in the specified <save_root_name>:

python part_cosegmentation.py --root_dir <sets_root_name> --save_dir <save_root_name>

Note: The default configuration in part_cosegmentation.ipynb is suited for running on small sets (e.g., fewer than 10 images). Increase num_crop_augmentations for more stable results, at the cost of longer runtime. The default configuration in part_cosegmentation.py is suited for larger sets (e.g., well over 10 images).

Co-segmentation Open In Colab

We provide a notebook for running on a single example in cosegmentation.ipynb.

To run on several image sets, arrange each set in a directory, inside a data root directory:

<sets_root_name>
|
|_ <set1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|   
|_ <set2_name>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...

The following command will produce results in the specified <save_root_name>:

python cosegmentation.py --root_dir <sets_root_name> --save_dir <save_root_name>
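Conceptually, the co-segmentation clusters deep ViT descriptors from all images together and keeps the clusters that fall on salient regions. The following is a highly simplified, illustrative sketch of that idea, not the actual implementation in cosegmentation.py; the function name, cluster count, and saliency threshold are made up for illustration.

# Naive sketch: cluster descriptors from all images jointly, keep "salient" clusters.
import numpy as np
from sklearn.cluster import KMeans

def naive_cosegmentation(descriptors, saliencies, num_clusters=10, thresh=0.5):
    """descriptors: list of (num_patches, dim) arrays; saliencies: list of (num_patches,) arrays."""
    stacked = np.concatenate(descriptors, axis=0)
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(stacked)
    saliency = np.concatenate(saliencies, axis=0)
    # A cluster is foreground if its patches are, on average, salient enough.
    fg_clusters = [c for c in range(num_clusters) if saliency[labels == c].mean() > thresh]
    masks, start = [], 0
    for d in descriptors:
        img_labels = labels[start:start + len(d)]
        masks.append(np.isin(img_labels, fg_clusters))  # per-patch foreground mask
        start += len(d)
    return masks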

Point Correspondences Open In Colab

We provide a notebook for running on a single example in correspondences.ipynb.

To run on several image pairs, arrange each image pair in a directory, inside a data root directory:

<pairs_root_name>
|
|_ <pair1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|   
|_ <pair2_name>
   |
   |_ img1.png
   |_ img2.png
...

The following command will produce results in the specified <save_root_name>:

python correspondences.py --root_dir <pairs_root_name> --save_dir <save_root_name>
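At its core, sparse correspondences can be obtained by matching patch descriptors between the two images, and a simple criterion is to keep only mutual nearest neighbours. The sketch below is illustrative only and is not the exact procedure in correspondences.py (which additionally restricts matches to salient regions and uses binned descriptors); the function name is hypothetical.

# Sketch: mutual (cyclic) nearest neighbours between two images' patch descriptors.
import torch

def mutual_nearest_neighbours(desc_a, desc_b):
    """desc_a: (Na, dim), desc_b: (Nb, dim) L2-normalised patch descriptors."""
    sim = desc_a @ desc_b.t()                  # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)                  # best match in B for each patch of A
    nn_ba = sim.argmax(dim=0)                  # best match in A for each patch of B
    idx_a = torch.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a             # keep pairs that agree in both directions
    return idx_a[mutual], nn_ab[mutual]        # matched patch indices in A and B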

Other Utilities

PCA

We provide code for computing the PCA of several images in a single directory:

python pca.py --root_dir <images_root_name> --save_dir <save_root_name>
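For intuition, here is a minimal sketch of the underlying idea (not the actual pca.py implementation; the function name is hypothetical): fit one PCA over descriptors pooled from all images and project each image's descriptors onto the leading components, which can then be reshaped to the patch grid for visualization.

# Sketch: joint PCA over per-patch descriptors from several images.
import numpy as np
from sklearn.decomposition import PCA

def joint_descriptor_pca(descriptor_list, n_components=3):
    """descriptor_list: list of (num_patches, dim) arrays, one per image."""
    stacked = np.concatenate(descriptor_list, axis=0)       # pool all patches
    pca = PCA(n_components=n_components).fit(stacked)
    # Per-image projections; reshape each to (H_patches, W_patches, n_components) to display.
    return [pca.transform(d) for d in descriptor_list]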

Similarity Inspection

We provide code for interactively visualizing the similarity between a chosen descriptor in the source image and all descriptors in a target image.

python inspect_similarity.py --image_a <path_to_image_a> --image_b <path_to_image_b>
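As a rough illustration of what is visualized (this is not the interactive tool itself, and the function name is hypothetical), the similarity map is essentially the cosine similarity between one chosen source descriptor and every descriptor of the target image:

# Sketch: cosine similarity between one source patch descriptor and all target descriptors.
import torch

def similarity_map(src_descriptors, tgt_descriptors, src_idx):
    """src/tgt_descriptors: (num_patches, dim) tensors; src_idx: index of the chosen patch."""
    src = torch.nn.functional.normalize(src_descriptors[src_idx:src_idx + 1], dim=-1)
    tgt = torch.nn.functional.normalize(tgt_descriptors, dim=-1)
    return (tgt @ src.t()).squeeze(-1)  # (num_patches,) cosine similarities in [-1, 1]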

Citation

If you found this repository useful, please consider starring ⭐ and citing:

@article{amir2021deep,
    author    = {Shir Amir and Yossi Gandelsman and Shai Bagon and Tali Dekel},
    title     = {Deep ViT Features as Dense Visual Descriptors},
    journal   = {arXiv preprint arXiv:2112.05814},
    year      = {2021}
}

dino-vit-features's People

Contributors

shiramir


dino-vit-features's Issues

PCK Evaluation

I am unable to replicate the results in Table 4 (correspondence evaluation on SPair-71k). Since keypoints are provided for the source image in PCK evaluation, I find the closest point (according to the binned descriptor) in the second image within the "salient region". The numbers I get are close to zero, so there might be a mistake in my code. Are there any additional heuristics that you apply for this one-way correspondence?

Use of the previous KMeans instance

I wonder whether using the previous KMeans instance at this point is intentional: part_algorithm was trained on normalized_all_fg_sampled_descriptors, as opposed to common_part_algorithm, which is trained on normalized_all_common_sampled_descriptors.

_, common_part_labels = part_algorithm.index.search(normalized_all_common_descriptors.astype(np.float32), 1)

Part co-segmentation comparison on CUB

Hello,
Is it possible to release the evaluation code for CUB that reproduces the results presented in the paper?

With the currently available implementation, I'm unfortunately not able to reproduce the results.
I get much worse results for the NMI and ARI.

best regards

DINO vs MAE

Hi,
Thanks for your amazing work; the study is very interesting. You use DINO as the feature extractor in your work, and I was wondering whether you tried MAE or a different method, and whether you got the same or similar results?
Thanks for your time,

parameter tuning for custom dataset

I found your method sensitive to the choice of parameters (thresh, elbow coefficient, etc.). Instead of tuning them manually and assessing the results qualitatively, is there a way to do a grid search and assess them quantitatively? For example, can I search on the training set, evaluate on the validation set, and use landmark-regression results to select the best parameters? If so, could you upload your evaluation scripts so that I can do it this way? Thank you.

Why is only dino_vits8 supported?

In the examples, if you change model_type to anything other than dino_vits8, the code crashes because of an assert in ViTExtractor.extract_saliency_maps. What needs to change to properly support other model types?

Extractor feature OOM

Hi there, I tried the code and at the beginning everything seemed fine, but then I tried to use the extractor on my own images; more specifically, extracting features from high-resolution pictures and visualizing the PCA image. I tried the code on 100 images of 800 x 800 with load_size=224 and stride=2, and it seems fine, so does the code process each image separately?

How should I estimate the required GPU memory, and could the extractor be modified to run on multiple GPUs?

Does the tint of the PCA image mean anything?

I performed self-supervised training of DINO on a pathology image dataset and analysed the PCA as well as the DINO-ViT features.

At the same time, I also analysed the PCA for a model trained with DINO self-supervised learning on ImageNet, and found that the colour images formed from the last 3 dimensions of the PCA were more colourful for the model trained on pathology images, whereas those for the ImageNet-trained model were closer to the three primary RGB colours and less colourful. What do you think these differences in the PCA represent?

[Image: PCA visualization, DINO pretrained on pathology data (dino_rgb)]

[Image: PCA visualization, DINO pretrained on ImageNet data (imnet_rgb)]

Indexing error when using high resolution saliency map

Hi,

When running cosegmentation.py and part_cosegmentation.py, turning low_res_saliency_maps off leads to an indexing error. It appears that saliency_map is batched with shape 1xN, so something like:

if not low_res_saliency_maps:
    saliency_map = saliency_map[0]

is a sufficient fix.

Traceback (most recent call last):
  File "cosegmentation.py", line 523, in <module>
    seg_masks, pil_images = find_cosegmentation(
  File "cosegmentation.py", line 257, in find_cosegmentation
    label_saliency = saliency_map[image_labels[:, 0] == label].mean()
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1 but corresponding boolean dimension is 1705

saliency maps

Hi, why are the saliency maps not available for patch size 16?

DINO v2

Is there information available for how to use DINO V2 for point correspondence?

CUDA out of memory

I have GPUs with 11 GB of memory, and I get an out-of-memory warning when I load more than three images
(when computing the attention of the ViT: attn = (q @ k.transpose(-2, -1)) * self.scale).

I think I can increase the stride or decrease the load size, but that will also degrade performance.

I found that the code only processes a single image at a time, so I would like to ask whether I can run the program across multiple GPUs.

Issues with the code

Hi!

Thank you for this great repo.

I had the following issue while trying to run the different notebooks:

  • TypeError: interpolate_pos_encoding() missing 1 required positional argument: 'h'

Thanks!

Supervised ViT Checkpoint

Hi, thanks for this great repo! Could you by chance point me in the direction of the "Supervised ViT" described in Figure 3?

PCK evaluation code

I am currently trying to replicate the results using the code from @kampta under #8. Using his notebook with the same parameters, I am getting PCK values that are around 6-10% lower than those reported in the paper.

I would appreciate if you could guide me on this or share the code for how you went about calculating the PCK.
