
classify_by_description_release's Introduction

Visual Classification via Description from Large Language Models

Sachit Menon, Carl Vondrick

ICLR 2023, Notable Top 5% (Oral)

[Paper] [arXiv]

Approach

[Figure: (a) standard CLIP classification against category-name embeddings vs. (b) classification by comparing to LLM-generated category descriptors]

The standard vision-and-language model compares image embeddings (white dot) to word embeddings of the category name (colorful dots) in order to perform classification, as illustrated in (a). We instead query large language models to automatically build descriptors, and perform recognition by comparing to the category descriptors, as shown in (b).
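For intuition, here is a minimal sketch of idea (b) using the OpenAI CLIP package. The descriptor strings, the "{class}, which {descriptor}" prompt format, and the image file name are illustrative placeholders, not the repository's exact pipeline.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical descriptors; in the paper these are generated by an LLM.
descriptors = {
    "hen": ["which has two legs", "which has red wattles on its head"],
    "tiger": ["which has black stripes on orange fur", "which has a long tail"],
}

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    class_scores = {}
    for cls, descs in descriptors.items():
        prompts = [f"{cls}, {d}" for d in descs]
        text_feat = model.encode_text(clip.tokenize(prompts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # Score each class by the mean image-descriptor similarity.
        class_scores[cls] = (image_feat @ text_feat.T).mean().item()

print(max(class_scores, key=class_scores.get))

Because each descriptor gets its own similarity score, the per-descriptor contributions can be inspected, which is what the explanation utilities below build on.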

Usage

First install the dependencies.

Either manually:

conda install pytorch torchvision -c pytorch
conda install matplotlib torchmetrics -c conda-forge
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/modestyachts/ImageNetV2_pytorch

Or with the provided .yml file:

conda env create -f classbydesc.yml

To reproduce the accuracy results from the paper, edit the dataset directories in load.py to match your local machine and set hparams['dataset'] accordingly. Then simply run python main.py.

All hyperparameters can be modified in load.py.
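For example (a hedged sketch only: hparams['dataset'] is the key named above, but the dataset value and the directory variable below are hypothetical placeholders, so check load.py for the real names):

# In load.py -- sketch; verify the actual variable names in the file.
hparams['dataset'] = 'imagenet'        # key mentioned above; the value string is a guess
IMAGENET_DIR = '/path/to/ImageNet/'    # hypothetical placeholder for your local data path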

To generate example decisions and explanations, and to contrast them with the CLIP decision, use the show_from_indices function in load.py after running main.py. Details forthcoming.

Example displaying the predictions that differ between baseline CLIP and our method:

show_from_indices(torch.where(descr_predictions != clip_predictions)[0],
                  images, labels,
                  descr_predictions, clip_predictions,
                  image_description_similarity=image_description_similarity,
                  image_labels_similarity=image_labels_similarity)

Example outputs: [figs]

Generating Your Own Descriptors

See generate_descriptors.py. If you have a list of classes, you can pass it to the obtain_descriptors_and_save function to save a json of descriptors. (You will need to add your OpenAI API token.)
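A hedged usage sketch follows; obtain_descriptors_and_save is named above, but its argument names and order are assumptions, so check its signature in generate_descriptors.py.

from generate_descriptors import obtain_descriptors_and_save

# Hypothetical class list and output name; set your OpenAI API token first.
class_list = ["tiger", "hen", "Black-footed Albatross"]
obtain_descriptors_and_save("my_descriptors", class_list)  # assumed to write my_descriptors.json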

Citation

@article{menon2022visual,
  author    = {Menon, Sachit and Vondrick, Carl},
  title     = {Visual Classification via Description from Large Language Models},
  journal   = {ICLR},
  year      = {2023},
}


classify_by_description_release's Issues

Text files

Where can I find the bounding_boxes.txt and train_test_split.txt files?

The code for other datasets

Great job! Could you please supply the data for the other datasets mentioned in your paper, besides CUB and ImageNet? I would like to evaluate performance on more datasets. Thank you.

Question about baseline on ImageNet

Hello! Thanks for the great work!
I ran this codebase off-the-shelf on ImageNet, and here is the accuracy I get, which doesn't match the baseline reported in the paper (58.46). Also, as noted in the original CLIP paper, ViT-B/32 has a zero-shot accuracy of 63.2 (openai/CLIP#24). Could you provide further instructions on how to obtain the baseline (58.46) reported in the paper? Thanks!

Total Description-based Top-1 Accuracy: 62.65624761581421
Total Description-based Top-5 Accuracy: 86.56250238418579
Total CLIP-Standard Top-1 Accuracy: 60.78125238418579
Total CLIP-Standard Top-5 Accuracy: 84.0624988079071

Why do the results of CLIP RN50 vary?

The results with CLIP RN50 vary when inputting the same samples (e.g., 48.91 vs. 48.88 on CUB). So far I haven't observed this with other architectures. What are the possible reasons?

Do the results require manual data cleaning?

Hello, I would like to ask whether the JSON files you provide were manually cleaned, because when I try GPT-3 it sometimes generates strange answers. For example:

What are useful features for distinguishing a tiger in a photo?

There are several features that can be used to distinguish a tiger in a photo:

1. Stripes: Tigers have unique stripe patterns on their fur, which can help to identify them. The stripes are usually black or dark brown on a light orange background.

2. Head shape: The shape of a tiger's head, including the size and placement of the ears, can be used to distinguish one individual from another.

3. Eye color: Tigers have yellow or amber-colored eyes, which can help to identify them in a photo.

4. Nose shape: The shape and size of a tiger's nose can be used to differentiate one individual from another.

5. Body size and shape: Tigers can vary in size and shape, so noting these characteristics can help to distinguish one tiger from another.

6. Behavior: A tiger's behavior in a photo, such as posture or movement, can also provide clues to its identity.

By looking at a combination of these features, it may be possible to distinguish one tiger from another in a photo.

Paper link broken

Thank you for sharing this great work!
It seems the paper link is broken.

Wrong CUB200 descriptors

Hello, thank you for this very interesting work.
It seems that the CUB200 descriptors are already pre-formatted as full sentences, unlike the descriptors for the other datasets.
For instance:
"Black-footed Albatross": [
"It is a seabird.",
"It is black and white.",
"It has a long, hooked bill.",
"It has long, narrow wings.",
"It has black feet.",
"It has a white head and neck."
]

This leads to inconsistent final descriptors such as 'Black-footed Albatross, which has It is a seabird..', with an awkward linking phrase between the class name and the descriptor, and a doubled final period.

Could you please provide the correct descriptors for this dataset?
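A possible workaround (not from the repository) is to normalize the CUB200 JSON before use, stripping the leading "It is"/"It has" and the trailing period so the entries match the bare-phrase format of the other datasets; the file names below are hypothetical.

import json
import re

with open("descriptors_cub.json") as f:          # hypothetical input file name
    descriptors = json.load(f)

def normalize(desc):
    # "It has a long, hooked bill." -> "a long, hooked bill"
    desc = desc.strip().rstrip(".")
    return re.sub(r"^It\s+(is|has)\s+", "", desc, flags=re.IGNORECASE)

descriptors = {cls: [normalize(d) for d in descs] for cls, descs in descriptors.items()}

with open("descriptors_cub_normalized.json", "w") as f:
    json.dump(descriptors, f, indent=4)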
