openai / clip


CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

License: MIT License


clip's Introduction

CLIP

[Blog] [Paper] [Model Card] [Colab]

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

Approach

Usage

First, install PyTorch 1.7.1 (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version on your machine or cpuonly when installing on a machine without a GPU.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

API

The CLIP module clip provides the following methods:

`clip.available_models()`

Returns the names of the available CLIP models.

`clip.load(name, device=..., jit=False)`

Returns the model and the TorchVision transform needed by the model, specified by the model name returned by clip.available_models(). It will download the model as necessary. The name argument can also be a path to a local checkpoint.

The device to run the model can be optionally specified, and the default is to use the first CUDA device if there is any, otherwise the CPU. When jit is False, a non-JIT version of the model will be loaded.

`clip.tokenize(text: Union[str, List[str]], context_length=77)`

Returns a LongTensor containing tokenized sequences of given text input(s). This can be used as the input to the model.

The model returned by clip.load() supports the following methods:

`model.encode_image(image: Tensor)`

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

`model.encode_text(text: Tensor)`

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

`model(image: Tensor, text: Tensor)`

Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.

More Examples

Zero-Shot Prediction

The code below performs zero-shot prediction using CLIP, as shown in Appendix B in the paper. This example takes an image from the CIFAR-100 dataset, and predicts the most likely labels among the 100 textual labels from the dataset.

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may be slightly different depending on the compute device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods that return the encoded features of given inputs.

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
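As a rough illustration of such a sweep, the sketch below holds out a validation split and picks the C with the best validation accuracy. The synthetic features, the grid of candidate values, and the 75/25 split are assumptions made for the example; in practice the features would come from get_features() above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for CLIP image features and labels (assumption: real
# features would come from get_features(train) in the example above).
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 32))
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

# Hold out part of the training data as a validation split.
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.25, random_state=0
)

# Hypothetical log-spaced grid of candidate C values.
candidate_Cs = np.logspace(-3, 3, 7)
val_accuracy = {}
for C in candidate_Cs:
    clf = LogisticRegression(random_state=0, C=C, max_iter=1000)
    clf.fit(X_train, y_train)
    val_accuracy[C] = clf.score(X_val, y_val)

# Keep the value that scores best on the held-out split.
best_C = max(val_accuracy, key=val_accuracy.get)
print(f"best C = {best_C:g}, validation accuracy = {val_accuracy[best_C]:.3f}")
```

Once a value is chosen, the classifier would typically be refit on the full training set and evaluated once on the test set.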


clip's Issues

public datasets for evaluation

Hi there,
I'm trying to set up public datasets for evaluation listed in Table 9, but got different train/test size for some datasets:

Facial Emotion Recognition 2013
Dataset I found on Kaggle has train dataset 28,709, Val(public test) 3,589, (Train+Val 32,298 in total) and Test (private test) 3,589.
STL-10
Tensorflow stl10 has training dataset with 5,000 images and testing dataset with 8,000.
EuroSAT
Tensorflow eurosat only has training dataset with 27,000 images.
RESISC45
The site Tensorflow refers to only has a training dataset, which is 31,500 images.
GTSRB
This archive I found has 2 training datasets (GTSRB_Final_Training_Images.zip and GTSRB-Training_fixed.zip), but both have size different from Table 9.

This is what Table 9 shows:

Dataset	Classes	Train size	Test size	Evaluation metric
Facial Emotion Recognition 2013	8	32,140	3,574	accuracy
STL-10	10	1000	8000	accuracy
EuroSAT	10	10,000	5,000	accuracy
RESISC45	45	3,150	25,200	accuracy
GTSRB	43	26,640	12,630	accuracy

It would be greatly appreciated if you could point me to the source of data split shown in Table 9.

Non-jit version of model?

Hi!

Thanks for the code release, and for the great work!! I was wondering if it might be possible to release a non-jit version of the model. This might make it easier to convert to other platforms, e.g., tensorflow. What do you think?

Jack

Is there an easy way to get the most relevant words?

Dear OpenAI group,

Thank you for sharing with us this great work.

Is there a way to get the most relevant words for a given image? Similar to bag of words?

For example, given a face, it may output male/female, color of hair, smiling or not smiling. I understand it is possible to construct sentences like 'a smiling face', but there are many words and many ways to combine them, so it is not easy to create a bank of sentences like this.

Thank you very much for your help.

Best Wishes,

Alex

std and mean for image normalization different from ImageNet

torchvision model-zoo's image normalization is:

mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]

CLIP's is:

mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]

what's the story behind the difference? Are CLIP's normalization parameters re-calculated on WebImageText?

Can't get GPU speedup

It appears that the code fails to utilise GPUs on my setup: 8 V100 GPUs | nvidia driver 450 | cuda 11.0.

While torch correctly identifies the GPU (the 0th one, but I can tweak this with CUDA_VISIBLE_DEVICES externally), when running nvidia-smi during calls to model.forward or encode_image I observe the GPU memory filling up but utilization stays at 0, while htop shows my CPU cores running. As a result I get the same timing results when device is "cpu" or "cuda", the only difference being that device = "cuda" will fail with a GPU memory error if called for more than 500-1000 images.

Is this expected, or do you have any idea how to troubleshoot this?

Hyperparameter sweep in Evaluation (linear probe)

Hi there,
I'm trying to reproduce the evaluation scores in this paper, particularly Table 10. Section A.3 (Evaluation) on page 38 mentions that the L2 regularization strength lambda is determined with a hyperparameter sweep.
(1) Only a maximum of 1,000 iterations is mentioned for L-BFGS. Do other parameters matter, like the learning rate?
(2) For the parametric binary search, is the cost function monotonic in lambda?

Thank you!

Computing target matrix differently

Now I realize that the released models and the model from the keras code examples are different, and are possibly trained differently. As far as I can see, as per this issue the openai CLIP model uses a target matrix of torch.eye(batch_size), while in the keras code examples:

To calculate the loss, we compute the pairwise dot-product similarity between each caption_i and images_j in the batch as the predictions. The target similarity between caption_i and image_j is computed as the average of the (dot-product similarity between caption_i and caption_j) and (the dot-product similarity between image_i and image_j). Then, we use crossentropy to compute the loss between the targets and the predictions.

So the target in case of the image and caption not being matched isn't a 0 but the average of the distances between the image embeddings and text embeddings. I realize these are 2 different approaches to train the model but do you think setting the target to be the average instead of just a zero might help convergence/accuracy?
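To make the two target choices concrete, here is a minimal sketch that computes both losses on random embeddings. The temperature value, the batch and embedding sizes, and the softmax used to turn the averaged similarities into a distribution are assumptions for illustration; this is not the training code of either project.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, dim = 4, 8

# Random unit-norm stand-ins for image and caption embeddings.
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

temperature = 0.07  # hypothetical value, used only for illustration
logits = text_emb @ image_emb.T / temperature  # rows: captions, columns: images

# (a) Hard targets: caption i matches image i only (identity matrix).
hard_targets = torch.eye(batch_size)

# (b) Soft targets: average of caption-caption and image-image similarity,
# turned into a distribution with softmax (an assumption of this sketch).
caption_sim = text_emb @ text_emb.T
image_sim = image_emb @ image_emb.T
soft_targets = F.softmax((caption_sim + image_sim) / (2 * temperature), dim=-1)

def soft_cross_entropy(logits, targets):
    # Cross-entropy against a full target distribution per row.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

loss_hard = soft_cross_entropy(logits, hard_targets)
loss_soft = soft_cross_entropy(logits, soft_targets)
print(loss_hard.item(), loss_soft.item())
```

With identity targets the soft cross-entropy reduces to the usual cross-entropy with labels 0..batch_size-1, so the two approaches differ only in how much probability mass the targets assign to non-matching pairs.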

Multimodal projection matrix with ModifiedResNet

Hi,

There's something unclear to me about the code:
It seems quite clear that the image is projected into the multimodal space here with the VisualTransformer and that the text is projected there, but what about ModifiedResNet? It doesn't seem to project the visual features into the multimodal space. Did I miss something?

Bests,

Feedback link in Model Card links to a private Google Form

Like the title says: the Model Card features a feedback link at the bottom, but when I try to open the Google Form it says the form is private to the organization of the owner.

RuntimeError: Cannot call numpy() on Tensor that requires grad

When running the first usage example on a machine without GPU (on Heroku), I have encountered an error message.

This is the usage example:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

This is the error message:

RuntimeError: Cannot call numpy() on Tensor that requires grad

I have followed the recommendation displayed along the error message, and changed this line from:

    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

to:

    probs = logits_per_image.softmax(dim=-1).cpu().detach().numpy()

I have no idea whether the issue is specific to my machine, but I might as well mention it here.
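For readers hitting the same message: it means the tensor is still attached to the autograd graph, and detach() returns a view without gradient history. The sketch below reproduces the behaviour with a hand-made tensor rather than CLIP's output, so it says nothing about why the tensor required gradients in the reported setup.

```python
import torch

# A tensor that tracks gradients, standing in for a model output produced
# outside of torch.no_grad().
logits = torch.tensor([[2.0, 0.5, 0.1]], requires_grad=True)
probs = logits.softmax(dim=-1)

try:
    probs.numpy()  # raises: the tensor is still attached to the autograd graph
    raised = False
except RuntimeError:
    raised = True

# detach() returns a tensor that shares data but carries no gradient history.
probs_np = probs.detach().cpu().numpy()
print(raised, probs_np)
```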

Image similarity?

Incredible work as always you guys! In looking at the Colab, it seems it's possible to do image-to-text similarity but I'm curious if it's possible to compare image similarity as well.

For instance, if I just replace 'text_features' with 'image_features' would that work / be the best way to do this?

image_features /= image_features.norm(dim=-1, keepdim=True)
image_2_features /= image_2_features.norm(dim=-1, keepdim=True)
similarity = image_2_features.cpu().numpy() @ image_features.cpu().numpy().T
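A self-contained version of the same idea, using random tensors in place of real model.encode_image() outputs (the 512-dimensional size is an assumption matching ViT-B/32), might look like this:

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the outputs of model.encode_image() on two batches
# (assumption: 512-dimensional features, as for ViT-B/32).
image_features = torch.randn(3, 512)
image_2_features = torch.randn(2, 512)

# L2-normalize so that the dot product equals cosine similarity.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
image_2_features = image_2_features / image_2_features.norm(dim=-1, keepdim=True)

# similarity[i, j] compares image i of the second batch with image j of the first.
similarity = image_2_features @ image_features.T

# Sanity check: every image has cosine similarity 1 with itself.
self_similarity = (image_features @ image_features.T).diagonal()
print(similarity.shape, self_similarity)
```

Whether cosine similarity between image embeddings is the best measure for a given task is a separate question that this sketch does not answer.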

Attention Map Generation

Thanks for the release of pretrained model!
I was wondering if it is possible to show attention maps of input images using the released ViT-B/32 model?

[MASK] token representation

Hello! I was wondering what is considered a "[MASK]" token in byte pair encoding / the tokenizer.py for CLIP / if there is a standard method for marking masked words?

simplified pre-trained ImageNet model import

Thanks for making CLIP publicly available, excellent work!

This is more of a feature request than a bug/issue: would it be possible to provide a simplified ImageNet usage example? E.g. for a standard PyTorch model from torchvision.models, the import is a two liner and works out of the box for inference:

import torchvision.models as models
my_model = models.resnet50(pretrained=True)

Similarly, it would be really cool if using CLIP for ImageNet would be possible via something like:

from clip import imagenetmodel
my_model = imagenetmodel()

Clarification of training details

Dear CLIP Authors,

I have recently been reading the training details in the paper:

To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings.

I have two questions:

(1) What does "half-precision stochastically rounded text encoder weights" mean? Are there any references?
(2) I do not quite understand "The calculation of embedding similarities was also sharded with individual GPUs". Given that it says "sharded" rather than "shared", does this mean that, for each sample, the similarities to negatives are computed only against samples on the same GPU? Or did you gather the features from the other GPUs and then compute the similarities between one sample and all the others?
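One possible reading of "sharded", which the source does not confirm, is that every GPU gathers all embeddings but computes only the rows of the similarity matrix that belong to its local batch. The single-process sketch below simulates this with hypothetical sizes and a plain concatenation standing in for an all-gather.

```python
import torch

torch.manual_seed(0)
world_size, local_batch, dim = 4, 2, 16  # hypothetical sizes

# Pretend each of the 4 workers produced its own local batch of embeddings.
image_shards = [torch.randn(local_batch, dim) for _ in range(world_size)]
text_shards = [torch.randn(local_batch, dim) for _ in range(world_size)]

# Stand-in for an all-gather: every worker sees all text embeddings.
all_text = torch.cat(text_shards)

# Each worker computes only the rows for its local images:
# a (local_batch x global_batch) block instead of the full square matrix.
local_blocks = [shard @ all_text.T for shard in image_shards]

# Stacking the blocks reproduces the full pairwise similarity matrix.
full = torch.cat(image_shards) @ all_text.T
assert torch.allclose(torch.cat(local_blocks), full, atol=1e-5)
print(local_blocks[0].shape, full.shape)
```

Under this reading each sample is still compared against negatives from every GPU; only the computation is split.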

thanks,
Jianwei

Error with custom image/title

When I try to add my image and description, I am seeing this error. Does the input image need to be in any specific format or resolution?

Text(0.5, 1.0, 'Cosine similarity between text and image features')
Error in callback <function install_repl_displayhook.<locals>.post_execute at 0x7fd3eb1d1730> (for post_execute):

ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py in post_execute()
107 def post_execute():
108 if matplotlib.is_interactive():
--> 109 draw_all()
110
111 # IPython >= 2

13 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/colors.py in call(self, value, clip)
1015 result.fill(0) # Or should it be all masked? Or 0.5?
1016 elif vmin > vmax:
-> 1017 raise ValueError("minvalue must be less than or equal to maxvalue")
1018 else:
1019 if clip:

ValueError: minvalue must be less than or equal to maxvalue

ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in call(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)

14 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/colors.py in call(self, value, clip)
1015 result.fill(0) # Or should it be all masked? Or 0.5?
1016 elif vmin > vmax:
-> 1017 raise ValueError("minvalue must be less than or equal to maxvalue")
1018 else:
1019 if clip:

ValueError: minvalue must be less than or equal to maxvalue

Does the tokenizer support Chinese?

Hello, does the tokenizer support Chinese?

Temperature Clipping Missing

Thanks for the amazing work and open sourcing it!

In the paper it is mentioned:

The learnable temperature parameter was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability.

However, I am not able to find logits clipping in model.py, in this section:

# cosine similarity as logits
logit_scale = self.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logit_scale * text_features @ image_features.t()

Is this a possible change to the code that might refer to the paper?:

# cosine similarity as logits
logit_scale = torch.clamp(self.logit_scale.exp(), max=100)

Though I am not sure how gradients would behave with torch.clamp, like an issue here.

Also, shouldn't initial temperature be 0.07 according to the paper? In the following code 1=1/temperature if I am not mistaken?

self.logit_scale = nn.Parameter(torch.ones([]))

Maybe to this?

self.logit_scale = nn.Parameter(torch.ones([]))*(1/0.07)
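Two details may help when experimenting with this. Because the forward pass applies exp() to the parameter, the parameter holds the log of the scale, so a temperature of 0.07 would correspond to an initial value of log(1/0.07). Also, multiplying outside nn.Parameter(...) yields a plain tensor rather than a registered parameter. The sketch below shows one way to initialize and clamp under those assumptions; it is not the authors' training code.

```python
import math
import torch
import torch.nn as nn

# The forward pass applies exp() to the parameter, so the parameter stores the
# log of the scale. Initializing to log(1 / 0.07) gives a temperature of 0.07.
logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))
initial_scale = logit_scale.exp().item()

# Option 1: clamp the scale inside the forward pass.
with torch.no_grad():
    logit_scale.fill_(10.0)  # pretend training pushed the parameter too high
clamped_scale = torch.clamp(logit_scale.exp(), max=100).item()

# Option 2: clamp the parameter itself after each optimizer step.
with torch.no_grad():
    logit_scale.clamp_(max=math.log(100))
scale_after_param_clamp = logit_scale.exp().item()

print(initial_scale, clamped_scale, scale_after_param_clamp)
```

Option 1 zeroes the gradient of the scale whenever the clamp is active, while option 2 keeps the forward pass unchanged and constrains the stored value instead.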

Poppies are detected as poodles

I am using a private dataset. However, I discovered that all the poppy images are detected as poodles. I used the given checkpoint to test. I wonder whether there was a labeling mistake during training.

Details:
Model used: "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"

Text used for poppies: 'an image of poppies, a type of flower'
Text used for poodles: 'an image of poodles, a type of dogs'

General Question: General setup with batch of text input

Could you recommend a way to set up batch processing with a bunch of text tokens per class?

For example, I am setting up a task where I have 3 classes, and 2 text sentences per class. For instance, for classes=[daisy, daffodil, lavender], I have the following sentences:

for class daisy: "pushing up daisy", "daisy"
for class daffodil: "planting daffodils", "daffodil"
for class lavender: "smells like lavender", "lavender is good for skin"

The task is then given an image, which class best describes the image.
Right now, I am tokenizing each sentence in the classes. This gives me a batch of 3 x 2 x 77: 3 classes by 2 sentences by 77 context number.
Then for an image, I would iterate over the 2 sentences, doing logits_per_img, logits_per_text = model(image, batch of text[:, i, :]) where i is index of a sentence in the list of sentence descriptors for the class list.

After this, I take the softmax, then the maximum.

My question is: is there an efficient batch processing method I can use for this? And can I increase the context number for longer sentences?
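One way this could be batched is to encode all sentences in a single call, average the embeddings within each class, and score the image against the class averages with one matrix product. The sketch below uses random tensors in place of model.encode_text() and model.encode_image() outputs, and assumes the sentences are ordered class by class; averaging is only one possible way to combine the sentences of a class. It does not address longer contexts.

```python
import torch

torch.manual_seed(0)
num_classes, prompts_per_class, dim = 3, 2, 512

# Stand-ins for model.encode_text() applied once to all 6 tokenized sentences
# (shape [6, 77] flattened from 3 x 2 x 77) and model.encode_image() on 1 image.
text_features = torch.randn(num_classes * prompts_per_class, dim)
image_features = torch.randn(1, dim)

text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)

# Group the sentence embeddings by class and average them, then re-normalize
# so each class is represented by a single unit vector.
class_features = text_features.view(num_classes, prompts_per_class, dim).mean(dim=1)
class_features = class_features / class_features.norm(dim=-1, keepdim=True)

# One matrix product scores the image against every class.
logits = 100.0 * image_features @ class_features.T
probs = logits.softmax(dim=-1)
predicted_class = probs.argmax(dim=-1).item()
print(probs, predicted_class)
```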

NaN values after a single gradient step

Hi!

Using PyTorch 1.7.1, I get NaN values after a single parameter update:

import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.model, _ = clip.load('RN50')

    def forward(self, imgs, tokens):
        image_features = self.model.encode_image(imgs)
        match_text_features = self.model.encode_text(tokens)

        image_features = image_features /  image_features.norm(dim=-1, keepdim=True)
        match_text_features = match_text_features / match_text_features.norm(dim=-1, keepdim=True)

        similarity_match = image_features @ match_text_features.T
        return similarity_match

def compute_loss(similarity_match, labels):
    loss1 = F.cross_entropy(similarity_match, labels)
    loss2 = F.cross_entropy(similarity_match.T, labels)
    loss = (loss1 + loss2) / 2
    return loss

model = Model().cuda()
optimizer = torch.optim.Adam(model.parameters())

imgs = torch.randn(8, 3, 224, 224).cuda()
tokens = torch.randint(high=1000, size=(8, 77)).cuda()
labels = torch.arange(8).cuda()

similarity_match = model(imgs, tokens)
loss = compute_loss(similarity_match, labels)
loss.backward()
optimizer.step()

print(model(imgs, tokens))

Output:

       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan, nan, nan, nan]], device='cuda:0',
      dtype=torch.float16, grad_fn=<MmBackward>)
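A plausible cause, not confirmed in this thread, is that the weights are in float16 on the GPU (as the dtype in the output suggests) while Adam's default epsilon is too small to be represented in half precision. The sketch below only demonstrates that numerical effect; the workaround in the final comment is an assumption.

```python
import torch

# Adam's default epsilon is 1e-8, which is below the smallest positive
# float16 value and therefore rounds to zero in half precision.
eps = 1e-8
eps_fp16 = torch.tensor(eps, dtype=torch.float16).item()
eps_fp32 = torch.tensor(eps, dtype=torch.float32).item()
print(eps_fp16, eps_fp32)

# With a zero epsilon, an Adam-style update of the form m / (sqrt(v) + eps)
# becomes 0 / 0 for any parameter whose gradient statistics are exactly zero.
m = torch.tensor(0.0)
v = torch.tensor(0.0)
update = m / (v.sqrt() + eps_fp16)
print(update)  # nan

# Possible workaround (assumption): train in float32, e.g. by calling
# model.float() before constructing the optimizer.
```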

BPE tokenizer failed with punctuations in the sentence

When tokenizing a sentence with a punctuation mark at the end, the BPE tokenizer fails to split the EOS token out. As a result, the model.encode_text() function can't pick the correct token's hidden state from the output of the last layer, because the picking method is argmax(id, dim=-1). I recommend adding a blank space before <|endoftext|> when creating a sentence.
To reproduce the bug:

from simple_tokenizer import SimpleTokenizer
# print the token in for-loop of https://github.com/openai/CLIP/blob/main/simple_tokenizer.py#L124
tokenizer = SimpleTokenizer(bpe_path=$path)
query = ["What will be Covid-19's long-term global economic impact?"]
text_tokens = [tokenizer.encode("The key is " + desc + "<|endoftext|>") for desc in query]

The tokens are:

the
key
is
what
will
be
covid
-
1
9
's
long
-
term
global
economic
impact
?<|
endoftext
|>

And the token ids are:

Image feature clustering

Now, the feature dimension is 512 for both image and text. I'd like to know whether it is possible to do dimension reduction to speed up the inference time. After I do PCA on the image feature, is it possible to use clip features to do clustering?
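As a rough sketch of what such a pipeline could look like, the example below applies PCA and then k-means to random vectors standing in for CLIP image features. The target dimension of 64 and the 5 clusters are arbitrary assumptions, and random data says nothing about how well real CLIP features cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random stand-ins for 200 CLIP image features of dimension 512
# (real features would come from model.encode_image()).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512)).astype(np.float32)
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Reduce to a hypothetical 64 dimensions, then cluster the reduced vectors.
reduced = PCA(n_components=64, random_state=0).fit_transform(features)
cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(reduced)

print(reduced.shape, np.bincount(cluster_ids))
```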

tokenizer.encode vs tokenizer.encoder

Hi,

I noticed that in the code and the colab notebooks you alternate between calling tokenizer.encode() and indexing into the tokenizer.encoder dict. What's the difference between the two? tokenizer.encode('the') returns [518] but tokenizer.encoder['the'] returns 599.
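A likely explanation, offered here as an assumption rather than a confirmed answer, is that encoder is the raw vocabulary dictionary, in which a word-final piece carries an end-of-word marker, while encode() runs the full BPE pipeline and therefore looks up the word-final form. The toy vocabulary below is hypothetical apart from the two ids quoted above.

```python
# Toy vocabulary: the "</w>" suffix marks a token that ends a word.
# Only the two ids quoted in the question are taken from the source.
encoder = {"the": 599, "the</w>": 518}

def encode_word(word):
    """Simplified stand-in for encode(): look up the word-final token."""
    return [encoder[word + "</w>"]]

whole_word_ids = encode_word("the")  # the complete word "the"
piece_id = encoder["the"]            # the piece "the" inside a longer word
print(whole_word_ids, piece_id)
```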

Thanks!

Do you plan to release code for training on custom datasets?

Hi, loving this research and have been trying out zero shot classification a lot. This is amazing work and thanks so much for releasing it in a way we can try it out.
I wanted to know if openai has any plans of releasing training code etc for us to play around with our own datasets.

Two questions regarding model implementation:

First of all thank you for providing the code and pretrained model. The results look stunning.

I was looking into the source code and there are two parts I am struggling to understand:

What is the purpose of logit_scale in the model here? It seems to simply scale the dot products between two representations.
In the implementation of VisualTransformer what is the role of class_embedding defined here.
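On the first question, a small numerical sketch may help show why scaling matters: cosine similarities live in a narrow range, so an unscaled softmax over them is close to uniform. The similarity values below are made up, and 100 is the factor mentioned in the API section.

```python
import torch

# Cosine similarities between one image and three texts (made-up values).
cosine = torch.tensor([0.30, 0.22, 0.20])

# Without scaling, the softmax is nearly uniform because the inputs only
# span a narrow range inside [-1, 1].
probs_unscaled = cosine.softmax(dim=-1)

# With a scale of 100, the same similarities yield a confident distribution.
probs_scaled = (100.0 * cosine).softmax(dim=-1)

print(probs_unscaled, probs_scaled)
```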

Thanks!

Plans to release the model code?

Thanks for sharing, OpenAI! I'd like to adapt the code to use in other downstream models, but I noticed you haven't defined it anywhere. I can peek at the forward pass a bit with model.code

def forward(self,
    image: Tensor,
    input: Tensor) -> Tuple[Tensor, Tensor]:
  _0 = self.logit_scale
  _1 = self.text_projection
  _2 = self.ln_final
  _3 = self.transformer
  _4 = self.positional_embedding
  _5 = self.token_embedding
  _6 = self.visual
  input0 = torch.to(image, torch.device("cuda"), 5, False, False, None)
  _7 = (_6).forward1(input0, )
  x = torch.to((_5).forward1(input, ), torch.device("cuda"), 5, False, False, None)
  _8 = torch.to(_4, torch.device("cuda"), 5, False, False, None)
  x4 = torch.add(x, _8, alpha=1)
  x5 = torch.permute(x4, [1, 0, 2])
  x6 = torch.permute((_3).forward1(x5, ), [1, 0, 2])
  x7 = torch.to((_2).forward1(x6, ), torch.device("cuda"), 5, False, False, None)
  _9 = ops.prim.NumToTensor(torch.size(x7, 0))
  _10 = torch.arange(annotate(number, _9), dtype=None, layout=0, device=torch.device("cpu"), pin_memory=False)
  _11 = torch.argmax(input, -1, False)
  _12 = torch.to(_10, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=None, non_blocking=False, copy=False, memory_format=None)
  _13 = torch.to(_11, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=None, non_blocking=False, copy=False, memory_format=None)
  _14 = annotate(List[Optional[Tensor]], [_12, _13])
  input1 = torch.matmul(torch.index(x7, _14), _1)
  image_features = torch.div(_7, torch.frobenius_norm(_7, [-1], True))
  _15 = torch.frobenius_norm(input1, [-1], True)
  text_features = torch.div(input1, _15)
  logit_scale = torch.exp(_0)
  _16 = torch.mul(logit_scale, image_features)
  _17 = torch.matmul(_16, torch.t(text_features))
  _18 = torch.matmul(torch.mul(logit_scale, text_features), torch.t(image_features))
  return (_17, _18)

but it's hard to work with and adapt! Any chance that the model code itself will be released?

How is the dataset collected?

This is a question related to the paper rather than this codebase. Section 2.2 of the paper briefly describes how the data were gathered: "...we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet."

I was wondering what are the publicly available sources (e.g. Google image search, Flickr image search, etc.)?

YFCC-100M text filtering

Hi, thanks for the great work!
I'm collecting YFCC-100M dataset following the paper ("After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos.").
Could you share the way to filter out the non-natural language texts of YFCC-100M?


Paper: Surveillance section

Thank you for your paper. It is very interesting. I especially like the additional sections on Broader Impact and Limitations as they are very detailed.

I have a question on the surveillance section 7.2: Could you expand on what you did in setting up the coarse and fine grained classification? I don't understand what the "close" text description is.

If I understood correctly, in section 7.2, you collect 515 CCTV images. You get groundtruth captions by hand captioning these images. Then giving the CLIP model a CCTV image, and 6 different input text descriptions, you predict the closest matching text to the input image. In addition, you sometimes include a "close" text description.

So, how is the "close" text description different from the given 6 options and the groundtruth? Does the "close" text contain an element of the groundtruth hence why the model keeps choosing the "close" text?

code to fine-tune

How can I fine-tune using model.py with my custom dataset and the pre-trained model.pt?

Bigger models release?

Hi,
Thanks for these amazing results and for releasing the code and ViT-B/32 weights!
Do you plan to also release the 3 bigger models you mention in the paper?

Features: float16 if GPU available, float32 if CPU only

Hello,

I have noticed a discrepancy between the dtype of the features (both for images and texts) depending on the availability of a GPU.

if I run the code with CPU only on Colab, image_features.dtype returns torch.float32.
This happens if I do not install the package properly, and then do not notice that device is set to cpu.

%pip install git+https://github.com/openai/CLIP.git

if I run the code with GPU on Colab, image_features.dtype returns torch.float16.
This happens if I follow the installation process properly and install proper versions of PyTorch (1.7.1+cu101) for Colab:

torch==1.7.1+cu101
torchvision==0.8.2+cu101

Q1: Is there a good reason why both versions do not return the same dtype? Is it due to AMP with GPU?

Moreover, if I wanted to store normalized features, float16 would allow me to cut in half the file size, so I would like to ensure that casting the float32 results (obtained with CPU only) to float16 would not actually lead to a loss of precision.

Q2: Would casting the results to float16 be totally safe? Or would it be safer to cast to float32 instead?

Finally, the discrepancy can be slightly confusing for people who would pre-compute features on a machine with GPU, and then use the pre-computed features along with features computed on the fly in a web app with CPU only. This is how I noticed the discrepancy when running this line:

logits = 100. * image_features @ zeroshot_weights

where image_features were computed on the fly (float32) and zeroshot_weights had been pre-computed (float16).
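A minimal sketch of one way to handle the mismatch is to cast both operands to a common dtype before the product, and to measure the round-trip error when considering float16 storage. The tensors below are random stand-ins, and the number of classes (10) is hypothetical.

```python
import torch

torch.manual_seed(0)

# Stand-ins: features computed on CPU (float32) and weights that were
# pre-computed on GPU and stored as float16.
image_features = torch.randn(2, 512)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
zeroshot_weights = torch.randn(512, 10).half()

# Cast both operands to one dtype before the matrix product.
logits = 100.0 * image_features.float() @ zeroshot_weights.float()

# Round-trip through float16 to see how much precision storage would cost.
roundtrip_error = (image_features - image_features.half().float()).abs().max().item()
print(logits.dtype, roundtrip_error)
```

For unit-norm features every component lies in [-1, 1], so the float16 rounding error per component stays small, though it is not zero; whether that is acceptable depends on the application.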

attention on the text

Given a sentence, is it possible to know which words receive more attention? In practice, I found that CLIP focused on several keywords in the sentence.

Citation

How would you like the work to be cited? Thank you so much and thank you for doing such wonderful work! It was a delight to read and use :))

RuntimeError: Method 'forward' is not defined.

Hi, I need to use a nightly version of PyTorch in order to get GPU support for my RTX 3080, and am facing an issue where,

model, preprocess = clip.load("ViT-B/32") gives a RuntimeError:

RuntimeError Traceback (most recent call last)
in
----> 1 model, preprocess = clip.load("ViT-B/32")

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/clip/clip.py in load(name, device, jit)
129 node.copyAttributes(device_node)
130
--> 131 model.apply(patch_device)
132 patch_device(model.encode_image)
133 patch_device(model.encode_text)

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
472 for module in self.children():
473 module.apply(fn)
--> 474 fn(self)
475 return self
476

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/clip/clip.py in patch_device(module)
120
121 def patch_device(module):
--> 122 graphs = [module.graph] if hasattr(module, "graph") else []
123 if hasattr(module, "forward1"):
124 graphs.append(module.forward1.graph)

~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/jit/_script.py in graph(self)
452 forward method. See :ref:interpreting-graphs for details.
453 """
--> 454 return self._c._get_method("forward").graph
455
456 @property

RuntimeError: Method 'forward' is not defined.

Is there any chance this could be made forward compatible with latest PyTorch releases?

Load trained models on torch versions below 1.7

Loading the trained models in pytorch 1.6 leads to errors like the following:

aten::_convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> (Tensor):
Expected at most 12 arguments but found 13 positional arguments.

Fortunately, it seems like these are largely due to signature changes in pytorch 1.7, and the model does not use pytorch 1.7 specific features. I have written a script to patch these models, in case it is useful for others: https://gist.github.com/achalddave/12e82c3c879589ee287e9c2769c489f0

It would be nice if there were a way to save the models in a way that they work with earlier versions of torch, but I'm not sure if this is possible. Just filing (and closing) this issue in case it is helpful for others who, like me, are working in an environment where upgrading to torch 1.7 is not possible.

Will there be code release for fine tuning or training?

Hi, this is great work!
Will there be any way to fine-tune this model on personal datasets?

Question about image augmentation

Thanks for your code.

In Section 2.3 of the paper, it is mentioned that:

A random square crop from resized images is the only data augmentation used during training

What is the resized shape? Is the image first resized to 256x256 and then cropped to 224x224?

Thanks a lot!
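To make the question concrete, here is a minimal sketch of how a random square crop box could be chosen from an already resized image. The sizes are hypothetical: 224 is the crop size mentioned in the question, and the resized dimensions are only a guess, since the paper excerpt above does not state them. The helper name `random_square_crop_box` is made up for this illustration.

```python
import random


def random_square_crop_box(width, height, crop_size, rng=random):
    """Return (left, top, right, bottom) of a random square crop.

    The box always lies fully inside a width x height image.
    """
    if crop_size > min(width, height):
        raise ValueError("crop_size must fit inside the resized image")
    left = rng.randint(0, width - crop_size)  # randint is inclusive on both ends
    top = rng.randint(0, height - crop_size)
    return left, top, left + crop_size, top + crop_size


# Hypothetical resized image of 256 x 341 pixels (shorter side 256).
rng = random.Random(0)
box = random_square_crop_box(256, 341, 224, rng)
print(box)
```

The returned tuple follows the (left, upper, right, lower) convention used by `PIL.Image.crop`, so it could be passed to that method directly.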

problem using RN50x4 - tensor sizes

Dear all,

the newly released RN50x4 model gives me an error, but my exact same code works fine with the RN101 or ViT-B/32 models.
What could be wrong?

import torch
import clip

perceptor, preprocess = clip.load('RN50x4', jit=True)  # gives error in encode_image()
# perceptor, preprocess = clip.load('RN101', jit=True)  # works OK with encode_image()
# perceptor, preprocess = clip.load('ViT-B/32', jit=True)  # works OK with encode_image()

perceptor.encode_image(torch.zeros(1, 3, 224, 224))

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/multimodal/model/multimodal_transformer.py", line 19, in encode_image
    _0 = self.visual
    x = torch.to(image, torch.device("cuda:0"), 5, False, False, None)
    return (_0).forward(x, )
            ~~~~~~~~~~~ <--- HERE
  def encode_text(self: __torch__.multimodal.model.multimodal_transformer.Multimodal,
    input: Tensor) -> Tensor:
  File "code/__torch__/multimodal/model/modified_resnet.py", line 39, in forward
    _16 = (_10).forward2((_6).forward((_7).forward(_15, ), ), )
    _17 = (_3).forward((_4).forward((_5).forward(_16, ), ), )
    _18 = (_0).forward((_1).forward((_2).forward(_17, ), ), )
           ~~~~~~~~~~~ <--- HERE
    return _18
  def forward1(self: __torch__.multimodal.model.modified_resnet.ModifiedResNet,
  File "code/__torch__/multimodal/model/modified_resnet.py", line 143, in forward
    _81 = torch.slice(torch.unsqueeze(_80, 1), 2, 0, 9223372036854775807, 1)
    _82 = torch.to(_81, 5, False, False, None)
    x1 = torch.add(x0, _82, alpha=1)
         ~~~~~~~~~ <--- HERE
    in_proj_bias = torch.cat([_70, _69, _68], 0)
    tgt_len = ops.prim.NumToTensor(torch.size(x1, 0))

Traceback of TorchScript, original code (most recent call last):
/root/multimodal-pytorch/multimodal/model/modified_resnet.py(76): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/multimodal-pytorch/multimodal/model/checkpointing.py(61): checkpoint
/root/multimodal-pytorch/multimodal/model/modified_resnet.py(154): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/multimodal-pytorch/multimodal/model/multimodal_transformer.py(221): visual_forward
/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py(940): trace_module
export_torchscript_models.py(37): export_torchscript_models
/opt/conda/lib/python3.7/site-packages/fire/core.py(672): _CallAndUpdateTrace
/opt/conda/lib/python3.7/site-packages/fire/core.py(468): _Fire
/opt/conda/lib/python3.7/site-packages/fire/core.py(138): Fire
export_torchscript_models.py(43): <module>
RuntimeError: The size of tensor a (50) must match the size of tensor b (82) at non-singleton dimension 0

Fix empty parameters at initialization

Some model parameters are initialized with torch.empty() and will not work well when creating and training a model from scratch.

Thanks @LiJunnan1992 for catching this.

colab is broken ?

this cell:

tokenizer = SimpleTokenizer()
text_tokens = [tokenizer.encode("This is " + desc + "<|endoftext|>") for desc in texts]

when you run it, generates the following error:

NameError Traceback (most recent call last)
in ()
----> 1 tokenizer = SimpleTokenizer()
2 text_tokens = [tokenizer.encode("This is " + desc + "<|endoftext|>") for desc in texts]

NameError: name 'SimpleTokenizer' is not defined

define unknown category?

Hi,
would it be possible to construct a prompt in such a way that the model recognizes when an image does not belong to any of the selected categories?
For example, if my prompt has the classes 'dog' and 'cat' and I predict on images of cars, could it predict a category 'unknown' (or any other fallback category)?
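One common heuristic (not something confirmed by the CLIP authors) is to reject a prediction when even the best-matching class has a low raw similarity to the image. The sketch below assumes the cosine similarities between the image and each class prompt have already been computed; the function name `predict_or_unknown`, the threshold, and the example scores are all hypothetical and would need tuning on real data.

```python
def predict_or_unknown(similarities, labels, min_similarity=0.25):
    """Return the best label, or "unknown" if no class is similar enough.

    similarities: one cosine similarity per label (image vs. class prompt).
    min_similarity: hypothetical cut-off; tune it on held-out data.
    """
    best_index = max(range(len(labels)), key=lambda i: similarities[i])
    if similarities[best_index] < min_similarity:
        return "unknown"
    return labels[best_index]


labels = ["dog", "cat"]

# Made-up scores: the first image resembles a dog, the second neither class.
print(predict_or_unknown([0.29, 0.21], labels))  # prints: dog
print(predict_or_unknown([0.18, 0.17], labels))  # prints: unknown
```

The threshold is applied to raw similarities rather than softmax probabilities, because a softmax over only 'dog' and 'cat' always sums to one and can look confident even for an unrelated image.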

Can we use `torch.nn.DataParallel` to speed up encode_image?

It seems we cannot use encode_image in DataParallel

Output differs on same input

Hello, thanks for releasing the model!

I am observing different outputs on the same input (only between the first run and the second one; subsequent runs agree with the second one). The following code reproduces the problem.

import torch

print(f"Torch version: {torch.__version__}")

model = torch.jit.load("model.pt").cuda().eval()
torch.manual_seed(0)
x = torch.randn((1, 3, 224, 224), dtype=torch.float32).to("cuda")

with torch.no_grad():
    image_features_1 = model.encode_image(x).float()
    image_features_2 = model.encode_image(x).float()
    image_features_3 = model.encode_image(x).float()

print(torch.max(torch.abs(image_features_1 - image_features_2)))
print(torch.max(torch.abs(image_features_3 - image_features_2)))

The output:

Torch version: 1.7.1
tensor(0.0039, device='cuda:0')
tensor(0., device='cuda:0')

We also checked the model buffers and parameters, and they do not change between the calls.

About the ImageNet zero-shot performance with the released models

Hi, CLIP authors,

Really great work! Thank you very much for releasing the code!

Recently, I have been trying to evaluate the two released models (RN50 and ViT-B/32) on the ImageNet validation set. The results I get with prompt engineering, without ensembling, are shown below:

ResNet-50 top-1: 55.09, top-5: 83.59
ViT-B/32 top-1: 59.06, top-5: 85.59

I am not sure whether these numbers match those on your side. As a reference for our own experiments, could you report the validation accuracies for these two models?

thanks,
Jianwei

Clarifying the training setup

I'm confused as to your training loss and setup.

For the setup, you say:

We remove the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image only self supervised representation learning methods.

Do you use the non-linear projection in pretraining and then replace it with a linear projection after training, as in Chen et al. (2020b)? Or do you use a linear projection in pretraining and keep it after training?
Also, can you explain the speculation mentioned above? I don't understand what you mean there.

For the loss:
Can you clarify whether the loss used in training has the same form as the loss function in Zheng et al., 2021?

Can CLIP apply to video?

Since many details about how the datasets were collected are missing, I would like to know whether CLIP, pretrained on text-image pairs, can be used to retrieve videos given a text query.
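One simple baseline for this (an assumption for illustration, not a method described by the CLIP authors) is to embed a few sampled frames per video, average the frame embeddings, and rank videos by cosine similarity to the text embedding. The sketch below uses tiny made-up 3-dimensional vectors in place of real CLIP embeddings; the helper names and video names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def mean_pool(frame_embeddings):
    """Average a list of equally sized frame embeddings component-wise."""
    count = len(frame_embeddings)
    return [sum(column) / count for column in zip(*frame_embeddings)]


def rank_videos(text_embedding, videos):
    """Return video names sorted from most to least similar to the text."""
    query = l2_normalize(text_embedding)
    scores = {}
    for name, frame_embeddings in videos.items():
        normalized_frames = [l2_normalize(f) for f in frame_embeddings]
        video_embedding = l2_normalize(mean_pool(normalized_frames))
        scores[name] = sum(q * v for q, v in zip(query, video_embedding))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up frame embeddings; real ones would come from model.encode_image.
videos = {
    "clip_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "clip_b": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
ranking = rank_videos([1.0, 0.0, 0.0], videos)
print(ranking)  # prints: ['clip_a', 'clip_b']
```

Mean pooling ignores the order of frames, so this baseline cannot distinguish videos that differ only in temporal structure.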

Influence of batch size on training convergence

According to your paper you use a large batch size of ~32k samples which means that the raw untrained network initially has a chance of ~1/32k of predicting the correct pair.

I am wondering how the convergence and learning process would differ if a binary classification problem were formulated instead: the network would be presented alternately with matching and non-matching text/image pairs and tasked with predicting whether each pair is actually in agreement.

In other words, what does the softmax over ~32k entries prior to the cross-entropy calculation bring to the table that cannot be achieved more conveniently by using a sigmoid and binary cross entropy to predict matching/non-matching pairs? As a side effect, this would also remove the dependence on the batch size, which seems to be rather crucial.
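The two formulations being compared can be sketched in plain Python on a small N x N logit matrix, where entry (i, j) is the similarity of image i and text j and the diagonal holds the matching pairs. This is a toy illustration under stated assumptions: the function names are made up, the softmax version follows the symmetric cross-entropy idea discussed above, and the sigmoid version is one possible reading of the binary alternative, not a loss taken from the paper.

```python
import math


def softmax_cross_entropy(row, target):
    """Cross entropy of one row of logits against the index of the true class."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return log_sum - row[target]


def symmetric_contrastive_loss(logits):
    """Average of image-to-text and text-to-image softmax cross entropy."""
    n = len(logits)
    image_loss = sum(softmax_cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_loss = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_loss + text_loss) / 2


def pairwise_sigmoid_loss(logits):
    """Binary cross entropy over every (image, text) pair, matched or not."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sign = 1.0 if i == j else -1.0
            # -log(sigmoid(sign * logit)) == log(1 + exp(-sign * logit))
            total += math.log1p(math.exp(-sign * logits[i][j]))
    return total / (n * n)


# An "untrained" model: every pair gets the same logit.
n = 4
untrained = [[0.0] * n for _ in range(n)]
print(symmetric_contrastive_loss(untrained), math.log(n))
print(pairwise_sigmoid_loss(untrained), math.log(2))
```

With identical logits the softmax loss equals log(N), so it grows with the batch size, while the sigmoid loss equals log(2) regardless of N. This matches the ~1/32k starting chance described above and makes the batch-size dependence of the softmax formulation explicit.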

clip's Issues

Text(0.5, 1.0, 'Cosine similarity between text and image features')

Error in callback <function install_repl_displayhook..post_execute at 0x7fd3eb1d1730> (for post_execute):

ValueError: minvalue must be less than or equal to maxvalue
