openai / clip
CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
License: MIT License
First of all thank you for providing the code and pretrained model. The results look stunning.
I was looking into the source code and there are two parts I am struggling to understand:
(1) What is the role of logit_scale in the model here? It seems to simply scale the dot products between two representations.
(2) In VisualTransformer, what is the role of class_embedding defined here?
Thanks!
I'm confused as to your training loss and setup.
For the setup, you say:
We remove the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image only self supervised representation learning methods.
Do you use the non-linear projection in pretraining, then remove it after training as in Chen, 2020b, replacing it with a linear projection after training? or do you use a linear projection in pretraining, and keep the linear projection after training?
And can you explain the speculation you talk about above? I don't understand what you mean there.
For the loss:
Can you clarify whether the loss used in training has the same form as the loss function in Zheng et al., 2021?
Given a sentence, is it possible to know which words receive more attention? In practice, I found that CLIP focused on several keywords in the sentence.
Thanks for sharing, OpenAI! I'd like to adapt the code to use in other downstream models, but I noticed you haven't defined the model anywhere. I can peek at the forward pass a bit with model.code
def forward(self,
    image: Tensor,
    input: Tensor) -> Tuple[Tensor, Tensor]:
  _0 = self.logit_scale
  _1 = self.text_projection
  _2 = self.ln_final
  _3 = self.transformer
  _4 = self.positional_embedding
  _5 = self.token_embedding
  _6 = self.visual
  input0 = torch.to(image, torch.device("cuda"), 5, False, False, None)
  _7 = (_6).forward1(input0, )
  x = torch.to((_5).forward1(input, ), torch.device("cuda"), 5, False, False, None)
  _8 = torch.to(_4, torch.device("cuda"), 5, False, False, None)
  x4 = torch.add(x, _8, alpha=1)
  x5 = torch.permute(x4, [1, 0, 2])
  x6 = torch.permute((_3).forward1(x5, ), [1, 0, 2])
  x7 = torch.to((_2).forward1(x6, ), torch.device("cuda"), 5, False, False, None)
  _9 = ops.prim.NumToTensor(torch.size(x7, 0))
  _10 = torch.arange(annotate(number, _9), dtype=None, layout=0, device=torch.device("cpu"), pin_memory=False)
  _11 = torch.argmax(input, -1, False)
  _12 = torch.to(_10, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=None, non_blocking=False, copy=False, memory_format=None)
  _13 = torch.to(_11, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=None, non_blocking=False, copy=False, memory_format=None)
  _14 = annotate(List[Optional[Tensor]], [_12, _13])
  input1 = torch.matmul(torch.index(x7, _14), _1)
  image_features = torch.div(_7, torch.frobenius_norm(_7, [-1], True))
  _15 = torch.frobenius_norm(input1, [-1], True)
  text_features = torch.div(input1, _15)
  logit_scale = torch.exp(_0)
  _16 = torch.mul(logit_scale, image_features)
  _17 = torch.matmul(_16, torch.t(text_features))
  _18 = torch.matmul(torch.mul(logit_scale, text_features), torch.t(image_features))
  return (_17, _18)
but it's hard to work with and adapt! Any chance that the model code itself will be released?
Now I realize that the released models and the model from the keras code examples are different, and are possibly trained differently. As far as I can see, as per this issue the openai CLIP model uses a target matrix of torch.eye(batch_size), while in the keras code examples:
To calculate the loss, we compute the pairwise dot-product similarity between each caption_i and images_j in the batch as the predictions. The target similarity between caption_i and image_j is computed as the average of the (dot-product similarity between caption_i and caption_j) and (the dot-product similarity between image_i and image_j). Then, we use crossentropy to compute the loss between the targets and the predictions.
So the target in case of the image and caption not being matched isn't a 0 but the average of the distances between the image embeddings and text embeddings. I realize these are 2 different approaches to train the model but do you think setting the target to be the average instead of just a zero might help convergence/accuracy?
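To make the two target choices concrete, here is a minimal pure-Python sketch on toy embeddings. It is not code from either project: the helper names and the toy vectors are made up for illustration, and passing the averaged similarities through a softmax so that each target row sums to 1 is an assumption.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hard_targets(n):
    """Identity targets: caption i matches image i and nothing else."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def soft_targets(captions, images):
    """Targets built from the average of caption-caption and image-image similarity.
    Assumption: each row is normalized with a softmax."""
    n = len(captions)
    rows = []
    for i in range(n):
        avg = [(dot(captions[i], captions[j]) + dot(images[i], images[j])) / 2
               for j in range(n)]
        rows.append(softmax(avg))
    return rows

def cross_entropy(targets, logits):
    """Mean over rows of -sum_j target[i][j] * log(softmax(logits[i])[j])."""
    total = 0.0
    for t_row, l_row in zip(targets, logits):
        probs = softmax(l_row)
        total -= sum(t * math.log(p) for t, p in zip(t_row, probs))
    return total / len(targets)

# Toy unit-norm embeddings, hypothetical values.
captions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
images = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
logits = [[dot(c, im) for im in images] for c in captions]

loss_hard = cross_entropy(hard_targets(3), logits)
loss_soft = cross_entropy(soft_targets(captions, images), logits)
print(loss_hard, loss_soft)
```

With hard targets only the diagonal entry of each row contributes to the loss; with soft targets every pair contributes in proportion to how similar the two captions and the two images already are.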
Thanks for the amazing work and open sourcing it!
In the paper it is mentioned:
The learnable temperature parameter was initialized to the equivalent of 0.07 from (Wu et al., 2018)
and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability.
However, I am not able to find logits clipping in model.py, in this section:
# cosine similarity as logits
logit_scale = self.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
logits_per_text = logit_scale * text_features @ image_features.t()
Is this a possible change to the code that might refer to the paper?:
# cosine similarity as logits
logit_scale = torch.clamp(self.logit_scale.exp(), max=100)
Though I am not sure how gradients would behave with torch.clamp, like an issue here.
Also, shouldn't the initial temperature be 0.07 according to the paper? In the following code 1=1/temperature, if I am not mistaken?
self.logit_scale = nn.Parameter(torch.ones([]))
Maybe to this?
self.logit_scale = nn.Parameter(torch.ones([]))*(1/0.07)
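A quick arithmetic check of the numbers in this question, as a sketch only. It assumes, as the quoted forward pass suggests, that the stored parameter is the logarithm of the scale (the code calls .exp() on it); the variable and function names are illustrative.

```python
import math

temperature = 0.07                # initial temperature quoted from the paper
scale = 1.0 / temperature         # multiplier applied to the cosine similarities
log_scale_init = math.log(scale)  # value to store if forward() exponentiates the parameter

max_scale = 100.0                 # upper bound on the scale quoted from the paper
log_scale_max = math.log(max_scale)

def clamped_scale(log_scale: float) -> float:
    """Exponentiate the stored parameter and cap the result at max_scale."""
    return min(math.exp(log_scale), max_scale)

print(round(scale, 2), round(log_scale_init, 3), round(log_scale_max, 3))
print(round(math.exp(1.0), 3))    # scale obtained when the parameter is initialized to 1
```

Under that assumption, a parameter initialized to 1 gives a scale of about 2.718 rather than 1/0.07, and storing 1/0.07 directly would give exp(14.29), far above the cap of 100. Note also that multiplying an nn.Parameter by a constant outside the constructor, as in the last quoted line, yields a plain tensor rather than a registered parameter, so the multiplication would need to happen inside the nn.Parameter(...) call.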
Dear CLIP Authors,
Recently, I am reading the training details in the paper:
To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings.
I have two questions:
thanks,
Jianwei
When I try to add my own image and description, I see this error. Does the input image need to be in a specific format or resolution?
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py in post_execute()
107 def post_execute():
108 if matplotlib.is_interactive():
--> 109 draw_all()
110
111 # IPython >= 2
13 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/colors.py in call(self, value, clip)
1015 result.fill(0) # Or should it be all masked? Or 0.5?
1016 elif vmin > vmax:
-> 1017 raise ValueError("minvalue must be less than or equal to maxvalue")
1018 else:
1019 if clip:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/IPython/core/formatters.py in call(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)
14 frames
/usr/local/lib/python3.6/dist-packages/matplotlib/colors.py in call(self, value, clip)
1015 result.fill(0) # Or should it be all masked? Or 0.5?
1016 elif vmin > vmax:
-> 1017 raise ValueError("minvalue must be less than or equal to maxvalue")
1018 else:
1019 if clip:
ValueError: minvalue must be less than or equal to maxvalue
torchvision model-zoo's image normalization is:
mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
CLIP's is:
mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]
what's the story behind the difference? Are CLIP's normalization parameters re-calculated on WebImageText?
Hi, I need to use a nightly version of PyTorch in order to get GPU support for my RTX 3080, and am facing an issue where,
model, preprocess = clip.load("ViT-B/32")
gives a RuntimeError:
RuntimeError Traceback (most recent call last)
in
----> 1 model, preprocess = clip.load("ViT-B/32")
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/clip/clip.py in load(name, device, jit)
129 node.copyAttributes(device_node)
130
--> 131 model.apply(patch_device)
132 patch_device(model.encode_image)
133 patch_device(model.encode_text)
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
471 """
472 for module in self.children():
--> 473 module.apply(fn)
474 fn(self)
475 return self
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/modules/module.py in apply(self, fn)
472 for module in self.children():
473 module.apply(fn)
--> 474 fn(self)
475 return self
476
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/clip/clip.py in patch_device(module)
120
121 def patch_device(module):
--> 122 graphs = [module.graph] if hasattr(module, "graph") else []
123 if hasattr(module, "forward1"):
124 graphs.append(module.forward1.graph)
~/miniconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/jit/_script.py in graph(self)
452 forward method. See :ref:interpreting-graphs for details.
453 """
--> 454 return self._c._get_method("forward").graph
455
456 @property
RuntimeError: Method 'forward' is not defined.
Is there any chance this could be made forward compatible with latest PyTorch releases?
Thanks for the release of pretrained model!
I was wondering if it is possible to show attention maps of input images using released ViT-32 model?
Hello, does the tokenizer support Chinese?
It appears that the code fails to utilise GPUs on my setup: 8 V100 GPUs | nvidia driver 450 | cuda 11.0.
While torch correctly identifies the GPU (the 0th one, but I can tweak this with CUDA_VISIBLE_DEVICES externally), when running nvidia-smi during calls to model.forward or encode_images I observe the GPU memory filling up but utilization stays at 0, while htop shows my CPU cores running. As a result I get the same timing results when device is "cpu" or "cuda", the only difference being that device = "cuda" will fail with a GPU memory error if called for more than 500-1000 images.
Is this expected, or do you have any idea how to troubleshoot this?
Hi,
There's something unclear to me about the code:
It seems quite clear that the image is projected into the multimodal space here with the VisualTransformer and that the text is projected there, but what about ModifiedResNet? It doesn't seem to project the visual features into the multimodal space. Did I miss something?
Bests,
According to your paper you use a large batch size of ~32k samples which means that the raw untrained network initially has a chance of ~1/32k of predicting the correct pair.
I am wondering, how the convergence/learning process would differ, if instead a binary classification problem was formulated and the network would be presented with matching text/image pairs and non-matching pairs alternatingly and be tasked with predicting whether those samples are actually in agreement or not.
In other words, what does the softmax over ~32k entries prior to cross entropy calculation bring to the table which cannot be achieved more conveniently by using sigmoid and binary cross entropy to predict matching/non-matching pairs. As a side effect this would also abolish the dependence on the batch size which seems to be rather crucial?
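To make the comparison concrete, here is a minimal sketch of the two objectives on a toy batch. It is not the training code: softmax_ce, binary_ce and the all-zero similarity matrix (standing in for an untrained network) are illustrative assumptions.

```python
import math

def softmax_ce(logits_row, target_index):
    """Cross entropy of one row against a single correct column."""
    m = max(logits_row)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits_row))
    return log_z - logits_row[target_index]

def binary_ce(logit, label):
    """Numerically stable sigmoid cross entropy for one (image, text) pair."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

n = 4                                  # toy batch size; the question is about ~32k
sim = [[0.0] * n for _ in range(n)]    # untrained model: every pair scores the same

# Batch-wise softmax objective: row i must select column i out of n candidates.
softmax_loss = sum(softmax_ce(sim[i], i) for i in range(n)) / n

# Pairwise binary objective: each of the n * n pairs is labelled match / no match.
binary_loss = sum(
    binary_ce(sim[i][j], 1.0 if i == j else 0.0)
    for i in range(n) for j in range(n)
) / (n * n)

print(softmax_loss, math.log(n))       # starts at log(n), i.e. a 1/n chance per row
print(binary_loss, math.log(2))        # starts at log(2) regardless of n
```

In this sketch the softmax loss starts at log(n), so it grows with the batch size, while the binary loss starts at log(2) for any n. Only n of the n * n pairs are positives in the binary version, so the two formulations also weight matches and non-matches differently.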
Thanks for your code.
In section 2.3 of the paper, it is mentioned that "A random square crop from resized images is the only data augmentation used during training".
What is the resized shape? Is the image first resized to 256x256 and then cropped to 224x224?
Thanks a lot!
this cell:
tokenizer = SimpleTokenizer()
text_tokens = [tokenizer.encode("This is " + desc + "<|endoftext|>") for desc in texts]
when you run it, generates the following error:
NameError Traceback (most recent call last)
in ()
----> 1 tokenizer = SimpleTokenizer()
2 text_tokens = [tokenizer.encode("This is " + desc + "<|endoftext|>") for desc in texts]
NameError: name 'SimpleTokenizer' is not defined
Like the title says: the Model Card features a feedback link at the bottom, but when I try to open the Google Form it says the form is private to the organization of the owner.
How would you like the work to be cited? Thank you so much and thank you for doing such wonderful work! It was a delight to read and use :))
Hi,
would it be possible to construct a prompt in such a way that the model recognizes that an image does not belong to any of the selected categories?
For example, my prompt has the classes 'dog' and 'cat' and I predict on images of cars. In this case, could it predict the category 'unknown' (or any other category)?
Hi,
I noticed that in the code and the colab notebooks you alternate between calling tokenizer.encode() and indexing into the tokenizer.encoder dict. What's the difference between the two? tokenizer.encode('the') returns [518] but tokenizer.encoder['the'] returns 599.
Thanks!
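One way to picture the difference, as a toy sketch rather than the real tokenizer: it assumes that encode() marks the end of each whole word with a "</w>" suffix before the vocabulary lookup, while the encoder dict is the raw vocabulary. The two-entry dict and encode_word are hypothetical; only the ids 518 and 599 come from the question.

```python
# Hypothetical two-entry slice of the vocabulary, using the ids quoted above.
encoder = {"the": 599, "the</w>": 518}

def encode_word(word: str) -> list:
    """Toy stand-in for encode(): append the end-of-word marker, then look up."""
    return [encoder[word + "</w>"]]

print(encode_word("the"))  # [518]
print(encoder["the"])      # 599
```

Under that assumption, the raw entry "the" would be the piece used when those letters appear inside a longer word, and "the</w>" the piece used for the standalone word.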
Hi!
Thanks for the code release, and for the great work!! I was wondering if it might be possible to release a non-jit version of the model. This might make it easier to convert to other platforms, e.g., tensorflow. What do you think?
Jack
Thanks for making CLIP publicly available, excellent work!
This is more of a feature request than a bug/issue: would it be possible to provide a simplified ImageNet usage example? E.g. for a standard PyTorch model from torchvision.models, the import is a two-liner and works out of the box for inference:
import torchvision.models as models
my_model = models.resnet50(pretrained=True)
Similarly, it would be really cool if using CLIP for ImageNet would be possible via something like:
from clip import imagenetmodel
my_model = imagenetmodel()
Dear OpenAI group,
Thank you for sharing with us this great work.
Is there a way to get the most relevant words for a given image? Similar to bag of words?
For example, given a face, it may output male/female, hair color, smiling or not smiling. I understand it is possible to construct sentences like 'a smiling face', but there are many words and many ways to combine them, so it is not easy to create a bank of sentences like this.
Thank you very much for your help.
Best Wishes,
Alex
Some model parameters are initialized with torch.empty() and will not work well when creating and training a model from scratch.
Thanks @LiJunnan1992 for catching this.
Hi there,
I'm trying to set up public datasets for evaluation listed in Table 9, but got different train/test size for some datasets:
This is what Table 9 shows:
| Dataset | Classes | Train size | Test size | Evaluation metric |
|---|---|---|---|---|
| Facial Emotion Recognition 2013 | 8 | 32,140 | 3,574 | accuracy |
| STL-10 | 10 | 1,000 | 8,000 | accuracy |
| EuroSAT | 10 | 10,000 | 5,000 | accuracy |
| RESISC45 | 45 | 3,150 | 25,200 | accuracy |
| GTSRB | 43 | 26,640 | 12,630 | accuracy |
It would be greatly appreciated if you could point me to the source of data split shown in Table 9.
When tokenizing a sentence that ends with punctuation, the BPE tokenizer fails to split the EOS token out. As a result, model.encode_text() cannot pick the correct token's hidden state from the output of the last layer, because the position is selected with argmax(id, dim=-1). I recommend adding a blank space before <|endoftext|> when creating a sentence.
To reproduce the bug:
from simple_tokenizer import SimpleTokenizer
# print the token in for-loop of https://github.com/openai/CLIP/blob/main/simple_tokenizer.py#L124
tokenizer = SimpleTokenizer(bpe_path=$path)
query = ["What will be Covid-19's long-term global economic impact?"]
text_tokens = [tokenizer.encode("The key is " + desc + "<|endoftext|>") for desc in query]
The tokens are:
the
key
is
what
will
be
covid
-
1
9
's
long
-
term
global
economic
impact
?<|
endoftext
|>
And the token ids are:
[[518, 1458, 533, 768, 751, 655, 622, 9284, 268, 272, 280, 568, 1538, 268, 4780, 2779, 5259, 3844, 30, 27, 347, 40786, 4160, 91, 285]]
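The effect on the argmax selection can be replayed on the ids listed above. This is a sketch with a hand-written argmax helper; the value 49407 for a correctly split <|endoftext|> token is an assumption not stated in the issue, and the short "well-formed" sequence is hypothetical.

```python
# Token ids copied from the report above.
token_ids = [518, 1458, 533, 768, 751, 655, 622, 9284, 268, 272, 280, 568, 1538,
             268, 4780, 2779, 5259, 3844, 30, 27, 347, 40786, 4160, 91, 285]

def argmax(values):
    """Index of the largest value (first occurrence)."""
    return max(range(len(values)), key=values.__getitem__)

picked = argmax(token_ids)
print(picked, token_ids[picked])   # 21 40786

EOT_ID = 49407                     # assumed id of a properly split <|endoftext|>
well_formed = [518, 1458, 533, EOT_ID]   # hypothetical "the key is" + EOT
print(argmax(well_formed))         # 3, the last position
```

On the reported ids the selection lands on position 21 of 25, which appears to fall inside the pieces produced by the unsplit marker rather than on a dedicated end-of-text token. The selection only works as intended when the end-of-text id is present and is the largest id in the sequence.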
Loading the trained models in pytorch 1.6 leads to errors like the following:
aten::_convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, int[] padding, int[] dilation, bool transposed, int[] output_padding, int groups, bool benchmark, bool deterministic, bool cudnn_enabled) -> (Tensor):
Expected at most 12 arguments but found 13 positional arguments.
Fortunately, it seems like these are largely due to signature changes in pytorch 1.7, and the model does not use pytorch 1.7 specific features. I have written a script to patch these models, in case it is useful for others: https://gist.github.com/achalddave/12e82c3c879589ee287e9c2769c489f0
It would be nice if there were a way to save the models in a way that they work with earlier versions of torch, but I'm not sure if this is possible. Just filing (and closing) this issue in case it is helpful for others who, like me, are working in an environment where upgrading to torch 1.7 is not possible.
This is a question related to the paper instead of this codebase. In paper section 2.2, it briefly describes how the data are gathered by "...we constructed a new dataset of 400 million (image, text) pairs collected form a variety of publicly available sources on the Internet."
I was wondering what are the publicly available sources (e.g. Google image search, Flickr image search, etc.)?
How can I fine-tune using model.py with my custom dataset and the pre-trained model.pt?
I am using a private dataset. However, I discover that all the poppies images are detected as poodles. I used the given checkpoint to test. I wonder if there is any mistake in labeling during training.
Details:
Model used: "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
Text used for poppies: 'an image of poppies, a type of flower'
Text used for poodles: 'an image of poodles, a type of dogs'
Hi, loving this research and have been trying out zero shot classification a lot. This is amazing work and thanks so much for releasing it in a way we can try it out.
I wanted to know if openai has any plans of releasing training code etc for us to play around with our own datasets.
Hi, this is a super great work!
Is there going to be any way to fine-tune this model on personal datasets?
Hello! I was wondering what is considered a "[MASK]" token in byte pair encoding / the tokenizer.py for CLIP / if there is a standard method for marking masked words?
Hi there,
I'm trying to reproduce evaluation scores in this paper, particularly table 10. A.3. Evaluation in Page 38 mentioned L2 regularization strength lambda is determined with a hyperparameter sweep.
(1) Only maximum 1,000 iterations is mentioned in L-BFGS. Do other parameters matter, like the learning rate?
(2) For parametric binary search, is the cost function monotonic to lambda?
Thank you!
Dear all,
the newly released RN50x4 model gives me an error, but my exact same code works OK with the RN101 or ViT-B/32 models.
What could be wrong?
perceptor, preprocess = clip.load('RN50x4', jit=True)#gives error in encode_image()
#perceptor, preprocess = clip.load('RN101', jit=True)#works OK with encode_image()
#perceptor, preprocess = clip.load('ViT-B/32', jit=True)# works OK with encode_image()
perceptor.encode_image(torch.zeros(1,3,224,224))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/multimodal/model/multimodal_transformer.py", line 19, in encode_image
_0 = self.visual
x = torch.to(image, torch.device("cuda:0"), 5, False, False, None)
return (_0).forward(x, )
~~~~~~~~~~~ <--- HERE
def encode_text(self: __torch__.multimodal.model.multimodal_transformer.Multimodal,
input: Tensor) -> Tensor:
File "code/__torch__/multimodal/model/modified_resnet.py", line 39, in forward
_16 = (_10).forward2((_6).forward((_7).forward(_15, ), ), )
_17 = (_3).forward((_4).forward((_5).forward(_16, ), ), )
_18 = (_0).forward((_1).forward((_2).forward(_17, ), ), )
~~~~~~~~~~~ <--- HERE
return _18
def forward1(self: __torch__.multimodal.model.modified_resnet.ModifiedResNet,
File "code/__torch__/multimodal/model/modified_resnet.py", line 143, in forward
_81 = torch.slice(torch.unsqueeze(_80, 1), 2, 0, 9223372036854775807, 1)
_82 = torch.to(_81, 5, False, False, None)
x1 = torch.add(x0, _82, alpha=1)
~~~~~~~~~ <--- HERE
in_proj_bias = torch.cat([_70, _69, _68], 0)
tgt_len = ops.prim.NumToTensor(torch.size(x1, 0))
Traceback of TorchScript, original code (most recent call last):
/root/multimodal-pytorch/multimodal/model/modified_resnet.py(76): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/multimodal-pytorch/multimodal/model/checkpointing.py(61): checkpoint
/root/multimodal-pytorch/multimodal/model/modified_resnet.py(154): forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(709): _slow_forward
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py(725): _call_impl
/root/multimodal-pytorch/multimodal/model/multimodal_transformer.py(221): visual_forward
/opt/conda/lib/python3.7/site-packages/torch/jit/_trace.py(940): trace_module
export_torchscript_models.py(37): export_torchscript_models
/opt/conda/lib/python3.7/site-packages/fire/core.py(672): _CallAndUpdateTrace
/opt/conda/lib/python3.7/site-packages/fire/core.py(468): _Fire
/opt/conda/lib/python3.7/site-packages/fire/core.py(138): Fire
export_torchscript_models.py(43): <module>
RuntimeError: The size of tensor a (50) must match the size of tensor b (82) at non-singleton dimension 0
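One possible reading of the two sizes in the final error, worked out under assumptions that the issue does not state: the ResNet trunk has a total stride of 32, and the failing addition involves a positional embedding with one entry per spatial position plus one extra token. The function name and the 288-pixel input are hypothetical; 288 was chosen only because it reproduces the 82 in the message.

```python
def num_positions(image_size: int, total_stride: int = 32) -> int:
    """Spatial positions of the final feature map, plus one extra token."""
    side = image_size // total_stride
    return side * side + 1

print(num_positions(224))  # 50
print(num_positions(288))  # 82
```

If those assumptions hold, the 224 x 224 tensor passed to encode_image yields 50 positions while the traced RN50x4 graph expects 82, which would be consistent with that model having been exported for a larger input resolution than RN101 and ViT-B/32.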
Hi!
Using PyTorch 1.7.1, I get NaN values after a single parameter update:
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.model, _ = clip.load('RN50')
    def forward(self, imgs, tokens):
        image_features = self.model.encode_image(imgs)
        match_text_features = self.model.encode_text(tokens)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        match_text_features = match_text_features / match_text_features.norm(dim=-1, keepdim=True)
        similarity_match = image_features @ match_text_features.T
        return similarity_match
def compute_loss(similarity_match, labels):
    loss1 = F.cross_entropy(similarity_match, labels)
    loss2 = F.cross_entropy(similarity_match.T, labels)
    loss = (loss1 + loss2) / 2
    return loss
model = Model().cuda()
optimizer = torch.optim.Adam(model.parameters())
imgs = torch.randn(8, 3, 224, 224).cuda()
tokens = torch.randint(high=1000, size=(8, 77)).cuda()
labels = torch.arange(8).cuda()
similarity_match = model(imgs, tokens)
loss = compute_loss(similarity_match, labels)
loss.backward()
optimizer.step()
print(model(imgs, tokens))
Output:
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan, nan, nan]], device='cuda:0',
dtype=torch.float16, grad_fn=<MmBackward>)
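One possible contributor, offered as a hypothesis rather than a diagnosis: the printed output is float16, so the optimizer may be updating half-precision parameters, and the default eps of torch.optim.Adam (1e-8) is too small to represent in float16. The sketch below only checks that rounding, using the standard library's half-precision format code; to_half is an illustrative helper.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (float16) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

adam_eps = 1e-8                    # default eps of torch.optim.Adam
eps_in_half = to_half(adam_eps)    # the same value after rounding to float16
print(eps_in_half)                 # 0.0

# If the second-moment estimate v is also zero, the denominator sqrt(v) + eps
# of the update is exactly zero in half precision.
denominator = to_half(0.0) ** 0.5 + eps_in_half
print(denominator == 0.0)          # True
```

If this is what happens here, an update of the form m / (sqrt(v) + eps) can divide by zero for parameters whose gradient is zero. Casting the model to float32 before creating the optimizer, or using a larger eps, would be ways to test the hypothesis.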
Hello,
I have noticed a discrepancy between the dtype of the features (both for images and texts) depending on the availability of a GPU.
Without a GPU: image_features.dtype returns torch.float32. device is set to cpu.
With a GPU, after %pip install git+https://github.com/openai/CLIP.git: image_features.dtype returns torch.float16. Package versions:
torch==1.7.1+cu101
torchvision==0.8.2+cu101
Q1: Is there a good reason why both versions do not return the same dtype? Is it due to AMP with GPU?
Moreover, if I wanted to store normalized features, float16 would allow me to cut the file size in half, so I would like to ensure that casting the float32 results (obtained with CPU only) to float16 would not actually lead to a loss of precision.
Q2: Would casting the results to float16 be totally safe? Or would it be safer to cast to float32 instead?
Finally, the discrepancy can be slightly confusing for people who would pre-compute features on a machine with GPU, and then use the pre-computed features along with features computed on the fly in a web app with CPU only. This is how I noticed the discrepancy when running this line:
logits = 100. * image_features @ zeroshot_weights
where image_features were computed on the fly (float32) and zeroshot_weights had been pre-computed (float16).
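As a rough way to gauge Q2, the sketch below rounds a few values to half precision and back using only the standard library. The feature values are made up, chosen to lie in the range that components of a normalized vector can take; to_half is an illustrative helper.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 half precision (float16) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical components of a normalized feature vector.
features = [0.5, -0.03125, 0.1234567, 0.0009765625]

relative_errors = [abs(to_half(v) - v) / abs(v) for v in features]
print(max(relative_errors))
print(2 ** -11)   # worst-case relative rounding error for normal float16 values
```

Casting float32 features to float16 therefore does lose precision, with a relative error of up to about 0.05% per component in this range. Whether that matters depends on how close the competing similarity scores are. Casting stored float16 features up to float32 before the matrix product is lossless and avoids mixing dtypes in one expression.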
Hi,
Thanks for these amazing results and for releasing the code and ViT-B/32 weights!
Do you plan to also release the 3 bigger models you mention in the paper ?
Hi, thanks for the great work!
I'm collecting YFCC-100M dataset following the paper ("After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos.").
Could you share the way to filter out the non-natural language texts of YFCC-100M?
Hi, CLIP authors,
Really great work! Appreciate much for releasing the code!
Recently, I am trying to evaluate the released two models (RN50 and ViT-B/32.) on imagenet validation set. What I can get with prompt engineering without ensemble are shown below:
ResNet-50 top-1: 55.09, top-5: 83.59
ViT-B/32 top-1: 59.06, top-5: 85.59
Not sure whether these numbers match those on your side. As a reference for us to do trial-and-errors, can you report the validation accuracies for these two models?
thanks,
Jianwei
Hello, thanks for releasing the model!
I am observing different outputs on the same input (only between the first run and the second one; the subsequent ones agree with the second one). The following code reproduces the problem.
import torch
print(f"Torch version: {torch.__version__}")
model = torch.jit.load("model.pt").cuda().eval()
torch.manual_seed(0)
x = torch.randn((1, 3, 224, 224), dtype=torch.float32).to("cuda")
with torch.no_grad():
    image_features_1 = model.encode_image(x).float()
    image_features_2 = model.encode_image(x).float()
    image_features_3 = model.encode_image(x).float()
print(torch.max(torch.abs(image_features_1 - image_features_2)))
print(torch.max(torch.abs(image_features_3 - image_features_2)))
The output:
Torch version: 1.7.1
tensor(0.0039, device='cuda:0')
tensor(0., device='cuda:0')
We btw checked the model buffers and parameters and they do not change in-between the calls.
Incredible work as always you guys! In looking at the Colab, it seems it's possible to do image-to-text similarity but I'm curious if it's possible to compare image similarity as well.
For instance, if I just replace 'text_features' with 'image_features' would that work / be the best way to do this?
image_features /= image_features.norm(dim=-1, keepdim=True)
image_2_features /= image_2_features.norm(dim=-1, keepdim=True)
similarity = image_2_features.cpu().numpy() @ image_features.cpu().numpy().T
Now, the feature dimension is 512 for both image and text. I'd like to know whether it is possible to do dimension reduction to speed up the inference time. After I do PCA on the image feature, is it possible to use clip features to do clustering?
Could you recommend a way to setup batch processing with a bunch of text tokens per class?
For example, I am setting up a task where I have 3 classes, and 2 text sentences per class. For instance, for classes=[daisy, daffodil, lavender], I have the following sentences:
The task is then given an image, which class best describes the image.
Right now, I am tokenizing each sentence in the classes. This gives me a batch of 3 x 2 x 77: 3 classes by 2 sentences by 77 context number.
Then for an image, I would iterate over the 2 sentences, doing logits_per_img, logits_per_text = model(image, batch of text[:, i, :]) where i is index of a sentence in the list of sentence descriptors for the class list.
After this, I take the softmax, then the maximum.
My question is: is there an efficient batch processing method I can use for this? And can I increase the context number for longer sentences?
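One common alternative to looping over sentences, sketched here in pure Python on toy vectors: embed every sentence once, average the normalized embeddings of each class into a single class vector, and compare the image against the class vectors in one pass. This swaps the softmax-then-maximum step described above for embedding averaging, and all names and numbers are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_embeddings(per_class_prompts):
    """per_class_prompts[c][p] is the embedding of sentence p for class c."""
    result = []
    for prompts in per_class_prompts:
        normed = [normalize(p) for p in prompts]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        result.append(normalize(mean))   # one unit vector per class
    return result

def classify(image_embedding, class_embs):
    """Return the best class index and the cosine score of every class."""
    img = normalize(image_embedding)
    scores = [sum(x * y for x, y in zip(img, c)) for c in class_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy text embeddings: 3 classes x 2 sentences x 3 dimensions.
text_embeddings = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],   # daisy
    [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],   # daffodil
    [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],   # lavender
]
class_embs = class_embeddings(text_embeddings)
best, scores = classify([0.1, 0.8, 0.1], class_embs)
print(best, scores)
```

With real features, the 3 x 2 x 77 token batch could be flattened to 6 x 77, encoded in a single call, and reshaped back to 3 x 2 x d before the averaging step, so the text encoder runs once per set of classes rather than once per image.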
Since many details about how you collected the datasets are missing, I would like to know whether it is possible to apply CLIP, pretrained on text-image pairs, to retrieve videos (given a text)?
When running the first usage example on a machine without GPU (on Heroku), I have encountered an error message.
This is the usage example:
import torch
import clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]]
This is the error message:
RuntimeError: Cannot call numpy() on Tensor that requires grad
I have followed the recommendation displayed along the error message, and changed this line from:
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
to:
probs = logits_per_image.softmax(dim=-1).cpu().detach().numpy()
I have no idea whether the issue is specific to my machine, but I might as well mention it here.
Thank you for your paper. It is very interesting. I especially like the additional sections on Broader Impact and Limitations as they are very detailed.
I have a question on the surveillance section 7.2: Could you expand on what you did in setting up the coarse and fine grained classification? I don't understand what the "close" text description is.
If I understood correctly, in section 7.2, you collect 515 CCTV images. You get groundtruth captions by hand captioning these images. Then giving the CLIP model a CCTV image, and 6 different input text descriptions, you predict the closest matching text to the input image. In addition, you sometimes include a "close" text description.
So, how is the "close" text description different from the given 6 options and the groundtruth? Does the "close" text contain an element of the groundtruth hence why the model keeps choosing the "close" text?