kaiyangzhou / coop
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
License: MIT License
In main.sh, I think these should be .COOP instead of .OURS:
TRAINER.OURS.N_CTX ${NCTX}
TRAINER.OURS.CSC ${CSC}
TRAINER.OURS.CLASS_TOKEN_POSITION ${CTP}
Greetings!
I can only get 46.81% accuracy and 47.05% per-class accuracy after running bash zeroshot.sh stanford_cars rn50. However, the reported accuracy on the StanfordCars dataset is ~55%. What's wrong?
Hi, congratulations on your wonderful work!
Could you please provide the raw data you used in Figure 3 of your paper? My email is [email protected]
Many thanks!
Hello, the page links for split_zhou_Caltech101.json, split_zhou_OxfordFlowers.json, and split_zhou_DescribableTextures.json can't be opened. Is there any other download link?
base base2new_train.sh imagenet 1
base base2new_test.sh imagenet 1
base base2new_train.sh imagenet 2
base base2new_test.sh imagenet 2
base base2new_train.sh imagenet 3
base base2new_test.sh imagenet 3
Is the problem that "bash" is misspelled as "base"?
Thank you for your contribution.
I found that training is slower with multiple GPUs (e.g., 8 GPUs) than with a single GPU. Do you know why that is and how to speed up the training process?
I have tried to change the optimizer attributes to use an Adam optimizer with different LR scheduling and Adam-specific parameters, but when the code runs, the LR scheduler parameters and the betas are overwritten.
The config file:
DATALOADER:
  TRAIN_X:
    BATCH_SIZE: 8
  TEST:
    BATCH_SIZE: 100
  NUM_WORKERS: 4
INPUT:
  SIZE: (224, 224)
  INTERPOLATION: "bicubic"
  PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
  PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
  TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]
OPTIM:
  NAME: "adam"
  LR: 0.0002
  ADAM_BETA1: 0.5
  ADAM_BETA2: 0.999
  MAX_EPOCH: 100
  LR_SCHEDULER: "single_step"
  GAMMA: 0.1
  STEPSIZE: 0
  WARMUP_EPOCH: 0
  WARMUP_TYPE: "constant"
  WARMUP_CONS_LR: 1e-5
TRAIN:
  PRINT_FREQ: 20
MODEL:
  BACKBONE:
    NAME: "videoclip"
TRAINER:
  COCOOP:
    N_CTX: 4
    CTX_INIT: ''
    PREC: 'amp'
The log output once run:
OPTIM:
  ADAM_BETA1: 0.9
  ADAM_BETA2: 0.999
  BASE_LR_MULT: 0.1
  GAMMA: 0.1
  LR: 0.0003
  LR_SCHEDULER: single_step
  MAX_EPOCH: 10
  MOMENTUM: 0.9
  NAME: adam
  NEW_LAYERS: ()
  RMSPROP_ALPHA: 0.99
  SGD_DAMPNING: 0
  SGD_NESTEROV: False
  STAGED_LR: False
  STEPSIZE: (-1,)
  WARMUP_CONS_LR: 1e-05
  WARMUP_EPOCH: -1
  WARMUP_MIN_LR: 1e-05
  WARMUP_RECOUNT: True
  WARMUP_TYPE: linear
  WEIGHT_DECAY: 0.0005
Is this a problem with DASSL or a problem with the CoCoOp code base?
Thanks!
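To narrow down a report like the one above, it can help to list exactly which requested keys did not make it into the logged configuration. The following is only a minimal sketch: the two dictionaries are copied by hand from the config file and log shown in the post, and diff_config is a hypothetical helper, not part of Dassl or CoOp.

```python
# Values copied by hand from the config file and the logged OPTIM section above.
requested = {
    "NAME": "adam",
    "LR": 0.0002,
    "ADAM_BETA1": 0.5,
    "ADAM_BETA2": 0.999,
    "MAX_EPOCH": 100,
    "LR_SCHEDULER": "single_step",
    "WARMUP_TYPE": "constant",
}
logged = {
    "NAME": "adam",
    "LR": 0.0003,
    "ADAM_BETA1": 0.9,
    "ADAM_BETA2": 0.999,
    "MAX_EPOCH": 10,
    "LR_SCHEDULER": "single_step",
    "WARMUP_TYPE": "linear",
}


def diff_config(requested, logged):
    """Hypothetical helper: return {key: (requested, logged)} for differing values."""
    return {
        key: (value, logged.get(key))
        for key, value in requested.items()
        if logged.get(key) != value
    }


mismatches = diff_config(requested, logged)
for key, (want, got) in sorted(mismatches.items()):
    print(f"OPTIM.{key}: requested {want!r}, logged {got!r}")
```

If every key that differs from the library default shows up in this list, the config file was probably not merged at all, which points at how the file is passed to the training script rather than at the optimizer code.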
Do you plan to release the pretrained weights for CoCoOp anytime soon?
Thanks!
Can you provide the few-shot results for the different few-shot settings on the 11 datasets, using CLIP with the ViT-B image backbone?
I tried the settings in the paper, but some results cannot be reproduced (food, for instance).
Hi Kaiyang,
Thanks for your amazing work!
I see that F.cross_entropy(output, label) is used in your training. I wonder if it is possible to replace it with the similarity between text and image, by adding the corresponding class name embedding to the learned prompt, and then do one-shot prediction like CLIP. Is it possible to do so?
Thanks!
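Does this need to be trained from scratch? Is it not possible, for example, to load trained weights and predict the text or image embeddings, as with the original CLIP?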
Thank you for your work. I encountered this kind of error when running the Imagenet dataset. Have you encountered any similar errors? How did you solve it?
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    main(args)
  File "train.py", line 142, in main
    trainer = build_trainer(cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/engine/build.py", line 11, in build_trainer
    return TRAINER_REGISTRY.get(cfg.TRAINER.NAME)(cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/engine/trainer.py", line 319, in __init__
    self.build_data_loader()
  File "/home/dpsh/Dassl.pytorch/dassl/engine/trainer.py", line 342, in build_data_loader
    dm = DataManager(self.cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/data/data_manager.py", line 128, in __init__
    test_loader = build_data_loader(
  File "/home/dpsh/Dassl.pytorch/dassl/data/data_manager.py", line 45, in build_data_loader
    assert len(data_loader) > 0
AssertionError
So, you might find OpenAI's code produces around 59% accuracy for zero-shot CLIP (vision_model=RN50) on ImageNet with prompt ensembling, but CoOp's code gives only 57.81% for the same model (see Table 7 in the paper).
This difference is caused by using different transforms: OpenAI's code applies Resize(224) to an image while CoOp's code (the previous version) uses Resize((224, 224)). More specifically, the former keeps the image aspect ratio while the latter doesn't. To allow the results produced by CoOp's code to be comparable to OpenAI's code, we have made our transforms consistent with theirs. So the transforms in the config files have now been changed from ["random_flip", "random_translation", "center_crop", "normalize"] to ["random_resized_crop", "random_flip", "normalize"].
If you are using our Dassl-based CoOp code, please update the code to the latest version. If you want to use your own code, you can simply copy CoOp's model code (i.e. CustomCLIP) and do the comparison on the same ground with whatever pipeline you are using.
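The difference between the two resize calls can be illustrated with a little arithmetic. This is a minimal sketch that only mimics the size computation in plain Python; the function names and the 640x480 input are made up for illustration, and the exact rounding used by the image library may differ.

```python
def resize_shorter_side(width, height, size):
    """Mimic Resize(size) with an int: the shorter side becomes `size`
    and the longer side is scaled by the same factor (aspect ratio kept)."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def resize_exact(width, height, size):
    """Mimic Resize((size, size)): both sides are forced to `size`,
    so the aspect ratio is not preserved."""
    return size, size


w, h = 640, 480  # hypothetical input image size
kept = resize_shorter_side(w, h, 224)
forced = resize_exact(w, h, 224)

print("Resize(224):        ", kept, "scale x/y:", kept[0] / w, kept[1] / h)
print("Resize((224, 224)): ", forced, "scale x/y:", forced[0] / w, forced[1] / h)
```

With the aspect-ratio-preserving variant, a center crop of 224x224 afterwards trims the longer side instead of squeezing the whole image, so objects keep their proportions.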
For your reference, we have rerun CoOp using the new config files and put below the comparison of Table 7's results.
Previous version (Table 7 in the paper):
Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
---|---|---|---|---|
Prompt engineering | 55.41 | 58.72 | 59.88 | 64.71 |
Prompt ensembling | 57.81 | 60.49 | 62.01 | 67.31 |
CoOp | 60.46 | 64.39 | 64.92 | 70.13 |
Current version (rerun with the new config files):
Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
---|---|---|---|---|
Prompt engineering | 58.18 | 61.26 | 62.05 | 66.73 |
Prompt ensembling | 60.41 | 62.54 | 63.71 | 68.74 |
CoOp | 62.95 | 66.60 | 66.85 | 71.92 |
Hi, thanks for the nice code.
I found the performance is poor when full fine-tuning the ResNet-based CLIP on ImageNet while for ViT-based CLIP the performance is good. Do you have some insightful comments on why full fine-tuning or linear probing the ResNet-based CLIP makes the performance worse?
When training on the 1000 ImageNet classes, the GPU memory used by the prompts seems very large and results in an out-of-memory error on a 16 GB GPU.
How can I solve this problem?
Dear Zhou:
When I train my network on oxford_flower (epoch=200), it gets a great result as follows:
=> result
Thanks for your great work! And I want to know how to run on cifar_100?
Thanks for your great work. I would like to ask you whether you have considered using a CNN network such as ResNet as the backbone in CoCoOp and whether it is possible to use it?
Sorry for the dumb questions. I didn't really understand the description. Is it possible to use your checkpoints for ViT models to do classification tasks? Can I just load them into these models (from the OpenAI repository) without using your script? Are they better than the original OpenAI weights?
When I finished testing the other ten data sets, I tried the Imagenet data set. After the training was completed, an error occurred during the test:
  File "/home/dpsh/Dassl.pytorch/dassl/evaluation/evaluator.py", line 69, in evaluate
    acc = 100.0 * self._correct / self._total
ZeroDivisionError: float division by zero
Is there any similar situation? How was it resolved?
If the token prefix and suffix are just slices of the embedding, then by replacing self.register_buffer("token_prefix", embedding[:, :1, :]) with self.token_prefix = embedding[:, :1, :] in this line, we would not have to ignore them when loading. So why do the token prefix and suffix have to be buffers? Thanks a lot!
I trained on my custom dataset.
python interpret_prompt.py gives the output below:
Return the top-3 matched words
Size of token embedding: torch.Size([49408, 512])
Size of context: torch.Size([16, 512])
Size of distance matrix: torch.Size([16, 49408])
1: ['onic</w>', 'yc</w>', 'bet'] ['0.5414', '0.5420', '0.5438']
2: ['bat', 'cap', 'advising</w>'] ['0.6391', '0.6398', '0.6403']
3: ['hell</w>', 'ta</w>', 'shaman</w>'] ['0.5795', '0.5809', '0.5836']
4: ['regram</w>', '-$</w>', 'marketing</w>'] ['0.5730', '0.5778', '0.5782']
5: ['fied</w>', 'sighted</w>', 'promote</w>'] ['0.5591', '0.5609', '0.5611']
6: ['potus</w>', 'ghi</w>', 'ongi</w>'] ['0.6077', '0.6081', '0.6112']
7: ['taya</w>', 'tive</w>', 'ica</w>'] ['0.6110', '0.6179', '0.6198']
8: ['believe</w>', 'lies</w>', 'worked</w>'] ['0.5304', '0.5321', '0.5331']
9: ['dess</w>', 'mariti', 'end'] ['0.5861', '0.5861', '0.5895']
10: ['cooking</w>', 'coach</w>', 'awesome</w>'] ['0.5644', '0.5723', '0.5734']
11: ['takeover</w>', 'artworks</w>', 'doctors</w>'] ['0.5982', '0.6008', '0.6010']
12: ['ig', 'vino</w>', 'inas</w>'] ['0.5264', '0.5279', '0.5305']
13: ['ame</w>', 'ella</w>', 'ed'] ['0.5310', '0.5341', '0.5401']
14: ['6</w>', '3</w>', 'met</w>'] ['0.5557', '0.5569', '0.5574']
15: ['meanings</w>', 'signage</w>', 'trade'] ['0.6725', '0.6733', '0.6736']
16: ['arrived</w>', 'credits</w>', 'desire</w>'] ['0.6162', '0.6256', '0.6264']
If I use these words directly, the length of the tokenized text is not 16.
I want to know: what is the meaning of </w>, and how can I use this output? Thanks.
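The output above pairs each of the 16 learned context vectors with its three closest vocabulary tokens and their distances. The sketch below shows that kind of lookup on toy data. It assumes Euclidean distance, uses a made-up four-token vocabulary with 2-D embeddings, and top_k_nearest is a hypothetical helper rather than the script's actual code. In CLIP's byte-pair vocabulary the </w> suffix marks a token that ends a word; tokens without it are word-internal pieces.

```python
import math


def top_k_nearest(context_vec, token_embeddings, vocab, k=3):
    """Hypothetical helper: return the k (distance, token) pairs closest
    to one context vector, using Euclidean distance."""
    scored = [
        (math.dist(context_vec, emb), word)
        for emb, word in zip(token_embeddings, vocab)
    ]
    scored.sort()
    return scored[:k]


# Toy vocabulary and 2-D embeddings, made up for illustration only.
vocab = ["a</w>", "photo</w>", "of</w>", "bet"]
token_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]

context_vec = [0.9, 0.1]  # one learned context vector (hypothetical values)
nearest = top_k_nearest(context_vec, token_embeddings, vocab, k=3)
for dist, word in nearest:
    print(f"{word}: {dist:.4f}")
```

Because the learned vectors live in a continuous space, the nearest tokens are only an interpretation aid. Typing the matched words back into a text prompt gives different embeddings (and possibly a different number of tokens) than the learned context itself.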
How do I train a zero-shot model?
Dear Zhou,
Thank you for sharing Dassl!
I ran into a problem when implementing 'coop' for multi-label classification.
My labels in one-hot representation look like [0,1,0,1,0,0,0,1], so how should I define '_classnames' and '_lab2cname' in 'base_dataset.py'?
I have already reshaped my data like {train:classname:[[name1],[name7]], impath: xxx, label: [0,1,0,0,0,0,0,1]} and fed it into 'Datum'.
Do you have any suggestions, or is it possible to update Dassl to be compatible with multi-label tasks?
Many Thanks
Thanks for your great job!
I want to ask why the input of the forward function is not (image, text), such as output = self.model(image, text).
And what is the scheme for matching text logits and image logits?
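As a rough illustration of how image and text features can be matched when the model only receives an image: the text features for all classes are built inside the model from the class names and the learned context, and the logits are scaled cosine similarities between the image feature and each class's text feature. The sketch below is plain Python under those assumptions. The 2-D features and the scale of 100 are made-up values, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_feat, text_feats, logit_scale=100.0):
    """One logit per class: scaled cosine similarity between the image
    feature and that class's text feature."""
    img = l2_normalize(image_feat)
    return [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(txt)))
        for txt in text_feats
    ]


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


image_feat = [0.6, 0.8]                             # hypothetical image feature
text_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one text feature per class
logits = similarity_logits(image_feat, text_feats)
loss = cross_entropy(logits, label=1)
print(logits, loss)
```

In this view, cross-entropy over the similarity logits already is a text-image similarity objective, and prediction is the class whose text feature is most similar to the image feature.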
First of all, thank you for open sourcing such an easy to use code :)
I tried to reproduce your reported CoOp results on two datasets, DTD and Flower101. I ran the code with three random seeds (1, 2 and 3) for both datasets, as in your default setting in ./scripts/main.sh.
On DTD, the model matches the result in the paper (acc: 63.46) when trained with seed=3, but the results for seeds 1 and 2 are poor (acc: ~15).
As for Flower101, the results for seeds 2 and 3 are ~94, but seed 1 gives 44.50.
I wonder if this is normal for this few-shot training setting? Thanks for any suggestions :)
Thanks for providing such an outstanding work!
I have a question related to CoCoOp.
In my experience, I need a lot of images to use CoCoOp, and it is somewhat time-consuming because it always needs to extract the text embeddings for each image.
Have you tried to aggregate the image embedding not at the input of the text encoder, but after the text embedding is extracted?
Thanks
Hi, may I ask whether the original CLIP text encoder is frozen or not? The paper mentions that the text encoder is frozen, but I couldn't find that part in the code... Thanks a lot for your help!
Thanks for your great work!
I tried to use your code to reproduce some results of CoOp reported in your CoCoOp paper.
I tried this model on the dtd dataset with: bash main.sh dtd vit_b16_ep50 end 4 16 False.
This is exactly the same setting as in the paper.
I got a much higher performance: accuracy: 67.38% +- 0.51%.
But the paper reports the CoOp performance as 54.24.
I have successfully set up the train and test pipeline for my custom dataset. Can you help me run inference on a single image? I am using the trainer.model_inference(image) function. Is there a particular format this image needs to be in? I am using PIL to read the image.
Error:
/ContextOptimization/CoOp/trainers/coop.py", line 196, in forward
    image_features = self.image_encoder(image.type(self.dtype))
  File "/home/chandan/anaconda3/envs/coop/lib/python3.8/site-packages/PIL/Image.py", line 519, in __getattr__
    raise AttributeError(name)
AttributeError: type
Main function used:
def main(args):
    cfg = setup_cfg(args)
    if cfg.SEED >= 0:
        print("Setting fixed seed: {}".format(cfg.SEED))
        set_random_seed(cfg.SEED)
    setup_logger(cfg.OUTPUT_DIR)
    if torch.cuda.is_available() and cfg.USE_CUDA:
        torch.backends.cudnn.benchmark = True
    print_args(args, cfg)
    print("Collecting env info ...")
    print("** System info **\n{}\n".format(collect_env_info()))
    trainer = build_trainer(cfg)
    trainer.load_model(args.model_dir, epoch=args.load_epoch)
    image = Image.open('/ContextOptimization/CoOp/data/0cd2ed50.png')
    result = trainer.model_inference(image)
    print(result)
    return result
I am looking for the predicted class and predicted probabilities as output.
Any direction would be appreciated.
Thanks
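The traceback above shows image.type(...) being called on a PIL image, which suggests the model expects an already preprocessed, batched tensor rather than a raw PIL object. Once the model returns one logit per class, the predicted class and probabilities follow from a softmax. The sketch below covers only that last step in plain Python; the class names and logit values are hypothetical, and predict is not a function from the code base.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, classnames):
    """Hypothetical helper: return (predicted class name, probabilities)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classnames[best], probs


classnames = ["cat", "dog", "bird"]  # hypothetical class names
logits = [2.0, 0.5, -1.0]            # hypothetical model output for one image
label, probs = predict(logits, classnames)
print(label, [round(p, 4) for p in probs])
```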
Hi, thanks for the great work, but I found that it is hard to reproduce the results in the paper.
For example, using the released checkpoints in https://github.com/KaiyangZhou/CoOp#models-and-results, the results of vit-b32-ep50 (nctx=16, shots=16, ctp=end, csc=False) on ImageNet are:
setting | transform | seed1 | seed2 | seed3 |
---|---|---|---|---|
paper | - | 66.85 | - | - |
released checkpoint (inference only) | ["random_resized_crop", "random_flip", "normalize"] | 64.38 | 64.72 | 64.72 |
released checkpoint (inference only) | ["random_flip", "random_translation", "center_crop", "normalize"] | 65.11 | 65.32 | 65.34 |
our reproduce (training from scratch then inference) | ["random_resized_crop", "random_flip", "normalize"] | 65.21 | - | - |
They are all much lower (64.3~65.3) than the results in the paper (66.85), and using the updated transform in #8 (comment) for the released checkpoint gives even worse performance.
For CoCoOp, the reported result of vit-b16-ep10 (nctx=4, shots=16, ctp=end) on ImageNet is 71.02, but our reproduction (training from scratch, then inference) gives 70.14, which is also lower.
Our environment information:
V100-32G / Titan RTX
dassl=0.4.2
torch=1.7.1+cu110
torchvision=0.8.2+cu110
I wonder if I miss something? Thanks a lot.
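For comparisons like the one above, it may help to summarize the per-seed accuracies as a mean and standard deviation before comparing with the single number from the paper. This is a minimal sketch using the per-seed values from the table in this post; the use of the population standard deviation is an assumption, not necessarily what the project's own result parser does.

```python
from statistics import mean, pstdev

# Per-seed accuracies copied from the table above (released checkpoint rows).
results = {
    "released checkpoint, new transforms": [64.38, 64.72, 64.72],
    "released checkpoint, old transforms": [65.11, 65.32, 65.34],
}
paper = 66.85  # value reported in the paper, as quoted in the post

summary = {}
for name, accs in results.items():
    avg, std = mean(accs), pstdev(accs)  # population std is an assumption
    summary[name] = (avg, std)
    print(f"{name}: {avg:.2f} +- {std:.2f} (gap to paper: {paper - avg:.2f})")
```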
Assuming the experimental dataset is coco2014, how should I define the classnames?
When I run my own dataset, which has been modified following your method,
There are two problems below:
Hello,
Thank you for sharing your great work.
I had a question regarding the few-shot setting in the CoCoOp experiments. In the paper, it is mentioned that CoCoOp follows a zero-shot evaluation (from base to novel classes) but for training the base classes, it uses a few-shot setting. However, generally for zero-shot evaluation, models are trained on the complete base classes.
Does this mean that CoCoOp and CoOp require only a few-shot setting to perform well on novel categories? Can the same training recipe of CoCoOp or CoOp be used when training on all examples of the base classes?
Thank you and kind regards.
Thanks for sharing this code. Was curious if you had considered how to handle a regression task?
I thought I might try a few ideas out, perhaps starting with a simple percentage label. Like [V]1 [V]2 ... [V]M 40%
but was curious if you had tried this or had any intuitions.
Hello, I would like to ask a question: when we run the linear-probe CLIP experiment (ViT-B/32), which parameters should be set as tunable? Is it clip_model.visual.ln_post and clip_model.ln_final?
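Hello, thanks for open-sourcing this! Is the CLIP you compare against the original CLIP? From what I can see, your text encoder seems to have been fine-tuned, so shouldn't its generalization ability be weaker than that of the original model trained on 400 million samples?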
Hello, thanks for your excellent work! Here I have a code problem when I fine-tune the model on my own dataset. I just follow the organization manner in CoOp/datasets to write my dataset code, but failed to register the dataset. I also add xxx.yaml in CoOp/configs/datasets, and still failed to register, could you give me some advice if I need to add extra code? Thanks!
Thank you for sharing the code,
May I ask whether it is possible to run this on a single RTX 3090 (24 GB)?
Thanks for your great work!
The previous issue has been solved, but I find a new issue.
If I change the code (
Line 248 in ff61507
Hi,
thanks for your great work!
You state in your paper that the one shot experiments are trained for 50 epochs, but after using your code, it looks like the results you report for one shot are consistent with training for 200 epochs. When training with 50 epochs, I obtain results that are much better than those reported in the paper.
Any idea on what causes this?
Thanks!
I found that when I set CUDA_VISIBLE_DEVICES="1", the code stalls at
self.model = CustomCLIP(cfg, classnames, clip_model)
self.model.to(self.device)
The to(self.device) call waits for a long time and never returns.
Thanks for your contributions!
I have a question about "classnames = self.dm.dataset.classnames" (the line 224 of "CoOp.build_model" in coop.py).
What is the value of "classnames"? I checked the configuration files and didn't find out.
raise ValueError(
ValueError: The requested one is expected to belong to ['SE', 'MCD', 'MME', 'ADDA', 'CDAC', 'DAEL', 'DANN', 'AdaBN', 'M3SDA', 'SourceOnly', 'DDAIG', 'DAELDG', 'Vanilla', 'CrossGrad', 'DomainMix', 'EntMin', 'FixMatch', 'MixMatch', 'MeanTeacher', 'SupBaseline'], but got [CoOp] (do you mean [CrossGrad]?)
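The list in the error above contains only built-in trainer names, which suggests that the module defining the CoOp trainer was never imported before the trainer was looked up. Registries of this kind are usually filled by a decorator that runs when the defining module is imported. The sketch below is a simplified stand-in written for illustration; the Registry class here is hypothetical and is not Dassl's actual implementation.

```python
class Registry:
    """Simplified stand-in for a name -> class registry (not Dassl's code)."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def register(self):
        # The decorator runs when the decorated class definition is executed,
        # i.e. when the module containing it is imported.
        def decorator(cls):
            self._objects[cls.__name__] = cls
            return cls
        return decorator

    def get(self, name):
        if name not in self._objects:
            raise ValueError(
                f"expected one of {sorted(self._objects)}, but got [{name}]"
            )
        return self._objects[name]


TRAINER_REGISTRY = Registry("TRAINER")


@TRAINER_REGISTRY.register()
class Vanilla:
    pass


# Before the class is defined (imported), the lookup fails.
try:
    TRAINER_REGISTRY.get("CoOp")
    found_before = True
except ValueError as err:
    found_before = False
    print(err)


# In a real project this definition would live in its own module,
# which has to be imported somewhere before the lookup happens.
@TRAINER_REGISTRY.register()
class CoOp:
    pass


found_after = TRAINER_REGISTRY.get("CoOp") is CoOp
print(found_before, found_after)
```

Under this assumption, checking that the training script imports the module with the custom trainer (or dataset) before building it is a reasonable first step.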
Hello!
As described in the README, CoOp can be used with the datasets listed in CoOp/configs/datasets/. If I want to try CoOp on my own datasets, what should I do?
Looking forward to your reply!
Thanks
Hello,
when I try to run one of your example bash commands, I somehow can't figure out how to get rid of this error.
(dassl) user@MBP scripts % bash main.sh caltech101 rn50_ep50 middle 16 1 True
Run this job and save the output to output/caltech101/CoOp/rn50_ep50_1shots/nctx16_cscTrue_ctpmiddle/seed1
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    import trainers.zsclip
  File "/Users/user/Projects/CoOp/CoOp/trainers/zsclip.py", line 10, in <module>
    from .ours import load_clip_to_cpu
ModuleNotFoundError: No module named 'trainers.ours'
Hi, Thanks a lot for the excellent work and the easy-to-use code!
Recently I've been trying to use CoOp and CoCoOp in my research.
However I encounter a small problem: the GPU consumption of CoCoOp seems to be much larger (about 64X under my setting) than CoOp, resulting in small batch size and very long training time. Based on my understanding, the reason is that the prompts in CoCoOp should be given to each instance instead of each batch. I've seen the same problem reported in the paper.
May I ask whether there are any tricks during training to accelerate the training process? Thanks so much!
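The cost difference described above can be estimated by counting how many prompt sequences pass through the text encoder per training step. This is only a back-of-the-envelope sketch: it assumes one prompt per class for CoOp and one prompt per class per image for CoCoOp, ignores the image encoder, and uses made-up class and batch sizes.

```python
def text_encoder_sequences(n_classes, batch_size, instance_conditional):
    """Rough count of prompt sequences encoded by the text encoder per step."""
    if instance_conditional:
        # One set of class prompts for every image in the batch.
        return batch_size * n_classes
    # One shared set of class prompts for the whole batch.
    return n_classes


n_classes, batch_size = 100, 32  # hypothetical setting
coop = text_encoder_sequences(n_classes, batch_size, instance_conditional=False)
cocoop = text_encoder_sequences(n_classes, batch_size, instance_conditional=True)
print(coop, cocoop, cocoop // coop)
```

Under these assumptions the text-encoder workload grows linearly with the batch size, which is consistent with needing a much smaller batch for CoCoOp on the same GPU.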