kaiyangzhou / coop Goto Github PK


1.7K 15.0 192.0 1.41 MB

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

License: MIT License

Python 95.81% Shell 4.19%

foundation-models multimodal-learning prompt-learning

coop's Issues

TRAINER.OURS.N_CTX, ... in main.sh

In main.sh, I think these should be .COOP instead of .OURS:

TRAINER.OURS.N_CTX ${NCTX} 
TRAINER.OURS.CSC ${CSC} 
TRAINER.OURS.CLASS_TOKEN_POSITION ${CTP}

Cannot reproduce the accuracy on StanfordCars dataset

Greetings!
I can only get 46.81% accuracy and 47.05% per-class accuracy after running bash zeroshot.sh stanford_cars rn50. However, the reported accuracy on StanfordCars dataset is ~55%. What's wrong?

Raw data of Figure 3

Hi, congratulations on your wonderful work!

Could you please provide the raw data you used in Figure 3 of your paper? My email is [email protected]

Many thanks!

can't download json

Hello, the page links for split_zhou_Caltech101.json, split_zhou_OxfordFlowers.json, and split_zhou_DescribableTextures.json can't be opened. Is there any other download link?

Is there a spelling error in the CoCoOp running scripts?

seed=1

base base2new_train.sh imagenet 1
base base2new_test.sh imagenet 1

seed=2

base base2new_train.sh imagenet 2
base base2new_test.sh imagenet 2

seed=3

base base2new_train.sh imagenet 3
base base2new_test.sh imagenet 3

Is the problem that "bash" is misspelled as "base"?

training speed

Thank you for your contribution.
I found that training is slower with multiple GPUs (e.g., 8 GPUs) than with a single GPU. Do you know why, and how to speed up the training process?

Config Optimizer Overwritten

I have tried to change the optimizer attributes to an ADAM optimizer with different LR scheduling and ADAM specific parameters, but when run, it overwrites the LR Scheduler parameters and the betas.

The config file:

DATALOADER:
  TRAIN_X:
    BATCH_SIZE: 8
  TEST:
    BATCH_SIZE: 100
  NUM_WORKERS: 4

INPUT:
  SIZE: (224, 224)
  INTERPOLATION: "bicubic"
  PIXEL_MEAN: [0.48145466, 0.4578275, 0.40821073]
  PIXEL_STD: [0.26862954, 0.26130258, 0.27577711]
  TRANSFORMS: ["random_resized_crop", "random_flip", "normalize"]

OPTIM:
  NAME: "adam"
  LR: 0.0002
  ADAM_BETA1: 0.5
  ADAM_BETA2: 0.999
  MAX_EPOCH: 100
  LR_SCHEDULER: "single_step"
  GAMMA: 0.1
  STEPSIZE: 0
  WARMUP_EPOCH: 0
  WARMUP_TYPE: "constant"
  WARMUP_CONS_LR: 1e-5

TRAIN:
  PRINT_FREQ: 20

MODEL:
  BACKBONE:
    NAME: "videoclip"

TRAINER:
  COCOOP:
    N_CTX: 4
    CTX_INIT: ''
    PREC: 'amp'

The log output once run:

OPTIM:
  ADAM_BETA1: 0.9
  ADAM_BETA2: 0.999
  BASE_LR_MULT: 0.1
  GAMMA: 0.1
  LR: 0.0003
  LR_SCHEDULER: single_step
  MAX_EPOCH: 10
  MOMENTUM: 0.9
  NAME: adam
  NEW_LAYERS: ()
  RMSPROP_ALPHA: 0.99
  SGD_DAMPNING: 0
  SGD_NESTEROV: False
  STAGED_LR: False
  STEPSIZE: (-1,)
  WARMUP_CONS_LR: 1e-05
  WARMUP_EPOCH: -1
  WARMUP_MIN_LR: 1e-05
  WARMUP_RECOUNT: True
  WARMUP_TYPE: linear
  WEIGHT_DECAY: 0.0005

Is this a problem with DASSL or a problem with the CoCoOp code base?

Thanks!
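
One way to narrow this down is to diff the requested OPTIM values against the ones printed in the log. The sketch below hard-codes a few values from the two listings above; diff_config is a hypothetical helper, not part of Dassl or CoOp. The keys that still match (for example ADAM_BETA2) may simply coincide with default values, which would be consistent with the YAML file not being merged at all. That is an assumption to verify, not a confirmed diagnosis.

```python
def diff_config(requested, effective):
    """Return {key: (requested, effective)} for keys whose values differ."""
    return {
        key: (value, effective.get(key))
        for key, value in requested.items()
        if effective.get(key) != value
    }


# Values copied from the config file in the issue.
requested = {
    "NAME": "adam",
    "LR": 0.0002,
    "ADAM_BETA1": 0.5,
    "ADAM_BETA2": 0.999,
    "MAX_EPOCH": 100,
    "WARMUP_EPOCH": 0,
    "WARMUP_TYPE": "constant",
}

# Values copied from the log output in the issue.
effective = {
    "NAME": "adam",
    "LR": 0.0003,
    "ADAM_BETA1": 0.9,
    "ADAM_BETA2": 0.999,
    "MAX_EPOCH": 10,
    "WARMUP_EPOCH": -1,
    "WARMUP_TYPE": "linear",
}

overridden = diff_config(requested, effective)
for key, (want, got) in sorted(overridden.items()):
    print(f"{key}: requested {want!r}, got {got!r}")
```

If nearly every requested key shows up in the diff, a useful next check is whether the path passed as the config file is the one that was edited.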

Pretrained weight for CoCoop

Do you plan to release the pretrained weight for CoCoop anytime soon?

Thanks!

More detailed few shot results

Can you provide the few-shot results for the different few-shot settings on the 11 datasets, with the ViT-B image backbone of CLIP?
I tried the settings in the paper, but some results cannot be reproduced (Food, for instance).

about replace ce loss with similarity.

Hi Kaiyang,

Thanks for your amazing work!

I see that F.cross_entropy(output, label) is used in your training. I wonder if it is possible to replace it with the similarity between text and image, by adding the corresponding class-name embedding to the learned prompt, and then doing one-shot prediction like CLIP. Is it possible to do so?

Thanks!

Can you provide parameters for different models and different data sets

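Great work, but how can it be used simply and directly?

Do I need to train from scratch myself? Is it not possible to, for example, load trained weights and compute the embeddings of text or images, as with the original CLIP?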

assert len(data_loader) > 0, AssertionError

Thank you for your work. I encountered this kind of error when running the Imagenet dataset. Have you encountered any similar errors? How did you solve it?
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    main(args)
  File "train.py", line 142, in main
    trainer = build_trainer(cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/engine/build.py", line 11, in build_trainer
    return TRAINER_REGISTRY.get(cfg.TRAINER.NAME)(cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/engine/trainer.py", line 319, in __init__
    self.build_data_loader()
  File "/home/dpsh/Dassl.pytorch/dassl/engine/trainer.py", line 342, in build_data_loader
    dm = DataManager(self.cfg)
  File "/home/dpsh/Dassl.pytorch/dassl/data/data_manager.py", line 128, in __init__
    test_loader = build_data_loader(
  File "/home/dpsh/Dassl.pytorch/dassl/data/data_manager.py", line 45, in build_data_loader
    assert len(data_loader) > 0
AssertionError

Important changes made to Dassl's transforms.py

So, you might find OpenAI's code produces around 59% accuracy for zero-shot CLIP (vision_model=RN50) on ImageNet with prompt ensembling, but CoOp's code gives only 57.81% for the same model (see Table 7 in the paper).

This difference is caused by using different transforms: OpenAI's code applies Resize(224) to an image while CoOp's code (the previous version) uses Resize((224, 224)). More specifically, the former keeps the image aspect ratio while the latter doesn't. To allow the results produced by CoOp's code to be comparable to OpenAI's code, we have made our transforms consistent with theirs. So the transforms in the config files have now been changed from ["random_flip", "random_translation", "center_crop", "normalize"] to ["random_resized_crop", "random_flip", "normalize"].
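
The difference between the two resize calls can be sketched with plain arithmetic. The functions below are hypothetical helpers that only compute output sizes; they mirror the documented torchvision behaviour that an int size matches the shorter edge while a (height, width) pair forces both edges. The 500x375 input is an arbitrary example.

```python
def resize_shorter_side(width, height, size=224):
    """Output (width, height) when the shorter side is scaled to `size`,
    keeping the aspect ratio, as with Resize(224)."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def resize_exact(width, height, size=(224, 224)):
    """Output (width, height) when both sides are forced to `size`,
    ignoring the aspect ratio, as with Resize((224, 224))."""
    return size


w, h = 500, 375  # arbitrary landscape image
kept = resize_shorter_side(w, h)
squashed = resize_exact(w, h)
print("keep aspect ratio:", kept, "ratio", round(kept[0] / kept[1], 3))
print("force square:     ", squashed, "ratio", squashed[0] / squashed[1])
```

In the first case a later center crop discards the sides of the image; in the second the whole image is kept but distorted.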

If you are using our Dassl-based CoOp code, please update the code to the latest version. If you want to use your own code, you can simply copy CoOp's model code (i.e. CustomCLIP) and do the comparison on the same ground with whatever pipelines you are using.

For your reference, we have rerun CoOp using the new config files and put below the comparison of Table 7's results.

Previous version

Method	RN50	RN101	ViT-B/32	ViT-B/16
Prompt engineering	55.41	58.72	59.88	64.71
Prompt ensembling	57.81	60.49	62.01	67.31
CoOp	60.46	64.39	64.92	70.13

Current version

Method	RN50	RN101	ViT-B/32	ViT-B/16
Prompt engineering	58.18	61.26	62.05	66.73
Prompt ensembling	60.41	62.54	63.71	68.74
CoOp	62.95	66.60	66.85	71.92
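
As a quick worked comparison, the per-backbone change between the two tables can be computed directly. The numbers below are copied from the tables above; the variable names are only for illustration.

```python
backbones = ["RN50", "RN101", "ViT-B/32", "ViT-B/16"]

previous = {
    "Prompt engineering": [55.41, 58.72, 59.88, 64.71],
    "Prompt ensembling": [57.81, 60.49, 62.01, 67.31],
    "CoOp": [60.46, 64.39, 64.92, 70.13],
}
current = {
    "Prompt engineering": [58.18, 61.26, 62.05, 66.73],
    "Prompt ensembling": [60.41, 62.54, 63.71, 68.74],
    "CoOp": [62.95, 66.60, 66.85, 71.92],
}

# Accuracy change (current minus previous) for each method and backbone.
gains = {
    method: [round(c - p, 2) for p, c in zip(previous[method], current[method])]
    for method in previous
}

for method, deltas in gains.items():
    row = ", ".join(f"{b}: +{d:.2f}" for b, d in zip(backbones, deltas))
    print(f"{method}: {row}")
```

Every entry improves with the new transforms, so comparisons should use numbers from the same version of the table.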

the performance about full fine-tuning on ResNet.

Hi, thanks for the nice code.
I found the performance is poor when full fine-tuning the ResNet-based CLIP on ImageNet while for ViT-based CLIP the performance is good. Do you have some insightful comments on why full fine-tuning or linear probing the ResNet-based CLIP makes the performance worse?

the GPU consumption trained on ImageNet

When training on the 1000 ImageNet classes, the GPU memory used by the prompts seems very large and results in an out-of-memory error on a 16GB GPU card.
How can I solve this problem?

When I train my network on oxford_flower (epoch=200), it gets a different result.

Dear Zhou:
When I train my network on oxford_flower (epoch=200), it gets a great result:
=> result

total: 2,463
correct: 2,268
accuracy: 92.1%
error: 7.9%
macro_f1: 91.6%
Elapsed: 0:14:32
But if I run it again (as your code shows, it will load the model I trained last time, which got good results), it gets a bad result:
=> result
total: 2,463
correct: 876
accuracy: 35.6%
error: 64.4%
macro_f1: 30.1%
I am not sure why it produces a bad result. Can you give me some advice?
(Maybe it does not use the BN of the trained model?)

How to run on cifar_100?

Thanks for your great work! And I want to know how to run on cifar_100?

If CoCoOp can use ResNet as backbone

Thanks for your great work. I would like to ask you whether you have considered using a CNN network such as ResNet as the backbone in CoCoOp and whether it is possible to use it?

Questions about checkpoints

Sorry for the dumb questions. I didn't really understand the description. Is it possible to use your checkpoints for ViT models to do classification tasks? Can I just load them into the models from the OpenAI repository, without using your script? Are they better than the original OpenAI weights?

Zero division problem of Imagenet dataset

When I finished testing the other ten data sets, I tried the Imagenet data set. After the training was completed, an error occurred during the test:

File "/home/dpsh/Dassl.pytorch/dassl/evaluation/evaluator.py", line 69, in evaluate

acc = 100.0 * self._correct / self._total

ZeroDivisionError: float division by zero

Is there any similar situation? How was it resolved?
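
The error means that the evaluator's total was zero, i.e. no test samples were processed. A minimal sketch of a guard that makes this explicit is shown below; accuracy_percent is a hypothetical helper, not Dassl's evaluator. An empty test loader (as in the assert len(data_loader) > 0 report on this page) would be one possible cause, so the ImageNet validation folder and split files are worth checking.

```python
def accuracy_percent(correct, total):
    """Accuracy in percent, with an explicit error when nothing was evaluated."""
    if total == 0:
        raise ValueError(
            "No test samples were evaluated; check the dataset path and split."
        )
    return 100.0 * correct / total


print(round(accuracy_percent(2268, 2463), 1))

try:
    accuracy_percent(0, 0)
except ValueError as err:
    message = str(err)
    print("caught:", message)
```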

Why do token prefixes have to be a buffer type

If the token prefix and suffix are just slices of the embedding, then, for example, replacing self.register_buffer("token_prefix", embedding[:, :1, :]) with self.token_prefix = embedding[:, :1, :] in this line would mean we do not have to ignore them when loading. So why do the token prefixes have to be buffers? Thanks a lot!
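
A small sketch of the practical difference, using stand-in modules rather than CoOp's PromptLearner: a registered buffer follows the module through .to(...) and is saved in state_dict, while a plain tensor attribute does neither. The example casts to float64 instead of moving to a GPU so that it runs on CPU; moving to a device follows the same rule.

```python
import torch
import torch.nn as nn


class WithBuffer(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        # Registered buffer: tracked by the module, but not a trainable parameter.
        self.register_buffer("token_prefix", embedding[:, :1, :])


class WithAttribute(nn.Module):
    def __init__(self, embedding):
        super().__init__()
        # Plain attribute: the module does not track this tensor at all.
        self.token_prefix = embedding[:, :1, :]


embedding = torch.zeros(3, 77, 8)  # toy stand-in for the token embedding
buffered = WithBuffer(embedding).to(torch.float64)
plain = WithAttribute(embedding).to(torch.float64)

print(buffered.token_prefix.dtype, list(buffered.state_dict()))
print(plain.token_prefix.dtype, list(plain.state_dict()))
```

With the plain attribute, the prefix would stay on the CPU when the model is moved to a GPU. If the only goal is to keep the tensor out of checkpoints, recent PyTorch versions also accept register_buffer(..., persistent=False).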

about the output of interpret_prompt.py

I trained on my custom dataset.
Running python interpret_prompt.py gives the output below:
Return the top-3 matched words
Size of token embedding: torch.Size([49408, 512])
Size of context: torch.Size([16, 512])
Size of distance matrix: torch.Size([16, 49408])
1: ['onic</w>', 'yc</w>', 'bet'] ['0.5414', '0.5420', '0.5438']
2: ['bat', 'cap', 'advising</w>'] ['0.6391', '0.6398', '0.6403']
3: ['hell</w>', 'ta</w>', 'shaman</w>'] ['0.5795', '0.5809', '0.5836']
4: ['regram</w>', '-$</w>', 'marketing</w>'] ['0.5730', '0.5778', '0.5782']
5: ['fied</w>', 'sighted</w>', 'promote</w>'] ['0.5591', '0.5609', '0.5611']
6: ['potus</w>', 'ghi</w>', 'ongi</w>'] ['0.6077', '0.6081', '0.6112']
7: ['taya</w>', 'tive</w>', 'ica</w>'] ['0.6110', '0.6179', '0.6198']
8: ['believe</w>', 'lies</w>', 'worked</w>'] ['0.5304', '0.5321', '0.5331']
9: ['dess</w>', 'mariti', 'end'] ['0.5861', '0.5861', '0.5895']
10: ['cooking</w>', 'coach</w>', 'awesome</w>'] ['0.5644', '0.5723', '0.5734']
11: ['takeover</w>', 'artworks</w>', 'doctors</w>'] ['0.5982', '0.6008', '0.6010']
12: ['ig', 'vino</w>', 'inas</w>'] ['0.5264', '0.5279', '0.5305']
13: ['ame</w>', 'ella</w>', 'ed'] ['0.5310', '0.5341', '0.5401']
14: ['6</w>', '3</w>', 'met</w>'] ['0.5557', '0.5569', '0.5574']
15: ['meanings</w>', 'signage</w>', 'trade'] ['0.6725', '0.6733', '0.6736']
16: ['arrived</w>', 'credits</w>', 'desire</w>'] ['0.6162', '0.6256', '0.6264']

If I use these words directly, the length of the tokenized text is not 16.
What is the meaning of </w>, and how can I use this output? Thanks.
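
The output above looks like a nearest-neighbour lookup: for each of the 16 learned context vectors, the three vocabulary tokens with the smallest distance are listed. A toy sketch of that idea follows, with a made-up four-token vocabulary and 2-dimensional embeddings; nearest_words is a hypothetical helper and not the script's code. In CLIP's BPE vocabulary, a trailing </w> marks a token that ends a word.

```python
import math


def nearest_words(context_vector, vocab_embeddings, topk=3):
    """Return the `topk` vocabulary tokens closest to one context vector."""
    scored = sorted(
        (math.dist(context_vector, emb), token)
        for token, emb in vocab_embeddings.items()
    )
    return [(token, round(distance, 4)) for distance, token in scored[:topk]]


# Made-up vocabulary; real CLIP embeddings have 512 dimensions and 49408 tokens.
vocab = {
    "cat</w>": [1.0, 0.0],
    "dog</w>": [0.0, 1.0],
    "bet": [0.9, 0.2],
    "ing</w>": [-1.0, 0.0],
}

matches = nearest_words([1.0, 0.1], vocab)
print(matches)
```

Because the learned vectors are continuous, the nearest tokens are only a rough interpretation. Tokenizing the listed words will generally not give back 16 tokens, and the words are not a substitute for the learned context.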

how to train zero-shot model

how to train zero-shot model?

Running on Multi-label Classification

Dear Zhou,
Thank you for sharing Dassl!

I encountered some problems when implementing 'coop' for multi-label classification.
My labels in one-hot representation look like [0,1,0,1,0,0,0,1], so how should I define '_classnames' and '_lab2cname' in 'base_dataset.py'?
I have already reshaped my data like {train: classname: [[name1],[name7]], impath: xxx, label: [0,1,0,0,0,0,0,1]} and fed it into 'Datum'.

Do you have any suggestions, or is it possible to update Dassl to be compatible with multi-label tasks?

Many Thanks

About input of text

Thanks for your great work!
I want to ask why the input to the forward function is not (image, text), such as output = self.model(image, text).
And how are the text features and image features matched to produce the logits?
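
In CLIP-style classification the text side is built from the class names inside the model, so only the image needs to be passed in; the logits are scaled cosine similarities between the image feature and one text feature per class. A dependency-free sketch follows, with toy 2-dimensional features. The function names are hypothetical and logit_scale=100.0 is an assumed value for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def clip_style_logits(image_feature, text_features, logit_scale=100.0):
    """One logit per class: scaled cosine similarity between the image
    feature and that class's text feature."""
    image = normalize(image_feature)
    return [
        logit_scale * sum(a * b for a, b in zip(image, normalize(text)))
        for text in text_features
    ]


# Toy features: one image, two classes.
logits = clip_style_logits([2.0, 0.0], [[1.0, 0.0], [0.0, 3.0]])
predicted = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted)
```

These logits are what a cross-entropy loss would be applied to during prompt learning.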

Different random seeds lead to highly variable results.

First of all, thank you for open sourcing such an easy to use code :)
I tried to reproduce the results reported for CoOp on two datasets, DTD and Flower101. I ran the code with three random seeds (1, 2 and 3) for both datasets, as in your default setting in ./scripts/main.sh.
On DTD, the model matches the result in the paper (acc: 63.46) when trained with seed=3, but the results for seeds 1 and 2 are poor (acc: ~15).
On Flower101, the results for seeds 2 and 3 are ~94, but the result for seed 1 is 44.50.

I wonder if this is a normal situation for this few shot training setting? Thanks for any suggestion :)
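
To put a number on the spread, the mean and sample standard deviation over seeds can be computed as below. The two values of 94.0 are rounded placeholders for the "~94" quoted above, so the result is only indicative.

```python
from statistics import mean, stdev

# Flower101 accuracies for seeds 1, 2, 3; the two 94.0 entries are approximate.
flowers_acc = [44.50, 94.0, 94.0]

avg = mean(flowers_acc)
spread = stdev(flowers_acc)
print(f"mean = {avg:.2f}, std = {spread:.2f}")
```

A standard deviation of this size across three seeds points to a run that failed to train rather than ordinary few-shot noise.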

other variant of COCOOP

Thanks for providing such an outstanding work!
I have a question related to the cocoop.
In my experience, CoCoOp needs a lot of images, and it is somewhat time-consuming because it has to extract the text embeddings again for each image.
Have you tried combining the image embedding with the text embedding after the text encoder, rather than at the input of the text encoder?
Thanks

question about gradients on text encoder

Hi, may I ask whether the original CLIP text encoder is frozen or not? The paper mentions that the text encoder is frozen, but I couldn't find that part in the code... Thanks a lot for your help!
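
A common way to freeze everything except the learned prompt is to switch off requires_grad for every parameter whose name does not contain the prompt learner's name. The sketch below uses a tiny stand-in module with the attribute names discussed on this page; it is not the real CustomCLIP, and the linear layers are placeholders.

```python
import torch.nn as nn


class TinyCustomCLIP(nn.Module):
    """Stand-in with three sub-modules; not the actual CoOp model."""

    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(4, 4)
        self.text_encoder = nn.Linear(4, 4)
        self.prompt_learner = nn.Linear(4, 4)


model = TinyCustomCLIP()

# Turn off gradients for everything that is not part of the prompt learner.
for name, param in model.named_parameters():
    if "prompt_learner" not in name:
        param.requires_grad_(False)

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)
```

Gradients still flow through the frozen text encoder back to the prompt; only the encoder weights themselves are not updated.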

Much better CoOp performance

Thanks for your great work!
I tried to use your code to reproduce some of the CoOp results reported in your CoCoOp paper.
I ran this model on the dtd dataset with:
bash main.sh dtd vit_b16_ep50 end 4 16 False
This is exactly the same setting as in the paper.
I got a much higher performance: accuracy: 67.38% +- 0.51%.
But the paper reports the CoOp performance as 54.24.

Inferencing on single image

I have successfully set up the train and test pipeline for my custom dataset. Can you help me run inference on a single image? I am using the trainer.model_inference(image) function. Is there a particular format this image needs to be in? I am using PIL to read the image.

Error:
/ContextOptimization/CoOp/trainers/coop.py", line 196, in forward
    image_features = self.image_encoder(image.type(self.dtype))
  File "/home/chandan/anaconda3/envs/coop/lib/python3.8/site-packages/PIL/Image.py", line 519, in __getattr__
    raise AttributeError(name)
AttributeError: type

Main function used:

def main(args):
    cfg = setup_cfg(args)
    if cfg.SEED >= 0:
        print("Setting fixed seed: {}".format(cfg.SEED))
        set_random_seed(cfg.SEED)
    setup_logger(cfg.OUTPUT_DIR)

    if torch.cuda.is_available() and cfg.USE_CUDA:
        torch.backends.cudnn.benchmark = True

    print_args(args, cfg)
    print("Collecting env info ...")
    print("** System info **\n{}\n".format(collect_env_info()))

    trainer = build_trainer(cfg)

    trainer.load_model(args.model_dir, epoch=args.load_epoch)
    image = Image.open('/ContextOptimization/CoOp/data/0cd2ed50.png')
    result = trainer.model_inference(image)
    print(result)
    return result

I am looking for the predicted class and predicted probabilities as output.

Any direction would be appreciated.

Thanks
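
The traceback shows .type(...) being called on a PIL image, which suggests the model expects a float tensor batch rather than a PIL object. A minimal sketch of the remaining steps is below. It assumes the image has already been converted with the test-time transform and given a batch dimension; predict and fake_model_inference are hypothetical names, and the stand-in model only exists so that the sketch runs without a checkpoint.

```python
import torch


def predict(model_inference, image_batch, classnames):
    """Return (predicted class name, per-class probabilities) for one image.

    image_batch must be a float tensor of shape (1, 3, H, W), not a PIL image.
    """
    with torch.no_grad():
        logits = model_inference(image_batch)   # shape (1, n_classes)
    probs = logits.softmax(dim=-1).squeeze(0)   # shape (n_classes,)
    return classnames[int(probs.argmax())], probs.tolist()


# Stand-in for trainer.model_inference; returns fixed logits for three classes.
def fake_model_inference(batch):
    return torch.tensor([[2.0, 0.5, -1.0]]).repeat(batch.shape[0], 1)


# Placeholder for a preprocessed image. A real one would come from applying the
# test-time transform to the PIL image and adding a batch dimension.
image_batch = torch.rand(1, 3, 224, 224)
label, probs = predict(fake_model_inference, image_batch, ["cat", "dog", "bird"])
print(label, [round(p, 3) for p in probs])
```

With a real trainer, the tensor would also need to be moved to the same device as the model before the call.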

zero-shot or fine-tune?

To my knowledge, CLIP can be applied directly to zero-shot learning (i.e., unseen/novel classes).
CoOp and CoCoOp don't appear to be zero-shot learning; they require fine-tuning. However, I don't see the details of the fine-tuning in the paper. Am I misunderstanding it? I would also like to know how CLIP is fine-tuned.
I cannot understand Figure 1 in the paper: why can the performance of CoOp and CoCoOp be compared with zero-shot learning?

Cannot reproduce the results of CoOp and CoCoOp

Hi, thanks for the great work, but I found that it is hard to reproduce the results in the paper.

For example, using the released checkpoints in https://github.com/KaiyangZhou/CoOp#models-and-results, the results of vit-b32-ep50 (nctx=16, shots=16, ctp=end, csc=False) on ImageNet are:

	transform	seed1	seed2	seed3
paper	-	66.85	-	-
released checkpoint (inference only)	["random_resized_crop", "random_flip", "normalize"]	64.38	64.72	64.72
released checkpoint (inference only)	["random_flip", "random_translation", "center_crop", "normalize"]	65.11	65.32	65.34
our reproduce (training from scratch then inference)	["random_resized_crop", "random_flip", "normalize"]	65.21	-	-

They are all noticeably lower (64.3~65.3) than the result in the paper (66.85), and using the updated transforms from #8 (comment) with the released checkpoint gives even lower accuracy.

For CoCoOp, the result of vit-b16-ep10 (nctx=4, shots=16, ctp=end) on ImageNet is 71.02, but our reproduction (training from scratch, then inference) gives 70.14, which also falls short.

Our environment information:
V100-32G / Titan RTX
dassl=0.4.2
torch=1.7.1+cu110
torchvision=0.8.2+cu110

I wonder if I am missing something. Thanks a lot.

how to use this work on text retrieval?

Assuming the experimental dataset is coco2014, how should I define the classnames?

AttributeError: 'list' object has no attribute 'to'

I ran into this while running my own dataset, which has been adapted following your method.
There are two problems:

Evaluation works without problems. However, when I train, it shows 'AttributeError: 'list' object has no attribute 'to''.
Why can the program keep running even though the error happened?

Few-shot setting in CoCoOp Experiments

Hello,
Thank you for sharing your great work.

I had a question regarding the few-shot setting in the CoCoOp experiments. In the paper, it is mentioned that CoCoOp follows a zero-shot evaluation (from base to novel classes) but for training the base classes, it uses a few-shot setting. However, generally for zero-shot evaluation, models are trained on the complete base classes.

Does this mean that CoCoOp and CoOp require only a few-shot setting to perform well on novel categories? Can the same training recipe of CoCoOp or CoOp be used when training on all examples of the base classes?

Thank you and kind regards.

regression

Thanks for sharing this code. Was curious if you had considered how to handle a regression task?

I thought I might try a few ideas out, perhaps starting with a simple percentage label. Like [V]1 [V]2 ... [V]M 40% but was curious if you had tried this or had any intuitions.

linear-probe-clip

Hello, I would like to ask a question: when we run the linear-probe CLIP experiment (ViT-B/32), which parameters should be set as tunable? Are they clip_model.visual.ln_post and clip_model.ln_final?

there is an error when i apply it on

Is the CLIP model the original version?

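Hello, thanks for open-sourcing this! Is the CLIP you compare against the original CLIP? It looks to me like your text encoder has been fine-tuned, so shouldn't its generalization ability be weaker than that of the original model trained on 400 million samples?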

How can I register my own dataset?

Hello, thanks for your excellent work! I have a code problem when fine-tuning the model on my own dataset. I followed the organization of CoOp/datasets to write my dataset code, but failed to register the dataset. I also added xxx.yaml in CoOp/configs/datasets, and registration still fails. Could you give me some advice on whether I need to add extra code? Thanks!

is it possible for one RTX3090 24Gb?

Thank you for sharing the code,
May I ask whether it is possible to run this on a single RTX3090 with 24GB?

When I change the code, the result drops considerably!

Thanks for your great work!
The previous issue has been solved, but I found a new one.
If I change the code at CoOp/trainers/coop.py, line 248 in ff61507, from

self.register_model("prompt_learner", self.model.prompt_learner, self.optim, self.sched)

to

self.register_model("model", self.model, self.optim, self.sched)

the result drops considerably (by 20%)!
Can you give me some advice?

Reproducing results for one shot case

Hi,

thanks for your great work!
You state in your paper that the one shot experiments are trained for 50 epochs, but after using your code, it looks like the results you report for one shot are consistent with training for 200 epochs. When training with 50 epochs, I obtain results that are much better than those reported in the paper.

Any idea on what causes this?

Thanks!

When I set CUDA_VISIBLE_DEVICES="1" instead of "0", the code does not work well.

I found that when I set CUDA_VISIBLE_DEVICES="1", the code hangs at
self.model = CustomCLIP(cfg, classnames, clip_model)
self.model.to(self.device)

The to(self.device) call waits for a long time and never returns.

About the configuration of "classnames"

Thanks for your contributions!

I have a question about "classnames = self.dm.dataset.classnames" (line 224, in "CoOp.build_model" in coop.py).
What is the value of "classnames"? I checked the configuration files and couldn't find it.

raise valueError

raise ValueError(
ValueError: The requested one is expected to belong to ['SE', 'MCD', 'MME', 'ADDA', 'CDAC', 'DAEL', 'DANN', 'AdaBN', 'M3SDA', 'SourceOnly', 'DDAIG', 'DAELDG', 'Vanilla', 'CrossGrad', 'DomainMix', 'EntMin', 'FixMatch', 'MixMatch', 'MeanTeacher', 'SupBaseline'], but got [CoOp] (do you mean [CrossGrad]?)
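
The list in the error contains only built-in trainers, which suggests the module that defines the CoOp trainer was never imported before the lookup. In a decorator-based registry, a class becomes known only when the file containing its decorated definition is executed. The sketch below illustrates that mechanism with a hypothetical Registry class; it is not Dassl's implementation, and the same reasoning would apply to a custom dataset that fails to register.

```python
class Registry:
    """Minimal name-to-class registry filled by a decorator."""

    def __init__(self, name):
        self.name = name
        self._entries = {}

    def register(self):
        def decorator(cls):
            self._entries[cls.__name__] = cls
            return cls
        return decorator

    def get(self, key):
        if key not in self._entries:
            known = sorted(self._entries)
            raise ValueError(f"{key!r} is not registered in {self.name}; known: {known}")
        return self._entries[key]


TRAINERS = Registry("TRAINER")


@TRAINERS.register()
class Vanilla:
    pass


# Looking up a trainer whose defining code has not run yet fails.
try:
    TRAINERS.get("CoOp")
except ValueError as err:
    error_before = str(err)
    print(error_before)


# Once the decorated class definition has been executed, the lookup succeeds.
@TRAINERS.register()
class CoOp:
    pass


trainer_cls = TRAINERS.get("CoOp")
print(trainer_cls.__name__)
```

In practice this means checking that the script being run imports the module containing the trainer (or dataset) class before the trainer is built.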

Using CoOP for my own dataset

Hello!
As described in the README, CoOp is used with the datasets that have configs in CoOp/configs/datasets/. If I want to try CoOp on my own dataset, what should I do?

Looking forward to your reply!
Thanks

ModuleNotFoundError: No module named 'trainers.ours'

Hello,
when I try to run one of your example bash commands, I somehow can't figure out how to get rid of this error.

(dassl) user@MBP scripts % bash main.sh caltech101 rn50_ep50 middle 16 1 True
Run this job and save the output to output/caltech101/CoOp/rn50_ep50_1shots/nctx16_cscTrue_ctpmiddle/seed1
Traceback (most recent call last):
  File "train.py", line 27, in <module>
    import trainers.zsclip
  File "/Users/user/Projects/CoOp/CoOp/trainers/zsclip.py", line 10, in <module>
    from .ours import load_clip_to_cpu
ModuleNotFoundError: No module named 'trainers.ours'

GPU Memory Consumption of CoCoOp

Hi, Thanks a lot for the excellent work and the easy-to-use code!
Recently I've been trying to use CoOp and CoCoOp in my research.
However, I have run into a small problem: the GPU memory consumption of CoCoOp seems to be much larger (about 64x in my setting) than that of CoOp, which forces a small batch size and a very long training time. As I understand it, the reason is that the prompts in CoCoOp are generated for each instance instead of once per batch. I've seen the same problem reported in the paper.
May I ask whether there are any tricks during training to accelerate the training process? Thanks so much!
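
The scaling described above can be sketched as a simple count of how many prompt sequences the text encoder has to process per forward pass. The helper name and the sizes (100 classes, batch size 32) are hypothetical and only illustrate the arithmetic.

```python
def prompt_sequences_per_step(n_classes, batch_size, instance_conditional):
    """Number of prompt sequences the text encoder processes in one forward pass."""
    if instance_conditional:
        # One full set of class prompts per image in the batch.
        return n_classes * batch_size
    # One set of class prompts shared by the whole batch.
    return n_classes


n_classes, batch_size = 100, 32  # hypothetical sizes

coop = prompt_sequences_per_step(n_classes, batch_size, instance_conditional=False)
cocoop = prompt_sequences_per_step(n_classes, batch_size, instance_conditional=True)
print(coop, cocoop, cocoop // coop)
```

Under this count the text-encoder workload grows linearly with the batch size, so a smaller batch size is the most direct way to trade speed for memory.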
