
clip-reid's Introduction

CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels [pdf]


Pipeline

(framework figure)

Installation

conda create -n clipreid python=3.8
conda activate clipreid
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
pip install yacs
pip install timm
pip install scikit-image
pip install tqdm
pip install ftfy
pip install regex

Prepare Dataset

Download the datasets (Market-1501, MSMT17, DukeMTMC-reID, Occluded-Duke, VehicleID, VeRi-776), and then unzip them to your_dataset_dir.

Training

For example, if you want to run the CNN-based CLIP-ReID baseline on Market-1501, modify the bottom of configs/person/cnn_base.yml to

DATASETS:
   NAMES: ('market1501')
   ROOT_DIR: ('your_dataset_dir')
OUTPUT_DIR: 'your_output_dir'

then run

CUDA_VISIBLE_DEVICES=0 python train.py --config_file configs/person/cnn_base.yml

If you want to run ViT-based CLIP-ReID on MSMT17, modify the bottom of configs/person/vit_clipreid.yml to

DATASETS:
   NAMES: ('msmt17')
   ROOT_DIR: ('your_dataset_dir')
OUTPUT_DIR: 'your_output_dir'

then run

CUDA_VISIBLE_DEVICES=0 python train_clipreid.py --config_file configs/person/vit_clipreid.yml

If you want to run ViT-based CLIP-ReID+SIE+OLP on MSMT17, run:

CUDA_VISIBLE_DEVICES=0 python train_clipreid.py --config_file configs/person/vit_clipreid.yml  MODEL.SIE_CAMERA True MODEL.SIE_COE 1.0 MODEL.STRIDE_SIZE '[12, 12]'

Evaluation

For example, if you want to test ViT-based CLIP-ReID on MSMT17, run:

CUDA_VISIBLE_DEVICES=0 python test_clipreid.py --config_file configs/person/vit_clipreid.yml TEST.WEIGHT 'your_trained_checkpoints_path/ViT-B-16_60.pth'

Acknowledgement

Codebase from TransReID, CLIP, and CoOp.

The VeRi-776 viewpoint labels are from https://github.com/Zhongdao/VehicleReIDKeyPointData.

Trained models and test logs

Datasets                MSMT17      Market      Duke        Occ-Duke    VeRi        VehicleID
CNN-baseline            model|test  model|test  model|test  model|test  model|test  model|test
CNN-CLIP-ReID           model|test  model|test  model|test  model|test  model|test  model|test
ViT-baseline            model|test  model|test  model|test  model|test  model|test  model|test
ViT-CLIP-ReID           model|test  model|test  model|test  model|test  model|test  model|test
ViT-CLIP-ReID-SIE-OLP   model|test  model|test  model|test  model|test  model|test  model|test

Note that all results listed above are without re-ranking.

With re-ranking, ViT-CLIP-ReID-SIE-OLP achieves 86.7% mAP and 91.1% R1 on MSMT17.

Citation

If you use this code for your research, please cite

@article{li2022clip,
  title={CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels},
  author={Li, Siyuan and Sun, Li and Li, Qingli},
  journal={arXiv preprint arXiv:2211.13977},
  year={2022}
}

clip-reid's People

Contributors

Syliz517

clip-reid's Issues

Training CLIP-ReID on a Custom Dataset: Player Re-identification Challenge

Hello CLIP-ReID maintainers,

First off, I want to thank you all for creating and maintaining this incredible repository.

I'm writing this issue to seek guidance on a particular aspect of using CLIP-ReID: training the model on a custom dataset. The dataset I'm interested in is from the 'Player Re-identification Challenge' repository, which you can find here.

I've gone through the code, but I couldn't find specific instructions on how to use a custom dataset for training. I have been able to train CLIP-ReID with the Market1501 dataset with no problem.

text encoder is not fixed in first stage training

As the paper describes, in the first stage both the text and image encoders are fixed and only the text tokens are optimized. However, in the code, it seems the text encoder is optimized during training. Could I ask if I have misunderstood something?
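
For context, here is a minimal sketch (an assumption about the intended setup, not the repository's exact code) of how stage 1 is usually arranged so that only the learnable text tokens receive gradients. The attribute names image_encoder, text_encoder and prompt_learner match the checkpoint keys quoted in other issues on this page, and model stands for the built CLIP-ReID model.

import torch

# Freeze both CLIP encoders; leave only the identity-specific prompt tokens trainable.
for p in model.image_encoder.parameters():
    p.requires_grad_(False)
for p in model.text_encoder.parameters():
    p.requires_grad_(False)
for p in model.prompt_learner.parameters():
    p.requires_grad_(True)

# Hand the optimizer only the parameters that still require gradients.
stage1_optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=3.5e-4)  # lr is illustrative

Note that even if frozen parameters were passed to the optimizer, they would receive no updates as long as requires_grad is False.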

Error when training

Hi!
I get the following error:
(screenshot attached: 2023-10-04, 21:06)
As far as I understand, this error is not related to the custom dataset, but to the model architecture itself.

About training process

Hi. Thanks for your great work!

Can I ask about the explanation of the code execution?

If I want to reproduce the Market1501 result of your paper with CNN baseline,

do I first need to train the image encoder with the strong re-ID method, using the command below?

  • CUDA_VISIBLE_DEVICES=0 python train.py --config_file configs/person/cnn_base.yml

And if I have both a pretrained image encoder and text encoder, does the command below run stage 1 training to optimize the learnable tokens, and also stage 2 training?

  • CUDA_VISIBLE_DEVICES=0 python train_clipreid.py --config_file configs/person/cnn_clipreid.yml

But where does the text encoder training happen? Is it loaded automatically in the code?

Also, how should I test the CNN-based model after stage 2 training?

Thanks in advance.

Failed to export the model to ONNX

I want to use my model in the ONNX format for deployment purposes.

Description:

I encountered a runtime error while trying to export a PyTorch model to ONNX format. The error message indicates an internal assertion failure related to the aten::eq operation. Here are the details of the error and the code snippet used for the export.

Error Message:

RuntimeError: 0 INTERNAL ASSERT FAILED at "/Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/jit/ir/alias_analysis.cpp":621, please report a bug to PyTorch. We don't have an op for aten::eq but it isn't a special case.  Argument types: Tensor, bool, 
Candidates:
    aten::eq.Tensor(Tensor self, Tensor other) -> Tensor
    aten::eq.Scalar(Tensor self, Scalar other) -> Tensor
    aten::eq.Scalar_out(Tensor self, Scalar other, *, Tensor(a!) out) -> Tensor(a!)
    aten::eq.Tensor_out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
    aten::eq.int_list(int[] a, int[] b) -> bool
    aten::eq.device(Device a, Device b) -> bool
    aten::eq.bool(bool a, bool b) -> bool
    aten::eq.enum(AnyEnumType a, AnyEnumType b) -> bool
    aten::eq.int(int a, int b) -> bool
    aten::eq.complex(complex a, complex b) -> bool
    aten::eq.float(float a, float b) -> bool
    aten::eq.int_float(int a, float b) -> bool
    aten::eq.float_int(float a, int b) -> bool
    aten::eq.float_complex(float a, complex b) -> bool
    aten::eq.complex_float(complex a, float b) -> bool
    aten::eq(Scalar a, Scalar b) -> bool
    aten::eq.str(str a, str b) -> bool
    aten::eq.float_list(float[] a, float[] b) -> bool
    aten::eq.Tensor_list(Tensor[] a, Tensor[] b) -> bool
    aten::eq.bool_list(bool[] a, bool[] b) -> bool
    aten::eq.str_list(str[] a, str[] b) -> bool

Code Snippet:

# Both the cnn_clipreid and vit_clipreid models failed to export.
import torch

model.eval()

# Dummy person-ReID input: a batch of one 256x128 RGB image.
inp = [torch.randn((1, 3, 256, 128), requires_grad=False)]
model(*inp)  # a plain forward pass works fine
torch.onnx.export(
    model, tuple(inp), onnx_path,
    export_params=True,
    training=torch.onnx.TrainingMode.PRESERVE,
    do_constant_folding=False,
    opset_version=17,
)

System Information:

  • OS: ubuntu 20.04
  • PyTorch Version: Tried with pytorch==1.8.0 (failed) and torch==2.3.1 (failed)
  • Python Version: 3.10.9

I would greatly appreciate any insights into which part of the model might be causing the above error.

Duke dataset

Has the paper been accepted? Since it uses the Duke dataset, will you be asked to remove the experiments on that dataset?

Issue in evaluating the models

Hey, thanks for this excellent work of yours.
I have trained a model on a custom dataset; when I try to load the model for evaluation, the script raises the following error:

  Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
  Position embedding resize to height:16 width: 8
  Traceback (most recent call last):
  File "test_clipreid.py", line 44, in <module>
  model.load_param_finetune(cfg.TEST.WEIGHT)
  File "/app/model/make_model_clipreid.py", line 173, in load_param_finetune
  self.state_dict()[i].copy_(param_dict[i])
  RuntimeError: The size of tensor a (129) must match the size of tensor b (211) at non-singleton dimension 0

The log shows 'Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])' during evaluation, whereas during training it shows 'Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])'. Could you please look into this?

The same happens when I try to load the VeRi and Market1501 fine-tuned models with the scripts you have provided.
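
A plausible explanation for the 129 vs. 211 mismatch (an assumption, not a confirmed diagnosis): the length of the position-embedding table is (H / stride) * (W / stride) + 1, so a checkpoint trained with MODEL.STRIDE_SIZE '[12, 12]' cannot be loaded into a model rebuilt with the default stride of 16. A quick sanity check:

# Number of position embeddings for a ViT with one class token.
def num_pos_tokens(height, width, stride):
    return (height // stride) * (width // stride) + 1

print(num_pos_tokens(256, 128, 16))  # 129 -> the model built at test time
print(num_pos_tokens(256, 128, 12))  # 211 -> a checkpoint trained with stride 12

If that is the cause, passing the same STRIDE_SIZE and input size at test time as during training should make the shapes agree.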

Request for attention visualization

Hello, thank you for the excellent results and for sharing your research.

Regarding the visualization of CLIP-ReID mentioned in the Ablation Studies and Analysis section of the paper, which follows

Chefer, H.; Gur, S.; and Wolf, L. 2021. "Transformer Interpretability Beyond Attention Visualization." In Proceedings of the IEEE/CVF CVPR, pages 782-791.

I would like to visualize my training results similarly to Figure 3 of your paper. Could I please get access to the code you used for the visualization?

[Question] Unexpected Performance Drop with ViT/L14?

I have been playing about with your CLIP ReID model and I appreciate the effectiveness of your approach.

Recently, I conducted an experiment on Market-1501 to investigate whether the model can be further improved by using a larger backbone. Specifically, I replaced the ViT-B/16 backbone with ViT-L/14 (I changed the projection planes in make_clip_reid.py to make it work, etc.). Intuitively, one might expect a larger model to deliver better performance. However, the results were counterintuitive.

Here are the results obtained with the original ViT-B/16:

mAP: 89.8%
CMC curve, Rank-1  :95.3%
CMC curve, Rank-5  :98.6%
CMC curve, Rank-10 :99.2%

And here are the results obtained with the ViT-L/14:

mAP: 79.1%
CMC curve, Rank-1  :90.7%
CMC curve, Rank-5  :96.7%
CMC curve, Rank-10 :98.1%

It appears that performance with the ViT-L/14 architecture is significantly lower than with ViT-B/16. I double-checked the modifications and ensured that the experiment settings were identical, save for the architecture swap.

For reference, I'm attaching the training logs of both models:

train_log-market1501-V14.txt
train_log-market1501-B16.txt

I would greatly appreciate your insights into why the ViT/L14 architecture might underperform compared to ViT-B16 in this context. I am new to using ViT models in ReID so any guidance on how the model could potentially be fine-tuned for the larger architecture would also be appreciated!

A question about the prompt training code

Hello, sorry to bother you. The first line of PromptLearner's forward function is cls_ctx = self.cls_ctx[label], and I don't quite understand it. Taking Market-1501 as an example, during training self.cls_ctx is a (751, 4, 512) tensor; with a batch size of 64, this line picks out the self.cls_ctx entries for the corresponding labels. Since self.cls_ctx keeps being updated during stage-1 training, each of the 751 training identities ends up with its own (4, 512) prompt vector. But at inference time there are 750 new identities, so how does this generalize?

Sorry if the question sounds a bit silly. I have read the CoCoOp paper and source code but still don't understand, and the task is also somewhat different. I hope to get your guidance. Best wishes.
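
For readers with the same question, here is a minimal sketch of the indexing being asked about; the dimensions are taken from the question itself, and the code is illustrative rather than the repository's PromptLearner.

import torch
import torch.nn as nn

num_ids, n_ctx, dim = 751, 4, 512                  # Market-1501 training identities
cls_ctx = nn.Parameter(torch.randn(num_ids, n_ctx, dim) * 0.02)

labels = torch.tensor([0, 5, 5, 750])              # identity labels of one training batch
batch_ctx = cls_ctx[labels]                        # (4, 4, 512): one token set per sample

As for generalization: the learned per-identity tokens are only used during training to produce text features that supervise the image encoder; at test time retrieval compares image features alone, so unseen query identities never need prompts of their own.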

Custom Dataset

Hi! Thank you for your work. Do you have any guidance on how to train the model on a custom dataset? Thanks.

MSMT17: train += val

Hi @Syliz517 ,
First, thanks for your great work.
One point I would like to ask about is the MSMT17 training data: the code shows that both the train and val sets are used for training (train += val), but this is not mentioned in the paper. Hence, I am not sure whether CLIP-ReID would still be SoTA on MSMT17 if only the train set were used. Have you done any experiments on this?
Thank you for your time.

How to fix KeyError: 'cv_embed'?

When testing, I ran into a problem where the key cv_embed seems to be missing, but I printed the keys and cv_embed does exist.
The code and the output are listed below; thanks for any help!

def load_param(self, trained_path):
    param_dict = torch.load(trained_path)
    for i in param_dict:
        print("Keys in param_dict:", param_dict.keys())
        self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
    print('Loading pretrained model from {}'.format(trained_path))

"Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. "
=> Market1501 loaded
Dataset statistics:
  ----------------------------------------
  subset   | # ids | # images | # cameras
  ----------------------------------------
  train    |   751 |    12936 |         6
  query    |   750 |     3368 |         6
  gallery  |   751 |    15913 |         6
  ----------------------------------------

Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([129, 768])
Position embedding resize to height:16 width: 8
Keys in param_dict: odict_keys(['cv_embed', 'classifier.weight', 'classifier_proj.weight', 'bottleneck.weight', 'bottleneck.bias', 'bottleneck.running_mean', 'bottleneck.running_var', 'bottleneck.num_batches_tracked', 'bottleneck_proj.weight', 'bottleneck_proj.bias', 'bottleneck_proj.running_mean', 'bottleneck_proj.running_var', 'bottleneck_proj.num_batches_tracked', 'image_encoder.class_embedding', 'image_encoder.positional_embedding', 'image_encoder.proj', 'image_encoder.conv1.weight', 'image_encoder.ln_pre.weight', 'image_encoder.ln_pre.bias', 'image_encoder.transformer.resblocks.0.attn.in_proj_weight', 'image_encoder.transformer.resblocks.0.attn.in_proj_bias', 'image_encoder.transformer.resblocks.0.attn.out_proj.weight', 'image_encoder.transformer.resblocks.0.attn.out_proj.bias', 'image_encoder.transformer.resblocks.0.ln_1.weight', 'image_encoder.transformer.resblocks.0.ln_1.bias', 'image_encoder.transformer.resblocks.0.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.0.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.0.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.0.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.0.ln_2.weight', 'image_encoder.transformer.resblocks.0.ln_2.bias', 'image_encoder.transformer.resblocks.1.attn.in_proj_weight', 'image_encoder.transformer.resblocks.1.attn.in_proj_bias', 'image_encoder.transformer.resblocks.1.attn.out_proj.weight', 'image_encoder.transformer.resblocks.1.attn.out_proj.bias', 'image_encoder.transformer.resblocks.1.ln_1.weight', 'image_encoder.transformer.resblocks.1.ln_1.bias', 'image_encoder.transformer.resblocks.1.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.1.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.1.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.1.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.1.ln_2.weight', 'image_encoder.transformer.resblocks.1.ln_2.bias', 'image_encoder.transformer.resblocks.2.attn.in_proj_weight', 'image_encoder.transformer.resblocks.2.attn.in_proj_bias', 'image_encoder.transformer.resblocks.2.attn.out_proj.weight', 'image_encoder.transformer.resblocks.2.attn.out_proj.bias', 'image_encoder.transformer.resblocks.2.ln_1.weight', 'image_encoder.transformer.resblocks.2.ln_1.bias', 'image_encoder.transformer.resblocks.2.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.2.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.2.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.2.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.2.ln_2.weight', 'image_encoder.transformer.resblocks.2.ln_2.bias', 'image_encoder.transformer.resblocks.3.attn.in_proj_weight', 'image_encoder.transformer.resblocks.3.attn.in_proj_bias', 'image_encoder.transformer.resblocks.3.attn.out_proj.weight', 'image_encoder.transformer.resblocks.3.attn.out_proj.bias', 'image_encoder.transformer.resblocks.3.ln_1.weight', 'image_encoder.transformer.resblocks.3.ln_1.bias', 'image_encoder.transformer.resblocks.3.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.3.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.3.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.3.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.3.ln_2.weight', 'image_encoder.transformer.resblocks.3.ln_2.bias', 'image_encoder.transformer.resblocks.4.attn.in_proj_weight', 'image_encoder.transformer.resblocks.4.attn.in_proj_bias', 'image_encoder.transformer.resblocks.4.attn.out_proj.weight', 'image_encoder.transformer.resblocks.4.attn.out_proj.bias', 
'image_encoder.transformer.resblocks.4.ln_1.weight', 'image_encoder.transformer.resblocks.4.ln_1.bias', 'image_encoder.transformer.resblocks.4.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.4.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.4.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.4.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.4.ln_2.weight', 'image_encoder.transformer.resblocks.4.ln_2.bias', 'image_encoder.transformer.resblocks.5.attn.in_proj_weight', 'image_encoder.transformer.resblocks.5.attn.in_proj_bias', 'image_encoder.transformer.resblocks.5.attn.out_proj.weight', 'image_encoder.transformer.resblocks.5.attn.out_proj.bias', 'image_encoder.transformer.resblocks.5.ln_1.weight', 'image_encoder.transformer.resblocks.5.ln_1.bias', 'image_encoder.transformer.resblocks.5.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.5.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.5.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.5.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.5.ln_2.weight', 'image_encoder.transformer.resblocks.5.ln_2.bias', 'image_encoder.transformer.resblocks.6.attn.in_proj_weight', 'image_encoder.transformer.resblocks.6.attn.in_proj_bias', 'image_encoder.transformer.resblocks.6.attn.out_proj.weight', 'image_encoder.transformer.resblocks.6.attn.out_proj.bias', 'image_encoder.transformer.resblocks.6.ln_1.weight', 'image_encoder.transformer.resblocks.6.ln_1.bias', 'image_encoder.transformer.resblocks.6.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.6.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.6.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.6.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.6.ln_2.weight', 'image_encoder.transformer.resblocks.6.ln_2.bias', 'image_encoder.transformer.resblocks.7.attn.in_proj_weight', 'image_encoder.transformer.resblocks.7.attn.in_proj_bias', 'image_encoder.transformer.resblocks.7.attn.out_proj.weight', 'image_encoder.transformer.resblocks.7.attn.out_proj.bias', 'image_encoder.transformer.resblocks.7.ln_1.weight', 'image_encoder.transformer.resblocks.7.ln_1.bias', 'image_encoder.transformer.resblocks.7.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.7.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.7.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.7.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.7.ln_2.weight', 'image_encoder.transformer.resblocks.7.ln_2.bias', 'image_encoder.transformer.resblocks.8.attn.in_proj_weight', 'image_encoder.transformer.resblocks.8.attn.in_proj_bias', 'image_encoder.transformer.resblocks.8.attn.out_proj.weight', 'image_encoder.transformer.resblocks.8.attn.out_proj.bias', 'image_encoder.transformer.resblocks.8.ln_1.weight', 'image_encoder.transformer.resblocks.8.ln_1.bias', 'image_encoder.transformer.resblocks.8.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.8.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.8.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.8.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.8.ln_2.weight', 'image_encoder.transformer.resblocks.8.ln_2.bias', 'image_encoder.transformer.resblocks.9.attn.in_proj_weight', 'image_encoder.transformer.resblocks.9.attn.in_proj_bias', 'image_encoder.transformer.resblocks.9.attn.out_proj.weight', 'image_encoder.transformer.resblocks.9.attn.out_proj.bias', 'image_encoder.transformer.resblocks.9.ln_1.weight', 'image_encoder.transformer.resblocks.9.ln_1.bias', 'image_encoder.transformer.resblocks.9.mlp.c_fc.weight', 
'image_encoder.transformer.resblocks.9.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.9.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.9.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.9.ln_2.weight', 'image_encoder.transformer.resblocks.9.ln_2.bias', 'image_encoder.transformer.resblocks.10.attn.in_proj_weight', 'image_encoder.transformer.resblocks.10.attn.in_proj_bias', 'image_encoder.transformer.resblocks.10.attn.out_proj.weight', 'image_encoder.transformer.resblocks.10.attn.out_proj.bias', 'image_encoder.transformer.resblocks.10.ln_1.weight', 'image_encoder.transformer.resblocks.10.ln_1.bias', 'image_encoder.transformer.resblocks.10.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.10.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.10.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.10.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.10.ln_2.weight', 'image_encoder.transformer.resblocks.10.ln_2.bias', 'image_encoder.transformer.resblocks.11.attn.in_proj_weight', 'image_encoder.transformer.resblocks.11.attn.in_proj_bias', 'image_encoder.transformer.resblocks.11.attn.out_proj.weight', 'image_encoder.transformer.resblocks.11.attn.out_proj.bias', 'image_encoder.transformer.resblocks.11.ln_1.weight', 'image_encoder.transformer.resblocks.11.ln_1.bias', 'image_encoder.transformer.resblocks.11.mlp.c_fc.weight', 'image_encoder.transformer.resblocks.11.mlp.c_fc.bias', 'image_encoder.transformer.resblocks.11.mlp.c_proj.weight', 'image_encoder.transformer.resblocks.11.mlp.c_proj.bias', 'image_encoder.transformer.resblocks.11.ln_2.weight', 'image_encoder.transformer.resblocks.11.ln_2.bias', 'image_encoder.ln_post.weight', 'image_encoder.ln_post.bias', 'prompt_learner.cls_ctx', 'prompt_learner.token_prefix', 'prompt_learner.token_suffix', 'text_encoder.positional_embedding', 'text_encoder.text_projection', 'text_encoder.transformer.resblocks.0.attn.in_proj_weight', 'text_encoder.transformer.resblocks.0.attn.in_proj_bias', 'text_encoder.transformer.resblocks.0.attn.out_proj.weight', 'text_encoder.transformer.resblocks.0.attn.out_proj.bias', 'text_encoder.transformer.resblocks.0.ln_1.weight', 'text_encoder.transformer.resblocks.0.ln_1.bias', 'text_encoder.transformer.resblocks.0.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.0.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.0.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.0.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.0.ln_2.weight', 'text_encoder.transformer.resblocks.0.ln_2.bias', 'text_encoder.transformer.resblocks.1.attn.in_proj_weight', 'text_encoder.transformer.resblocks.1.attn.in_proj_bias', 'text_encoder.transformer.resblocks.1.attn.out_proj.weight', 'text_encoder.transformer.resblocks.1.attn.out_proj.bias', 'text_encoder.transformer.resblocks.1.ln_1.weight', 'text_encoder.transformer.resblocks.1.ln_1.bias', 'text_encoder.transformer.resblocks.1.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.1.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.1.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.1.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.1.ln_2.weight', 'text_encoder.transformer.resblocks.1.ln_2.bias', 'text_encoder.transformer.resblocks.2.attn.in_proj_weight', 'text_encoder.transformer.resblocks.2.attn.in_proj_bias', 'text_encoder.transformer.resblocks.2.attn.out_proj.weight', 'text_encoder.transformer.resblocks.2.attn.out_proj.bias', 'text_encoder.transformer.resblocks.2.ln_1.weight', 'text_encoder.transformer.resblocks.2.ln_1.bias', 
'text_encoder.transformer.resblocks.2.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.2.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.2.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.2.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.2.ln_2.weight', 'text_encoder.transformer.resblocks.2.ln_2.bias', 'text_encoder.transformer.resblocks.3.attn.in_proj_weight', 'text_encoder.transformer.resblocks.3.attn.in_proj_bias', 'text_encoder.transformer.resblocks.3.attn.out_proj.weight', 'text_encoder.transformer.resblocks.3.attn.out_proj.bias', 'text_encoder.transformer.resblocks.3.ln_1.weight', 'text_encoder.transformer.resblocks.3.ln_1.bias', 'text_encoder.transformer.resblocks.3.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.3.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.3.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.3.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.3.ln_2.weight', 'text_encoder.transformer.resblocks.3.ln_2.bias', 'text_encoder.transformer.resblocks.4.attn.in_proj_weight', 'text_encoder.transformer.resblocks.4.attn.in_proj_bias', 'text_encoder.transformer.resblocks.4.attn.out_proj.weight', 'text_encoder.transformer.resblocks.4.attn.out_proj.bias', 'text_encoder.transformer.resblocks.4.ln_1.weight', 'text_encoder.transformer.resblocks.4.ln_1.bias', 'text_encoder.transformer.resblocks.4.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.4.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.4.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.4.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.4.ln_2.weight', 'text_encoder.transformer.resblocks.4.ln_2.bias', 'text_encoder.transformer.resblocks.5.attn.in_proj_weight', 'text_encoder.transformer.resblocks.5.attn.in_proj_bias', 'text_encoder.transformer.resblocks.5.attn.out_proj.weight', 'text_encoder.transformer.resblocks.5.attn.out_proj.bias', 'text_encoder.transformer.resblocks.5.ln_1.weight', 'text_encoder.transformer.resblocks.5.ln_1.bias', 'text_encoder.transformer.resblocks.5.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.5.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.5.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.5.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.5.ln_2.weight', 'text_encoder.transformer.resblocks.5.ln_2.bias', 'text_encoder.transformer.resblocks.6.attn.in_proj_weight', 'text_encoder.transformer.resblocks.6.attn.in_proj_bias', 'text_encoder.transformer.resblocks.6.attn.out_proj.weight', 'text_encoder.transformer.resblocks.6.attn.out_proj.bias', 'text_encoder.transformer.resblocks.6.ln_1.weight', 'text_encoder.transformer.resblocks.6.ln_1.bias', 'text_encoder.transformer.resblocks.6.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.6.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.6.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.6.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.6.ln_2.weight', 'text_encoder.transformer.resblocks.6.ln_2.bias', 'text_encoder.transformer.resblocks.7.attn.in_proj_weight', 'text_encoder.transformer.resblocks.7.attn.in_proj_bias', 'text_encoder.transformer.resblocks.7.attn.out_proj.weight', 'text_encoder.transformer.resblocks.7.attn.out_proj.bias', 'text_encoder.transformer.resblocks.7.ln_1.weight', 'text_encoder.transformer.resblocks.7.ln_1.bias', 'text_encoder.transformer.resblocks.7.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.7.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.7.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.7.mlp.c_proj.bias', 
'text_encoder.transformer.resblocks.7.ln_2.weight', 'text_encoder.transformer.resblocks.7.ln_2.bias', 'text_encoder.transformer.resblocks.8.attn.in_proj_weight', 'text_encoder.transformer.resblocks.8.attn.in_proj_bias', 'text_encoder.transformer.resblocks.8.attn.out_proj.weight', 'text_encoder.transformer.resblocks.8.attn.out_proj.bias', 'text_encoder.transformer.resblocks.8.ln_1.weight', 'text_encoder.transformer.resblocks.8.ln_1.bias', 'text_encoder.transformer.resblocks.8.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.8.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.8.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.8.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.8.ln_2.weight', 'text_encoder.transformer.resblocks.8.ln_2.bias', 'text_encoder.transformer.resblocks.9.attn.in_proj_weight', 'text_encoder.transformer.resblocks.9.attn.in_proj_bias', 'text_encoder.transformer.resblocks.9.attn.out_proj.weight', 'text_encoder.transformer.resblocks.9.attn.out_proj.bias', 'text_encoder.transformer.resblocks.9.ln_1.weight', 'text_encoder.transformer.resblocks.9.ln_1.bias', 'text_encoder.transformer.resblocks.9.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.9.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.9.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.9.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.9.ln_2.weight', 'text_encoder.transformer.resblocks.9.ln_2.bias', 'text_encoder.transformer.resblocks.10.attn.in_proj_weight', 'text_encoder.transformer.resblocks.10.attn.in_proj_bias', 'text_encoder.transformer.resblocks.10.attn.out_proj.weight', 'text_encoder.transformer.resblocks.10.attn.out_proj.bias', 'text_encoder.transformer.resblocks.10.ln_1.weight', 'text_encoder.transformer.resblocks.10.ln_1.bias', 'text_encoder.transformer.resblocks.10.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.10.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.10.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.10.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.10.ln_2.weight', 'text_encoder.transformer.resblocks.10.ln_2.bias', 'text_encoder.transformer.resblocks.11.attn.in_proj_weight', 'text_encoder.transformer.resblocks.11.attn.in_proj_bias', 'text_encoder.transformer.resblocks.11.attn.out_proj.weight', 'text_encoder.transformer.resblocks.11.attn.out_proj.bias', 'text_encoder.transformer.resblocks.11.ln_1.weight', 'text_encoder.transformer.resblocks.11.ln_1.bias', 'text_encoder.transformer.resblocks.11.mlp.c_fc.weight', 'text_encoder.transformer.resblocks.11.mlp.c_fc.bias', 'text_encoder.transformer.resblocks.11.mlp.c_proj.weight', 'text_encoder.transformer.resblocks.11.mlp.c_proj.bias', 'text_encoder.transformer.resblocks.11.ln_2.weight', 'text_encoder.transformer.resblocks.11.ln_2.bias', 'text_encoder.ln_final.weight', 'text_encoder.ln_final.bias'])
Traceback (most recent call last):
File "test_clipreid.py", line 44, in
model.load_param(cfg.TEST.WEIGHT)
File "/home/disk/hgk/CLIP-ReID-master/model/make_model_clipreid.py", line 160, in load_param
self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
KeyError: 'cv_embed'
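
A hedged debugging sketch for this kind of KeyError (an assumption, not a confirmed fix): cv_embed is the SIE camera/view embedding, so it only exists on the model when the SIE options are enabled, and a checkpoint saved with MODEL.SIE_CAMERA True then contains a key that a model built without SIE does not have. Comparing the two key sets makes any such mismatch explicit (model below stands for the model built by test_clipreid.py):

import torch

param_dict = torch.load('your_trained_checkpoints_path/ViT-B-16_60.pth', map_location='cpu')
ckpt_keys = {k.replace('module.', '') for k in param_dict}
model_keys = set(model.state_dict().keys())

print('in checkpoint but missing from model:', sorted(ckpt_keys - model_keys))
print('in model but missing from checkpoint:', sorted(model_keys - ckpt_keys))

If cv_embed appears in the first list, rebuilding the model with the same SIE flags used for training (e.g. MODEL.SIE_CAMERA True on the test command line) should make the keys line up.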

Code execution problem

Traceback (most recent call last):
File "/media/lele/c/zuozhigang/CLIP_ReID/Base/train_clipreid.py", line 89, in
do_train_stage2(
File "/media/lele/c/zuozhigang/CLIP_ReID/Base/processor/processor_clipreid_stage2.py", line 98, in do_train_stage2
loss = loss_fn(score, feat, target, target_cam, logits)
TypeError: loss_func() takes 3 positional arguments but 5 were given

Has anyone run into this problem, and if so, how did you solve it?

Missing Comparisons

Thanks for releasing this code for CLIP-based Re-ID. This is a good attempt at improving Re-ID. However, I have some key concerns:

  1. Overclaims
    In fact, your work is not the first to adopt CLIP for Re-ID. Please check the following paper from MMSports '22, October 10, 2022:
    Konrad Habel et al., CLIP-ReIdent: Contrastive Training for Player Re-Identification.
    Besides, I think the related work section is incomplete. There are many other Transformer-based methods that should be discussed. For example, HAT (HAT: Hierarchical Aggregation Transformers for Person Re-identification) has already used multi-level supervision (similarly highlighted in the last sentence of your training details). LAFomer uses a local-aware transformer for re-identification.
    It would be better for the authors to revise these parts.

  2. Missing key comparisons
    I appreciate that the authors provide ablations. However, what is the effect of using the multi-level supervision ("Note that we also employ Ltri after the 11th transformer layer of ViT-B/16 and the 3rd residual layer of ResNet-50.")? In fact, this supervision generally shows better results than supervision on the last layer only. This may lead to unfair comparisons.

    Since training requires feeding all the images, what are the training time and test speed on your devices (these are also not listed)?

About Fig3 in the paper

Hi. Thank you for sharing your work.

How did you visualize Figure 3?

Could you also provide the code for it?

Performance on MSMT17 is lower than the paper claims

Hi @Syliz517 ,
Thanks for sharing your great work.

I have downloaded some of the pretrained model weights and evaluated them. On Market1501 and Duke the results match the paper, but on MSMT17 they do not.
Regarding MSMT17, I downloaded the weights from ViT-CLIP-ReID-SIE-OLP and here is the config I used: msmt17.log. Please take a look in case I misconfigured something.
The result is here:

=> MSMT17 loaded
Dataset statistics:
  ----------------------------------------
  subset   | # ids | # images | # cameras
  ----------------------------------------
  train    |  1041 |    32621 |        15
  query    |  3060 |    11659 |        15
  gallery  |  3060 |    82161 |        15
  ----------------------------------------
Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])
Position embedding resize to height:21 width: 10
camera number is : 15
Loading pretrained model from checkpoints/MSMT17_clipreid_12x12sie_ViT-B-16_60.pth
2024-05-13 10:24:50,597 transreid.test INFO: Enter inferencing
The test feature is normalized
=> Computing DistMat with euclidean_distance
2024-05-13 10:30:45,340 transreid.test INFO: Validation Results 
2024-05-13 10:30:45,355 transreid.test INFO: mAP: 70.7%
2024-05-13 10:30:45,355 transreid.test INFO: CMC curve, Rank-1  :87.0%
2024-05-13 10:30:45,355 transreid.test INFO: CMC curve, Rank-5  :93.4%
2024-05-13 10:30:45,355 transreid.test INFO: CMC curve, Rank-10 :94.9%

Thanks for your time.

Confusion about adapting position embedding with resolution change in ViT backbone

Hello Author,

Thank you for your work on CLIP-ReID. I'm somewhat confused about how the position embeddings of the Vision Transformer (ViT) backbone should be adapted when the input resolution is changed, and about how to load the CLIP model weights correctly in that case.

The original CLIP model is trained with a certain resolution, and I understand that the position embeddings are tied to this specific resolution. When the input resolution is changed, it's unclear to me how the position embeddings should be adapted.

When loading the CLIP weights with a modified input resolution, are there any special considerations or steps to ensure the weights are loaded correctly?

I've gone through the documentation and issues but haven't found a clear explanation on this topic. Any guidance, documentation references, or examples would be greatly appreciated.

Thank you for your time and assistance.

Best regards
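
For anyone else wondering, the usual recipe (and what the "Resized position embedding" / bilinear-upsampling log lines quoted elsewhere on this page suggest happens here) is to keep the class-token embedding and interpolate the 2-D patch grid to the new size. A minimal sketch with illustrative names, not the repository's exact function:

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_h, new_w):
    # pos_embed: (1 + old_h * old_w, dim), e.g. (197, 768) for 224x224 / patch 16 (14x14 grid)
    cls_tok, grid = pos_embed[:1], pos_embed[1:]
    old = int(grid.shape[0] ** 0.5)                       # original square grid side (14)
    grid = grid.reshape(1, old, old, -1).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode='bilinear', align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(new_h * new_w, -1)
    return torch.cat([cls_tok, grid], dim=0)

# 256x128 input with stride 16 -> 16x8 patch grid, i.e. the (129, 768) table seen in the logs
new_embed = resize_pos_embed(torch.randn(197, 768), 16, 8)
print(new_embed.shape)  # torch.Size([129, 768])

The pretrained CLIP weights are loaded as usual; only the position-embedding tensor is resized this way before being copied into the model.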

ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS

Hello, I ran into a problem when running your code. When I run ViT-based CLIP-ReID+SIE+OLP for market1501, I get the error "ValueError: Type mismatch (<class 'yacs.config.CfgNode'> vs. <class 'NoneType'>) with values (NAMES: market1501 ROOT_DIR: ../data vs. None) for config key: DATASETS". I cannot figure it out; can you tell me how to solve it?
I just changed
DATASETS:
NAMES: ('market1501')
ROOT_DIR: '../Market-1501-v15.09.15'
OUTPUT_DIR: '../market1501_out'

Thank you very much!

Thank you for your work! I have a few points of confusion I would like to ask about.

In the code, the image encoder is frozen during the first training stage, while both the learnable text tokens and the text encoder are trainable. This does not match the paper, which says that only the text tokens are learnable and that the image encoder and text encoder are both frozen.

Failure to replicate results?

I used the VeRi dataset to train and evaluate. The results are as follows:

2023-10-24 21:53:58,401 transreid.test INFO: Validation Results
2023-10-24 21:53:58,402 transreid.test INFO: mAP: 75.5%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-1 :92.0%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-5 :94.4%
2023-10-24 21:53:58,402 transreid.test INFO: CMC curve, Rank-10 :95.9%

Rank-1 is 4.8% lower than in the paper... I double-checked the configs and ensured that the experiment settings were identical.

For reference, I'm attaching the training logs of the models:

train_log_cnn_prom_veri.txt

pre-trained CLIP-ReID for evaluation when having no train data

Hi,

How can I use a pre-trained CLIP-ReID model to evaluate or extract features on a custom dataset when I don't have data to train on?

I know that I need to change an existing configuration file like vit_clipreid.yml to point to my eval data, but is there an example of how to run the evaluation script when no TEST.WEIGHT parameter is set?

Thanks
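
A minimal sketch of plain feature extraction with an already-built CLIP-ReID model loaded from one of the released checkpoints. It assumes the 256x128 input size and ImageNet normalization of the person configs (check these against the config you actually load); 'query.jpg' is a placeholder path, model stands for the model after load_param, and the exact content of the returned embedding depends on the model class.

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model.eval()
with torch.no_grad():
    img = preprocess(Image.open('query.jpg').convert('RGB')).unsqueeze(0)
    feat = model(img)              # the retrieval embedding returned in eval mode
feat = F.normalize(feat, dim=1)    # cosine / normalized-Euclidean matching

Features extracted this way can be compared with a simple dot product or Euclidean distance, mirroring what the test script does over a query/gallery split.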

About the reid model of the vehicle

First of all, thank you very much for your contribution to the field of re-identification!
I had some problems when using your model: when I load the vehicle model, some tensor size mismatch errors are reported. When training on vehicle data, what needs to be done with the ViT backbone? What modifications were made?

Interesting work! Can I use your pre-trained model for my method?

I currently have a Prompt-CLIP work with a similar idea to yours, but at the moment I've only experimented with CLIP-CNN, which has also proven to work. I've created pseudo-text prompts for each identity in the six datasets. I am very inspired by the experimental results in your paper and would like to use your model for fine-tuning. I will be citing your paper in the future!

How to train stage1 and stage2 independently?

Dear @Syliz517 , @awarebayes,
Thanks for sharing your excellent work.
During training, stage 1 sometimes finishes completely but stage 2 gets interrupted, so I want to load the weights trained in stage 1 directly and train stage 2 only, but I have not been successful yet.
If you have any suggestion, please give a hint.
Thanks for your time.

    is_train_stage1 = False
    if is_train_stage1:
        do_train_stage1(...)
    else: # load weights that are trained in stage 1
        # model.load_param_finetune(os.path.join(cfg.OUTPUT_DIR, cfg.MODEL.NAME + '_stage1_{}.pth'.format(cfg.SOLVER.STAGE1.MAX_EPOCHS)))
        model.load_param(os.path.join(cfg.OUTPUT_DIR, cfg.MODEL.NAME + '_stage1_{}.pth'.format(cfg.SOLVER.STAGE1.MAX_EPOCHS)))

    optimizer_2stage, optimizer_center_2stage = make_optimizer_2stage(cfg, model, center_criterion)
    scheduler_2stage = WarmupMultiStepLR(optimizer_2stage, cfg.SOLVER.STAGE2.STEPS, cfg.SOLVER.STAGE2.GAMMA, cfg.SOLVER.STAGE2.WARMUP_FACTOR,
                                  cfg.SOLVER.STAGE2.WARMUP_ITERS, cfg.SOLVER.STAGE2.WARMUP_METHOD)

    do_train_stage2(, model, ...)

Here is a piece of the logs:

Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([211, 768])
Position embedding resize to height:21 width: 10
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
using triplet loss with margin:0.3
label smooth on, numclasses: 11196
Loading pretrained model from ./std/ViT-B-16_stage1_60.pth
2024-07-02 02:07:05,410 transreid.train INFO: start training
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
2024-07-02 02:07:28,361 transreid.train INFO: Epoch[1] Iteration[50/8579] Loss: 21.576, Acc: 0.000, Base Lr: 9.50e-07
2024-07-02 02:07:44,494 transreid.train INFO: Epoch[1] Iteration[100/8579] Loss: 19.558, Acc: 0.000, Base Lr: 9.50e-07
2024-07-02 02:08:00,583 transreid.train INFO: Epoch[1] Iteration[150/8579] Loss: 18.516, Acc: 0.000, Base Lr: 9.50e-07
2024-07-02 02:08:16,713 transreid.train INFO: Epoch[1] Iteration[200/8579] Loss: 17.855, Acc: 0.000, Base Lr: 9.50e-07
2024-07-02 02:08:32,870 transreid.train INFO: Epoch[1] Iteration[250/8579] Loss: 17.391, Acc: 0.000, Base Lr: 9.50e-07
2024-07-02 02:08:48,930 transreid.train INFO: Epoch[1] Iteration[300/8579] Loss: 17.052, Acc: 0.000, Base Lr: 9.50e-07
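
One pattern that may help here, sketched under the assumption that the stage-1 checkpoint stores a regular state_dict (not tested against this repository): load it non-strictly so only matching keys are copied, and print what was skipped to verify that the learned prompt tokens actually made it in. model stands for the freshly built CLIP-ReID model.

import torch

state = torch.load('./std/ViT-B-16_stage1_60.pth', map_location='cpu')
state = {k.replace('module.', ''): v for k, v in state.items()}

missing, unexpected = model.load_state_dict(state, strict=False)
print('missing keys:', missing)        # parameters the checkpoint did not provide
print('unexpected keys:', unexpected)  # checkpoint entries the model has no slot for
assert not any('prompt_learner' in k for k in missing), 'prompt tokens were not restored'

Also, zero accuracy in the first few hundred iterations of stage 2 is not necessarily a sign of a failed restore; with 11196 classes, a freshly initialized ID classifier can take a while before Acc moves above zero.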

Fine-tune on new small dataset

Hi, I have a new small person re-ID dataset (~100 IDs) and I want to fine-tune your models.
Should I fine-tune both stages, or just fine-tune for a few epochs in stage 2?
Have you tried merging all the re-ID datasets and training on them?
Thank you!

People tracking in videos?

Is it possible to use CLIP-ReID with YOLOv8n for people tracking in videos, or does it only work with datasets? I looked through your code and didn't find anything for working with videos.

Can't load the weights of VehicleID

Hi, I am unable to load the weights of the VehicleID model. Can you help me solve it?

Resized position embedding: %s to %s torch.Size([197, 768]) torch.Size([257, 768])
Position embedding resize to height:16 width: 16
Traceback (most recent call last):
  File "test_clipreid.py", line 42, in <module>
    model.load_param(cfg.TEST.WEIGHT)
  File "/Users/shreejaltrivedi/Documents/Repos/CLIP-ReID/model/make_model_clipreid.py", line 159, in load_param
    self.state_dict()[i.replace('module.', '')].copy_(param_dict[i])
RuntimeError: The size of tensor a (576) must match the size of tensor b (13164) at non-singleton dimension 0

About Visualization

Hello, I have read your paper and I think it is very good.
Could I get code to save the retrieval results from rank 1 to rank 10?
Also, could I get the visualization code used in the paper?
Please help me.
Thank you.


About distributed training

Does this code support distributed training? If so, how should it be run?
