lightdxy / ft-clip
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
Hello! First of all, thank you for releasing the fine-tuning code. However, I ran into a problem with the model's performance.
I ran the code and fine-tuned the "CLIP_L14" model on Oxford Pets, Caltech101, and ImageNet with the same fine-tuning config as in the paper, except for the batch size (due to hardware limits, I set it to 32). The model performs badly on the validation set, with accuracy around 1-5%, while the training accuracy is around 90%; this looks like a typical overfitting problem. I changed the learning rate, the regularization settings, the number of epochs, and other related configs, but failed to solve it.
So I wonder whether you ran into the same problem on similar datasets, or whether there is a known way to solve it.
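Not an answer from the authors, but one way to localize such a collapse is to run a linear probe before full fine-tuning: if a frozen encoder plus a linear head already reaches a sane validation accuracy, the features are fine and the problem likely lies in the fine-tuning recipe (e.g., the paper's learning rate being far too high once the batch size drops to 32, since lr is commonly scaled with batch size). A minimal sketch, assuming the OpenAI `clip` package and Oxford Pets (37 classes); none of this is the paper's actual setup:

```python
# Diagnostic linear probe: freeze the CLIP image encoder, train only a head.
# "ViT-L/14", device, lr, and the class count 37 are illustrative assumptions.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)
for p in model.parameters():
    p.requires_grad = False  # keep the encoder frozen

head = torch.nn.Linear(model.visual.output_dim, 37).to(device)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def logits_for(images):
    # only the head receives gradients; the encoder runs without grad
    with torch.no_grad():
        feats = model.encode_image(images).float()
    return head(feats)
```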
Hello, I have read your paper and found it very inspiring.
The abstract of your paper says: "Recent studies have shown that CLIP has achieved remarkable success in performing zero-shot inference while its fine-tuning performance is not satisfactory... These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning..."
That is, although CLIP works very well for zero-shot inference, especially classification, it is generally considered unsuitable for fine-tuning. I have used CLIP myself, both the Chinese and the English versions, to match custom token words against images, and it indeed worked well; the features CLIP learns feel very robust. However, I have never fine-tuned CLIP on my own dataset. Intuitively, if the features are good, fine-tuning should make them even better: the activations the network uses are continuous nonlinear functions, with no discontinuities like a step function, so after fine-tuning the learned features should be expressed even better, and the class predictions after the activations should not go wrong. This is only my intuition, without any evidence, so I was puzzled when I read the sentence above. I would appreciate your explanation. Thank you very much!
Hello, Table 11 in the paper reports the following FLOPs:

Model  B/16_224  B/16_384  L/16_384  L/14_224  L/14_336
FLOPs  17.5G     55.4G     190.7G    80.7G     190.6G

Using thop's profile, I measured the FLOPs of the first model, B/16_224, as only 11.3G, far below the 17.5G in the paper. I suspect my measurement method differs from yours, since profile indeed skips some modules by default.
So could you please share the implementation details of how you measured the FLOPs? Many thanks.
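Not the authors, but a common source of exactly this gap: thop's default `profile()` only hooks registered `nn.Module` types, so the functional `torch.matmul`/`einsum` calls inside self-attention (the q·kᵀ and attn·v products) go uncounted, while counters that trace aten ops include them. A minimal sketch using fvcore; the timm model name is an assumption standing in for the paper's B/16_224, not necessarily what the authors used:

```python
# Compare an aten-level FLOPs counter against thop's module-level one.
import torch
import timm
from fvcore.nn import FlopCountAnalysis

model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)

# fvcore traces aten ops, so the functional matmuls inside self-attention
# are counted; thop's default profile() silently skips them.
flops = FlopCountAnalysis(model, x)
print(f"total: {flops.total() / 1e9:.1f} G (multiply-adds)")
```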
Thanks for sharing your nice work!
I don't have sufficient computational resources to train the models. May I know if the pre-trained (fine-tuned) weights will be released?
A question about the lr decay setup in the code: following it, the lr scale of the last transformer block is not 1. Is this intentional, or something else? Any guidance is appreciated.
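If the code follows the common BEiT-style layer-wise lr decay (an assumption about this repo, not a confirmed fact), that behavior is expected: the classifier head sits one "layer" above the last block and gets scale 1.0, so the last transformer block itself gets decay¹. A minimal sketch of that scheme; the depth and decay values are illustrative, not this repo's defaults:

```python
# BEiT-style layer-wise lr decay; depth=12 and decay=0.65 are assumptions.
depth, decay = 12, 0.65
# index 0: patch embedding; 1..depth: transformer blocks; depth+1: head
scales = [decay ** (depth + 1 - i) for i in range(depth + 2)]
print(scales[depth])      # last block -> decay ** 1 = 0.65, not 1.0
print(scales[depth + 1])  # classifier head -> decay ** 0 = 1.0
```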
@LightDXY
Firstly, thank you for sharing the fine-tuning code. On my own dataset I fine-tuned only the image encoder, not the text encoder, using ViT-B/16 as the pre-trained weights. After fine-tuning, however, the .pt file grew about 5x in size. Also, how should the .pt model produced by fine-tuning be used for inference? Looking forward to your guidance, thank you.
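Not the authors, but the 5x growth is usually because a training checkpoint stores the AdamW optimizer state (two moment tensors per parameter) and sometimes an EMA copy alongside the weights. A minimal sketch for stripping it down to weights only for inference; the "model"/"module" key names are common checkpoint conventions and may differ in this repo:

```python
# Extract only the model weights from a training checkpoint (assumed layout).
import torch

ckpt = torch.load("checkpoint.pt", map_location="cpu")
# training checkpoints often nest weights under "model" or "module" and keep
# "optimizer"/"scaler"/"epoch" entries alongside them
state_dict = ckpt.get("model", ckpt.get("module", ckpt))
torch.save(state_dict, "model_weights_only.pt")  # roughly weights-only size

# for inference, rebuild the network and load the stripped weights
# (build_model is hypothetical; use however this repo constructs the ViT)
# net = build_model(); net.load_state_dict(state_dict); net.eval()
```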