
DIME-FM

Implementation of "DIME-FM: DIstilling Multimodal and Efficient Foundation Models" (ICCV 2023)

Abstract

Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar results in terms of zero-shot and linear-probing performance on both ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.

Links: arXiv / Project Page / Poster / Slides

Please cite our work if you find it helpful for your research.

@article{sun2023dime,
  title={DIME-FM: DIstilling Multimodal and Efficient Foundation Models},
  author={Sun, Ximeng and Zhang, Pengchuan and Zhang, Peizhao and Shah, Hardik and Saenko, Kate and Xia, Xide},
  journal={arXiv preprint arXiv:2303.18232},
  year={2023}
}

Release TODO List

  • Checkpoints
  • Evaluation code
  • Training code (Expected by the end of Oct)

Checkpoints

| Model | Image Training Set | Text Training Set | ZS on IN-1K | ZS on ELEVATER | LP on ELEVATER | Robustness | Download |
|---|---|---|---|---|---|---|---|
| ViT-B/32 | IN-21K + GCC-15M + YFCC-14M | Filtered Roberta NLP Corpus | 66.5% | 56.4% | 79.2% | 50.2% | ckpt |
| ViT-B/32 | IN-21K + GCC-15M + YFCC-14M | IN-21K Prompts + GCC-15M + YFCC-14M + Downstream Tasks' Prompts | 66.1% | 57.7% | 79.4% | - | ckpt |

ZS = zero-shot accuracy; LP = linear-probing accuracy; Robustness is averaged over the five ImageNet distribution-shift datasets.

Evaluation

Our evaluation is based on the ELEVATER benchmark (please refer to the README in the Evaluation folder). We extend the ELEVATER benchmark to include the ImageNet variants as the robustness evaluation. The download links for these datasets can be found HERE.
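For orientation, zero-shot classification in these benchmarks embeds a prompt for each class name with the text encoder and assigns an image to the class whose embedding is most similar to the image embedding. Below is a minimal sketch using the openai/CLIP package (installed in the Environment section below). Loading the distilled checkpoint into CLIP's ViT-B/32 architecture, and the checkpoint filename, are assumptions for illustration, not this repo's documented interface.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Start from CLIP's ViT-B/32 architecture; overwriting its weights with the
# distilled checkpoint (hypothetical filename) is an assumption about the format.
model, preprocess = clip.load("ViT-B/32", device=device)
# state_dict = torch.load("distill_vit_b32.pt", map_location=device)
# model.load_state_dict(state_dict)

class_names = ["dog", "cat", "car"]  # replace with the dataset's label names
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # Normalize so the dot product is cosine similarity.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T

pred = logits.argmax(dim=-1).item()
print(class_names[pred])
```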

Training

We provide the training code in the Training folder.

Environment

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install timm 
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install transformers
pip install yacs

Prepare the Data

For faster dataloading, we pack all data into TSV files following this repo. We extract the image and text features using the CLIP-ViT-L/14 model. All features are also stored in TSV form.
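As a reference, here is a minimal sketch of teacher-feature extraction with the openai/CLIP package installed above; the exact preprocessing, batching, and TSV serialization used by this repo are not specified here, so treat it as illustrative.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Teacher model: CLIP-ViT-L/14, as stated above.
model, preprocess = clip.load("ViT-L/14", device=device)

with torch.no_grad():
    # Image feature (768-d for ViT-L/14).
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)

    # Text feature for an unpaired sentence.
    tokens = clip.tokenize(["a sentence from the NLP corpus"]).to(device)
    text_feat = model.encode_text(tokens)

print(image_feat.shape, text_feat.shape)  # torch.Size([1, 768]) torch.Size([1, 768])
```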

We provide examples of the images, text, image features, and text features:

| Data | Download Link |
|---|---|
| Image | tsv / lineidx |
| Text | tsv / lineidx |
| Image Feature | tsv / lineidx |
| Text Feature | tsv / lineidx |
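The tsv / lineidx pairing follows the common convention in which each line of the .lineidx file stores the byte offset of the corresponding row of the .tsv file, allowing random access without reading the whole file. A minimal reader sketch under that assumption (the base64-encoded feature column is also an assumption about the layout):

```python
import base64
import numpy as np

def read_tsv_row(tsv_path: str, lineidx_path: str, row: int) -> list[str]:
    """Random-access one row of a TSV file via its byte-offset index."""
    with open(lineidx_path) as f:
        offsets = [int(line.strip()) for line in f]
    with open(tsv_path) as f:
        f.seek(offsets[row])           # jump straight to the requested row
        return f.readline().rstrip("\n").split("\t")

# Hypothetical usage: assumes features sit base64-encoded in column 1.
cols = read_tsv_row("image_feat.tsv", "image_feat.lineidx", row=0)
feat = np.frombuffer(base64.b64decode(cols[1]), dtype=np.float32)
print(feat.shape)
```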

Training command

cd Training
python train_amp.py --amp --dataroot <your_dataroot> \
    --tsv_file_list configs/datalists/cc3m/image.list configs/datalists/cc3m/image_feat.list \
    configs/datalists/cc3m/text.list configs/datalists/cc3m/text_feat.list \
    --batch_size <your_batch_size> --use_pvl_loss

Please refer to the Data Parallel Example to extend training to multiple GPUs and nodes, as sketched below.
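For reference, the standard PyTorch pattern from that example is to launch one process per GPU with torchrun and wrap the model in DistributedDataParallel. The sketch below is generic PyTorch, not this repo's train_amp.py interface, which may already handle distribution internally.

```python
# Generic DistributedDataParallel skeleton. Launch with:
#   torchrun --nproc_per_node=<num_gpus> your_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Placeholder module standing in for the student model.
model = torch.nn.Linear(512, 512).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Note: the effective batch size is the per-GPU batch size times the world size.
print("world size:", dist.get_world_size())
dist.destroy_process_group()
```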


Issues

Inquiry about the Reproducibility of Results and Required GPU for Open-Source Code

I am highly interested in your project and am planning to replicate the results mentioned in the associated paper. I am currently reading the relevant research paper and would like to inquire whether the open-source code provided in your project is capable of reproducing the results reported in the paper.

Furthermore, I would appreciate it if you could provide information regarding the recommended GPU specifications necessary to run the code efficiently. Understanding the GPU requirements will help me ensure that I have access to the appropriate hardware resources for successful replication.

Thank you for your attention to this matter. I look forward to your response.

Batch size and number of GPUs

In the paper you mention batch sizes of 8192 and 12288, but the training script uses a batch size of 640 with num_workers = 8, which would give 640 × 8 = 5120.

How large is the actual batch size, and how many GPUs were used to train the models from the paper?
