
open_clip's Introduction

OpenCLIP

[Paper] [Citations] [Clip Colab] [Coca Colab] pypi

Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

Using this codebase, we have trained several models on a variety of data sources and compute budgets, ranging from small-scale experiments to larger runs including models trained on datasets such as LAION-400M, LAION-2B and DataComp-1B. Many of our models and their scaling properties are studied in detail in the paper Reproducible Scaling Laws for Contrastive Language-Image Learning. Some of our best models and their zero-shot ImageNet-1k accuracy are shown below, along with the ViT-L model trained by OpenAI. We provide more details about our full collection of pretrained models here, and zero-shot results for 38 datasets here.

Model             Training data   Resolution   # of samples seen   ImageNet zero-shot acc.
ConvNext-Base     LAION-2B        256px        13B                 71.5%
ConvNext-Large    LAION-2B        320px        29B                 76.9%
ConvNext-XXLarge  LAION-2B        256px        34B                 79.5%
ViT-B/32          DataComp-1B     256px        34B                 72.8%
ViT-B/16          DataComp-1B     224px        13B                 73.5%
ViT-L/14          LAION-2B        224px        32B                 75.3%
ViT-H/14          LAION-2B        224px        32B                 78.0%
ViT-L/14          DataComp-1B     224px        13B                 79.2%
ViT-G/14          LAION-2B        224px        34B                 80.1%
ViT-L/14          OpenAI's WIT    224px        13B                 75.5%

Model cards with additional model specific details can be found on the Hugging Face Hub under the OpenCLIP library tag: https://huggingface.co/models?library=open_clip.

If you find this repository useful, please consider citing it. We welcome anyone to submit an issue or send an email with any other requests or suggestions.

Note that portions of src/open_clip/ modelling and tokenizer code are adaptations of OpenAI's official repository.

Approach

CLIP
Image Credit: https://github.com/openai/CLIP

Usage

pip install open_clip_torch
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("docs/CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]

See also this [Clip Colab].

To compute billions of embeddings efficiently, you can use clip-retrieval, which has OpenCLIP support.

Pretrained models

We offer a simple model interface to instantiate both pre-trained and untrained models. To see which pretrained models are available, use the following code snippet. More details about our pretrained models are available here.

>>> import open_clip
>>> open_clip.list_pretrained()

You can find more about the models we support (e.g. number of parameters, FLOPs) in this table.
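For example, to see only the tags available for a single architecture, you can filter the (model, tag) pairs returned by list_pretrained(). This is a minimal sketch, with ViT-B-32 as an arbitrary choice:

import open_clip

# list_pretrained() returns (model_name, pretrained_tag) pairs; keep one architecture.
for model_name, pretrained_tag in open_clip.list_pretrained():
    if model_name == 'ViT-B-32':
        print(model_name, pretrained_tag)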

NOTE: Many existing checkpoints use the QuickGELU activation from the original OpenAI models. This activation is actually less efficient than native torch.nn.GELU in recent versions of PyTorch. The model defaults are now nn.GELU, so one should use model definitions with the -quickgelu postfix for OpenCLIP pretrained weights that were trained with QuickGELU. All OpenAI pretrained weights will always default to QuickGELU. One can also use the non -quickgelu model definitions with pretrained weights that used QuickGELU, but there will be an accuracy drop; for fine-tuning, that drop will likely vanish over longer runs. Future trained models will use nn.GELU.
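For example, here is a hedged sketch of the two common pairings (both model names and pretrained tags appear in open_clip.list_pretrained()):

import open_clip

# OpenAI weights were trained with QuickGELU, so pair them with a -quickgelu definition.
model_openai, _, preprocess_openai = open_clip.create_model_and_transforms(
    'ViT-B-32-quickgelu', pretrained='openai')

# Newer OpenCLIP checkpoints such as this LAION-2B tag were trained with nn.GELU,
# so the plain model definition is the matching choice.
model_laion, _, preprocess_laion = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')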

Loading models

Models can be loaded with open_clip.create_model_and_transforms, as shown in the example below. The model name and corresponding pretrained keys are compatible with the outputs of open_clip.list_pretrained().

The pretrained argument also accepts local paths, for example /path/to/my/b32.pt. You can also load checkpoints from the Hugging Face Hub this way. To do so, download the open_clip_pytorch_model.bin file (for example, https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/tree/main), and use pretrained=/path/to/open_clip_pytorch_model.bin.

# pretrained also accepts local paths
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k') 
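For instance, a hedged sketch of loading from a downloaded checkpoint file; the path below is a placeholder for wherever you saved open_clip_pytorch_model.bin, and ViT-L-14 matches the example repository linked above:

# Hypothetical local path; replace with the actual location of the downloaded file.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='/path/to/open_clip_pytorch_model.bin')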

Fine-tuning on classification tasks

This repository is focused on training CLIP models. To fine-tune a trained zero-shot model on a downstream classification task such as ImageNet, please see our other repository: WiSE-FT. The WiSE-FT repository contains code for our paper on Robust Fine-tuning of Zero-shot Models, in which we introduce a technique for fine-tuning zero-shot models while preserving robustness under distribution shift.

Data

To download datasets as webdataset, we recommend img2dataset.

Conceptual Captions

See cc3m img2dataset example.

YFCC and other datasets

In addition to specifying the training data via CSV files as mentioned above, our codebase also supports webdataset, which is recommended for larger scale datasets. The expected format is a series of .tar files. Each of these .tar files should contain two files for each training example, one for the image and one for the corresponding text. Both files should have the same name but different extensions. For instance, shard_001.tar could contain files such as abc.jpg and abc.txt. You can learn more about webdataset at https://github.com/webdataset/webdataset. We use .tar files with 1,000 data points each, which we create using tarp.
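As an illustration, here is a minimal sketch (not part of the codebase) that writes such a shard with Python's standard tarfile module; the basename, image bytes, and caption are placeholders:

import io
import tarfile

# Placeholder sample: in practice image_bytes would be the raw JPEG bytes read from disk.
samples = [("abc", b"<jpeg bytes>", "a photo of a cat")]

with tarfile.open("shard_001.tar", "w") as tar:
    for name, image_bytes, caption in samples:
        for ext, payload in ((".jpg", image_bytes), (".txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=name + ext)   # same basename ...
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))    # ... two extensions per sample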

You can download the YFCC dataset from Multimedia Commons. Similar to OpenAI, we used a subset of YFCC to reach the aforementioned accuracy numbers. The indices of images in this subset are in OpenAI's CLIP repository.

Training CLIP

Install

We advise you to first create a virtual environment with:

python3 -m venv .env
source .env/bin/activate
pip install -U pip

You can then install openclip for training with pip install 'open_clip_torch[training]'.

Development

If you want to make changes to contribute code, clone the open_clip repository and run make install in the open_clip folder (after creating a virtualenv).

Install PyTorch with pip as per https://pytorch.org/get-started/locally/.

You can run make install-training to install the training dependencies.

Testing

Tests can be run with make install-test, then make test.

Run python -m pytest -x -s -v tests -k "training" to run a specific test.

Running regression tests against a specific git revision or tag:

  1. Generate testing data

    python tests/util_test.py --model RN50 RN101 --save_model_list models.txt --git_revision 9d31b2ec4df6d8228f370ff20c8267ec6ba39383

    WARNING: This will invoke git and modify your working tree, but will reset it to the current state after data has been generated!
    Don't modify your working tree while test data is being generated this way.

  2. Run regression tests

    OPEN_CLIP_TEST_REG_MODELS=models.txt python -m pytest -x -s -v -m regression_test

Sample single-process running code:

python -m training.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Note: imagenet-val is the path to the validation set of ImageNet for zero-shot evaluation, not the training set! You can remove this argument if you do not want to perform zero-shot evaluation on ImageNet throughout training. Note that the val folder should contain subfolders. If it does not, please use this script.

Multi-GPU and Beyond

This code has been battle tested up to 1024 A100s and offers a variety of solutions for distributed training. We include native support for SLURM clusters.

As the number of devices used to train increases, so does the space complexity of the logit matrix. Using a naïve all-gather scheme, space complexity is O(n^2). Instead, complexity becomes effectively linear if the flags --gather-with-grad and --local-loss are used. This alteration yields numerical results identical to the naïve method.
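To make the O(n^2) point concrete, here is a back-of-the-envelope sketch; the per-GPU batch size and GPU count are illustrative assumptions, not recommended settings:

# Memory for the similarity logits alone, in fp32 (4 bytes per entry).
per_gpu_batch = 256
num_gpus = 128
global_batch = per_gpu_batch * num_gpus                       # 32768

full_matrix_gib = global_batch * global_batch * 4 / 1024**3   # naive all-gather: n x n on every GPU
local_loss_gib = per_gpu_batch * global_batch * 4 / 1024**3   # --local-loss: only the local rows, m x n

print(f"naive: {full_matrix_gib:.1f} GiB, local-loss: {local_loss_gib:.2f} GiB per GPU")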

Epochs

For larger datasets (e.g. LAION-2B), we recommend setting --train-num-samples to a value lower than the full epoch, for example --train-num-samples 135646078, which is 1/16 of an epoch, in conjunction with --dataset-resampled to do sampling with replacement. This allows more frequent checkpointing and evaluation.

Patch Dropout

Recent research has shown that one can drop out half to three-quarters of the visual tokens, leading to 2-3x faster training without loss of accuracy.

You can set this on your visual transformer config with the key patch_dropout.

In the paper, they also fine-tuned without patch dropout at the end. You can do this with the command-line argument --force-patch-dropout 0.

Multiple data sources

OpenCLIP supports using multiple data sources, by separating different data paths with ::. For instance, to train on CC12M and on LAION, one might use --train-data "/data/cc12m/cc12m-train-{0000..2175}.tar::/data/LAION-400M/{00000..41455}.tar". Using --dataset-resampled is recommended for these cases.

By default, the expected number of times the model sees a sample from each source is proportional to the size of that source. For instance, when training on one data source with 400M samples and one with 10M samples, samples from the first source are 40x more likely to be seen in expectation.

We also support different weighting of the data sources, by using the --train-data-upsampling-factors flag. For instance, using --train-data-upsampling-factors=1::1 in the above scenario is equivalent to not using the flag, and --train-data-upsampling-factors=1::2 is equivalent to upsampling the second data source twice. If you want to sample from data sources with the same frequency, the upsampling factors should be inversely proportional to the sizes of the data sources. For instance, if dataset A has 1000 samples and dataset B has 100 samples, you can use --train-data-upsampling-factors=0.001::0.01 (or analogously, --train-data-upsampling-factors=1::10).
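A small sketch of the expected sampling proportions implied by this rule, using the 1000/100 example above:

# Expected fraction drawn from each source is proportional to size * upsampling_factor.
sizes = {"A": 1000, "B": 100}
factors = {"A": 1.0, "B": 10.0}   # --train-data-upsampling-factors=1::10

weights = {k: sizes[k] * factors[k] for k in sizes}
total = sum(weights.values())
for k, w in weights.items():
    print(k, w / total)   # prints 0.5 for both, i.e. equal sampling frequency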

Single-Node

We make use of torchrun to launch distributed jobs. The following launches a job on a node with 4 GPUs:

cd open_clip/src
torchrun --nproc_per_node 4 -m training.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/

Multi-Node

The same script above works, so long as users include information about the number of nodes and host node.

cd open_clip/src
torchrun --nproc_per_node=4 \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    -m training.main \
    --train-data '/data/cc12m/cc12m-train-{0000..2175}.tar' \
    --train-num-samples 10968539 \
    --dataset-type webdataset \
    --batch-size 320 \
    --precision amp \
    --workers 4 \
    --imagenet-val /data/imagenet/validation/

SLURM

This is likely the easiest solution to utilize. The following script was used to train our largest models:

#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip
#SBATCH --account=ACCOUNT_NAME
#SBATCH --partition PARTITION_NAME

eval "$(/path/to/conda/bin/conda shell.bash hook)" # init conda
conda activate open_clip
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

cd /shared/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"
srun --cpu_bind=v --accel-bind=gn python -u src/training/main.py \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data="/data/LAION-400M/{00000..41455}.tar" \
    --warmup 2000 \
    --batch-size=256 \
    --epochs=32 \
    --workers=8 \
    --model ViT-B-32 \
    --name "ViT-B-32-Vanilla" \
    --seed 0 \
    --local-loss \
    --gather-with-grad

Resuming from a checkpoint:

python -m training.main \
    --train-data="/path/to/train_data.csv" \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

Training CoCa:

Training CoCa models is enabled through specifying a CoCa config using the --model parameter of the training script. Currently available configs are "coca_base", "coca_ViT-B-32", and "coca_roberta-ViT-B-32" (which uses RoBERTa as the text encoder). CoCa configs are different from CLIP configs because they have an additional "multimodal_cfg" component which specifies parameters for the multimodal text decoder. Here's an example from the coca_ViT-B-32 config:

"multimodal_cfg": {
	"context_length": 76,
	"vocab_size": 49408,
	"width": 512,
	"heads": 8,
	"layers": 12,
	"latent_dim": 512,
	"attn_pooler_heads": 8
}

Credit to lucidrains for initial code, gpucce for adapting the code to open_clip, and iejMac for training the models.

Generating text with CoCa

import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
  model_name="coca_ViT-L-14",
  pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)

im = Image.open("cat.jpg").convert("RGB")
im = transform(im).unsqueeze(0)

with torch.no_grad(), torch.cuda.amp.autocast():
  generated = model.generate(im)

print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))

See also this [Coca Colab].

Fine Tuning CoCa

To fine-tune CoCa on MS COCO, first create the dataset. One way is to use a CSV dataset, and perhaps the simplest way to build it is with CLIP_benchmark, which in turn uses pycocotools (which can also be used on its own).

from clip_benchmark.datasets.builder import build_dataset
import pandas as pd
import os

root_path = "path/to/data/dir" # set this to smth meaningful
ds = build_dataset("mscoco_captions", root=root_path, split="train") # this downloads the dataset if it is not there already
coco = ds.coco
imgs = coco.loadImgs(coco.getImgIds())
future_df = {"filepath":[], "title":[]}
for img in imgs:
    caps = coco.imgToAnns[img["id"]]
    for cap in caps:
        future_df["filepath"].append(img["file_name"])
        future_df["title"].append(cap["caption"])
pd.DataFrame.from_dict(future_df).to_csv(
  os.path.join(root_path, "train2014.csv"), index=False, sep="\t"
)

This should create a CSV dataset that one can use to fine-tune CoCa with open_clip:

python -m training.main \
    --dataset-type "csv" \
    --train-data "path/to/data/dir/train2014.csv" \
    --warmup 1000 \
    --batch-size 128 \
    --lr 1e-5 \
    --wd 0.1 \
    --epochs 1 \
    --workers 3 \
    --model "coca_ViT-L-14" \
    --report-to "wandb" \
    --coca-contrastive-loss-weight 0 \
    --coca-caption-loss-weight 1 \
    --log-every-n-steps 100

This is a general setting; open_clip has many parameters that can be set (python -m training.main --help shows them all). The only relevant changes compared to pre-training are the two arguments

--coca-contrastive-loss-weight 0
--coca-caption-loss-weight 1

which make the model only train the generative side.

Training with pre-trained language models as text encoder:

If you wish to use different language models as the text encoder for CLIP, you can do so by using one of the Hugging Face model configs in src/open_clip/model_configs, passing the config name and the corresponding tokenizer as the --model and --hf-tokenizer-name parameters respectively. Currently we only support RoBERTa (the "test-roberta" config), but adding new models should be trivial. You can also determine how many layers, counted from the end, to leave unfrozen with the --lock-text-unlocked-layers parameter. Here's an example command to train CLIP with a RoBERTa LM that has its last 10 layers unfrozen:

python -m training.main \
         --train-data="pipe:aws s3 cp s3://s-mas/cc3m/{00000..00329}.tar -" \
         --train-num-samples 3000000 \
         --val-data="pipe:aws s3 cp s3://s-mas/cc3m/{00330..00331}.tar -" \
         --val-num-samples 10000 \
         --dataset-type webdataset \
         --batch-size 256 \
         --warmup 2000 \
         --epochs 10 \
         --lr 5e-4 \
         --precision amp \
         --workers 6 \
         --model "roberta-ViT-B-32" \
         --lock-text \
         --lock-text-unlocked-layers 10 \
         --name "10_unfrozen" \
         --report-to "tensorboard" \

Loss Curves

When run on a machine with 8 GPUs the command should produce the following training curve for Conceptual Captions:

CLIP zero shot training curve

More detailed curves for Conceptual Captions are given at /docs/clip_conceptual_captions.md.

When training a RN50 on YFCC the same hyperparameters as above are used, with the exception of lr=5e-4 and epochs=32.

Note that to use another model, like ViT-B/32 or RN50x4 or RN50x16 or ViT-B/16, specify with --model RN50x4.

Logging

For tensorboard logging, run:

tensorboard --logdir=logs/tensorboard/ --port=7777

For wandb logging, we recommend looking at the step variable instead of Step, since the latter was not properly set in earlier versions of this codebase. For older runs with models trained before #613, the Step variable should be ignored. For newer runs, after that PR, the two variables are the same.

Evaluation / Zero-Shot

We recommend https://github.com/LAION-AI/CLIP_benchmark#how-to-use for systematic evaluation on 40 datasets.

Evaluating local checkpoint:

python -m training.main \
    --val-data="/path/to/validation_data.csv"  \
    --model RN101 \
    --pretrained /path/to/checkpoints/epoch_K.pt

Evaluating hosted pretrained checkpoint on ImageNet zero-shot prediction:

python -m training.main \
    --imagenet-val /path/to/imagenet/validation \
    --model ViT-B-32-quickgelu \
    --pretrained laion400m_e32

Model distillation

You can distill from a pre-trained model by using --distill-model and --distill-pretrained to specify the model you'd like to distill from. For instance, to distill from OpenAI ViT-L/14 use --distill-model ViT-L-14 --distill-pretrained openai.

Gradient accumulation

To simulate larger batches, use --accum-freq k. If the per-GPU batch size --batch-size is m, then the effective batch size will be k * m * num_gpus.

When increasing --accum-freq from its default of 1, samples/s will remain approximately constant (batch size will double, as will time-per-batch). It is recommended to use other features to reduce batch size such as --grad-checkpointing --local-loss --gather-with-grad before increasing --accum-freq. --accum-freq can be used in addition to these features.

Instead of 1 forward pass per example, there are now 2 forward passes per-example. However, the first is done with torch.no_grad.

There is some additional GPU memory required --- the features and data from all k micro-batches are stored in memory.

There are also k loss computations instead of the usual 1.

For more information see Cui et al. (https://arxiv.org/abs/2112.09331) or Pham et al. (https://arxiv.org/abs/2111.10050).
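The mechanism can be summarized with a simplified sketch; this is an illustration of the idea rather than the actual open_clip training loop (which lives in src/training/train.py), and model, optimizer, and a contrastive loss_fn over the two feature sets are assumed to be provided by the caller:

import torch

def accum_step(model, loss_fn, optimizer, micro_batches):
    """Sketch of --accum-freq: micro_batches is a list of k (images, texts) batches."""
    cached_img, cached_txt = [], []
    # First pass: no_grad forward over all k micro-batches, caching features.
    with torch.no_grad():
        for images, texts in micro_batches:
            cached_img.append(model.encode_image(images))
            cached_txt.append(model.encode_text(texts))

    optimizer.zero_grad()
    # Second pass: recompute each micro-batch with grad, splice its fresh features
    # into the cached global feature set, then compute one loss per micro-batch.
    for i, (images, texts) in enumerate(micro_batches):
        img_feat = model.encode_image(images)
        txt_feat = model.encode_text(texts)
        all_img = torch.cat(cached_img[:i] + [img_feat] + cached_img[i + 1:])
        all_txt = torch.cat(cached_txt[:i] + [txt_feat] + cached_txt[i + 1:])
        loss = loss_fn(all_img, all_txt)
        loss.backward()
    optimizer.step()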

Int8 Support

We have beta support for int8 training and inference. You can enable int8 training with --use-bnb-linear SwitchBackLinearGlobal or --use-bnb-linear SwitchBackLinearGlobalMemEfficient. Please see the bitsandbytes library for definitions of these layers. For CLIP ViT-Huge this should currently correspond to a 10% training speedup with no accuracy loss. More speedups are coming once the attention layer is refactored so that linear layers can be replaced there, too.

See the tutorial https://github.com/mlfoundations/open_clip/blob/main/tutorials/int8_tutorial.ipynb or paper.

Support for remote loading/training

It is always possible to resume directly from a remote file, e.g., a file in an s3 bucket. Just set --resume s3://<path-to-checkpoint> . This will work with any filesystem supported by fsspec.

It is also possible to train open_clip models while continuously backing up to s3. This can help to avoid slow local file systems.

Say that your node has a local SSD at /scratch and an s3 bucket at s3://<path-to-bucket>.

In that case, set --logs /scratch and --remote-sync s3://<path-to-bucket>. Then, a background process will sync /scratch/<run-name> to s3://<path-to-bucket>/<run-name>. After syncing, the background process will sleep for --remote-sync-frequency seconds, which defaults to 5 minutes.

There is also experimental support for syncing to other remote file systems, not just s3. To do so, specify --remote-sync-protocol fsspec. However, this is currently very slow and not recommended.

Also, to optionally avoid saving too many checkpoints locally when using these features, you can use --delete-previous-checkpoint which deletes the previous checkpoint after saving a new one.

Note: if you are using this feature with --resume latest, there are a few caveats. First, use with --save-most-recent is not supported. Second, only s3 is supported. Finally, since the sync happens in the background, it is possible that the most recent checkpoint may not have finished syncing to the remote.

Pushing Models to Hugging Face Hub

The module open_clip.push_to_hf_hub includes helpers for pushing models (weights and config) to the HF Hub.

The tool can be run from the command line, for example: python -m open_clip.push_to_hf_hub --model convnext_large_d_320 --pretrained /train/checkpoints/epoch_12.pt --repo-id laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft

Acknowledgments

We gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

The Team

Current development of this repository is led by Ross Wightman, Romain Beaumont, Cade Gordon, and Vaishaal Shankar.

The original version of this repository is from a group of researchers at UW, Google, Stanford, Amazon, Columbia, and Berkeley.

Gabriel Ilharco*, Mitchell Wortsman*, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi, Ludwig Schmidt

Special thanks to Jong Wook Kim and Alec Radford for help with reproducing CLIP!

Citing

If you found this repository useful, please consider citing:

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}
@inproceedings{cherti2023reproducible,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}
@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}
@inproceedings{schuhmann2022laionb,
  title={{LAION}-5B: An open large-scale dataset for training next generation image-text models},
  author={Christoph Schuhmann and
          Romain Beaumont and
          Richard Vencu and
          Cade W Gordon and
          Ross Wightman and
          Mehdi Cherti and
          Theo Coombes and
          Aarush Katta and
          Clayton Mullis and
          Mitchell Wortsman and
          Patrick Schramowski and
          Srivatsa R Kundurthy and
          Katherine Crowson and
          Ludwig Schmidt and
          Robert Kaczmarczyk and
          Jenia Jitsev},
  booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2022},
  url={https://openreview.net/forum?id=M3Y74vmsMcY}
}

open_clip's Issues

logit_scale not referenced in get_metrics()/train.py?

Hi all,

thank you very much for providing this repository!

Python reports an unreferenced variable in the following code snippet (from train.py, lines 226-228):

def get_metrics(image_features, text_features):
    metrics = {}
    logits_per_image = (logit_scale * image_features @ text_features.t()).detach().cpu()

And even my IDE (PyCharm) complains about a missing reference.
Am I missing something?

My training parameters are as follows:

Loading model from /home/thetaphipsi/MasterAI/src/open_clip/src/training/model_configs/RN50.json
2021-11-14,15:34:02 | INFO | Rank 0 | Params:
2021-11-14,15:34:02 | INFO | Rank 0 |   C: 3.16
2021-11-14,15:34:02 | INFO | Rank 0 |   aggregate: True
2021-11-14,15:34:02 | INFO | Rank 0 |   batch_size: 32
2021-11-14,15:34:02 | INFO | Rank 0 |   beta1: 0.9
2021-11-14,15:34:02 | INFO | Rank 0 |   beta2: 0.999
2021-11-14,15:34:02 | INFO | Rank 0 |   checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/checkpoints
2021-11-14,15:34:02 | INFO | Rank 0 |   copy_codebase: False
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_caption_key: title
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_img_key: filepath
2021-11-14,15:34:02 | INFO | Rank 0 |   csv_separator: 	
2021-11-14,15:34:02 | INFO | Rank 0 |   dataset_type: auto
2021-11-14,15:34:02 | INFO | Rank 0 |   debug: False
2021-11-14,15:34:02 | INFO | Rank 0 |   dist_backend: nccl
2021-11-14,15:34:02 | INFO | Rank 0 |   dist_url: tcp://127.0.0.1:6100
2021-11-14,15:34:02 | INFO | Rank 0 |   distributed: True
2021-11-14,15:34:02 | INFO | Rank 0 |   dp: False
2021-11-14,15:34:02 | INFO | Rank 0 |   epochs: 30
2021-11-14,15:34:02 | INFO | Rank 0 |   eps: 1e-08
2021-11-14,15:34:02 | INFO | Rank 0 |   gpu: 0
2021-11-14,15:34:02 | INFO | Rank 0 |   imagenet_v2: None
2021-11-14,15:34:02 | INFO | Rank 0 |   imagenet_val: None
2021-11-14,15:34:02 | INFO | Rank 0 |   log_level: 20
2021-11-14,15:34:02 | INFO | Rank 0 |   log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/out.log
2021-11-14,15:34:02 | INFO | Rank 0 |   logs: ./logs/
2021-11-14,15:34:02 | INFO | Rank 0 |   lr: 0.001
2021-11-14,15:34:02 | INFO | Rank 0 |   model: RN50
2021-11-14,15:34:02 | INFO | Rank 0 |   multigpu: None
2021-11-14,15:34:02 | INFO | Rank 0 |   name: lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01
2021-11-14,15:34:02 | INFO | Rank 0 |   ngpus_per_node: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   openai_pretrained: False
2021-11-14,15:34:02 | INFO | Rank 0 |   precision: amp
2021-11-14,15:34:02 | INFO | Rank 0 |   rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 |   regression_frequency: 2
2021-11-14,15:34:02 | INFO | Rank 0 |   report_to: tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 |   resume: None
2021-11-14,15:34:02 | INFO | Rank 0 |   save_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   save_most_recent: False
2021-11-14,15:34:02 | INFO | Rank 0 |   skip_aggregate: False
2021-11-14,15:34:02 | INFO | Rank 0 |   skip_scheduler: False
2021-11-14,15:34:02 | INFO | Rank 0 |   tensorboard: True
2021-11-14,15:34:02 | INFO | Rank 0 |   tensorboard_path: ./logs/lr=0.001_wd=0.1_agg=True_model=RN50_batchsize=32_workers=1_date=2021-11-14-14-34-01/tensorboard
2021-11-14,15:34:02 | INFO | Rank 0 |   train_data: ./data/Train_GCC-training_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 |   use_bn_sync: False
2021-11-14,15:34:02 | INFO | Rank 0 |   val_data: ./data/Validation_GCC-1.1.0-Validation_output.csv
2021-11-14,15:34:02 | INFO | Rank 0 |   wandb: False
2021-11-14,15:34:02 | INFO | Rank 0 |   wandb_notes: 
2021-11-14,15:34:02 | INFO | Rank 0 |   warmup: 40000
2021-11-14,15:34:02 | INFO | Rank 0 |   wd: 0.1
2021-11-14,15:34:02 | INFO | Rank 0 |   workers: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   world_size: 1
2021-11-14,15:34:02 | INFO | Rank 0 |   zeroshot_frequency: 1
2021-11-14,15:34:02 | INFO | Rank 0 | Added key: store_based_barrier_key:1 to store for rank: 0
2021-11-14,15:34:02 | INFO | Rank 0 | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
2021-11-14,15:34:02 | INFO | Rank 0 | Use GPU: 0 for training

I fixed it temporarily by adding a logit_scale param to get_metrics().

Possible args

Is there a straightforward way to find all the args supported by the code?

ModuleNotFoundError: No module named 'torch._C._distributed_rpc'; 'torch._C' is not a package

I get this strange error when attempting "import open_clip". I have tried reinstalling open_clip, as well as various versions of PyTorch. In this instance, I am using Python 3.7.9 and PyTorch 1.9.0.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\__init__.py", line 2, in <module>
    from .loss import ClipLoss
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\open_clip_torch-1.0.1-py3.7.egg\open_clip\loss.py", line 2, in <module>
    import torch.distributed.nn
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\__init__.py", line 1, in <module>
    from .api.remote_module import RemoteModule
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\nn\api\remote_module.py", line 22, in <module>
    from torch.distributed.rpc.internal import _internal_rpc_pickler
  File "C:\Users\firewolf\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\rpc\internal.py", line 12, in <module>
    from torch._C._distributed_rpc import _get_current_rpc_agent
ModuleNotFoundError: No module named 'torch._C._distributed_rpc'; 'torch._C' is not a package

Passing --imagenet-val (or --imagenet-v2) without --val crashes unnecessarily

In the current repository, you can evaluate a pretrained model by running

python src/training/main.py \
    --val-data="/path/to/validation_data.csv"  \
    --resume /path/to/checkpoints/epoch_K.pt

However, if you try to do the same thing and just try to get the imagenet-val (or imagenet-v2) accuracy

python src/training/main.py \
    --imagenet-val="/path/to/imagenet/val"  \
    --resume /path/to/checkpoints/epoch_K.pt

then it crashes:

Traceback (most recent call last):
  File "src/training/main.py", line 307, in <module>
    main()
  File "src/training/main.py", line 296, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, log_queue, args))
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ncarlini/open_clip/src/training/main.py", line 189, in main_worker
    evaluate(model, data, start_epoch, args, writer, 0)
  File "/home/ncarlini/open_clip/src/training/train.py", line 159, in evaluate
    dataloader = data['val'].dataloader
KeyError: 'val'

It should be possible to get ImageNet accuracy without using a val dataset.

Overfitting on validation loss okay?

I noticed that the CLIP validation loss curve begins to slope upwards about halfway through training on Conceptual Captions (~ epoch 15) from the figure here, but validation recall continues to increase until the end of training (epoch 30).

Does this mean that when doing contrastive training, early stopping should be based on validation recall rather than validation loss, since the two are not necessarily tied to one another as in standard supervised learning?

error with --openai_pretrained

Hello, I get the error "TypeError: __init__() takes 4 positional arguments but 11 were given" when calling the build_model function:

if args.openai_pretrained:
  model, preprocess_train, preprocess_val = load(
      args.model,
      device=args.device,
      jit=False,
      is_train=True)
  if args.precision == "amp" or args.precision == "fp32":
      model = model.float()

def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict
    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len(
            [k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
        image_size = vision_patch_size * grid_size
    else:
        counts: list = [
            len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
        output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
        vision_patch_size = None
        assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
        image_size = output_width * 32

    embed_dim = state_dict["text_projection"].shape[1]
    context_length = state_dict["positional_embedding"].shape[0]
    vocab_size = state_dict["token_embedding.weight"].shape[0]
    transformer_width = state_dict["ln_final.weight"].shape[0]
    transformer_heads = transformer_width // 64
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))

    model = CLIP(
        embed_dim,
        image_size, vision_layers, vision_width, vision_patch_size,
        context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
    )

    for key in ["input_resolution", "context_length", "vocab_size"]:
        if key in state_dict:
            del state_dict[key]

    convert_weights_to_fp16(model)
    model.load_state_dict(state_dict)
    return model.eval()

The error is caused by the following code.

  model = CLIP(
      embed_dim,
      image_size, vision_layers, vision_width, vision_patch_size,
      context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
  )

I see that you implement the CLIP model's __init__ function with only four arguments. Did I get the wrong version?

class CLIP(nn.Module):
    def __init__(
            self,
            embed_dim: int,
            vision_cfg: CLIPVisionCfg,
            text_cfg: CLIPTextCfg,
    ):

error in training

Hi, I encountered this error during training and I'm not sure what it means:

2022-02-09,21:22:00 | INFO | Rank 0 | Train Epoch: 9 [28800/43670 (66%)]        Loss: 0.493029  Data (t) 0.000  Batch (t) 0.235 LR: 0.000020    logit_scale 2.821
2022-02-09,21:22:24 | INFO | Rank 0 | Train Epoch: 9 [32000/43670 (73%)]        Loss: 0.642597  Data (t) 0.008  Batch (t) 0.274 LR: 0.000012    logit_scale 2.822
2022-02-09,21:22:48 | INFO | Rank 0 | Train Epoch: 9 [35200/43670 (81%)]        Loss: 0.442177  Data (t) 0.002  Batch (t) 0.243 LR: 0.000006    logit_scale 2.822
2022-02-09,21:23:13 | INFO | Rank 0 | Train Epoch: 9 [38400/43670 (88%)]        Loss: 0.435208  Data (t) 0.000  Batch (t) 0.255 LR: 0.000003    logit_scale 2.823
2022-02-09,21:23:37 | INFO | Rank 0 | Train Epoch: 9 [41600/43670 (95%)]        Loss: 0.295687  Data (t) 0.000  Batch (t) 0.240 LR: 0.000000    logit_scale 2.823
2022-02-09,21:24:36 | INFO | Rank 0 | Eval Epoch: 10 image_to_text_mean_rank: 40.2243   image_to_text_median_rank: 22.0000      image_to_text_R@1: 0.0628       image_to_text_R@5: 0.2063       image_to_text_R@10: 0.3273      text_to_image_mean_rank: 44.4849     text_to_image_median_rank: 25.0000      text_to_image_R@1: 0.0477       text_to_image_R@5: 0.1817       text_to_image_R@10: 0.2948      val_loss: 0.3798        epoch: 10.0000  num_elements: 6432.0000
Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Users\nuzuegbunam\Anaconda3\envs\open_clip_3_9\lib\multiprocessing\connection.py", line 317, in _recv_bytes

Does anyone have any idea what this means?

Can CLIP be trained on Windows?

Hi,

Thanks for the tremendous effort!

Is it possible to set up this training code, for fine-tuning CLIP on a custom dataset, on a Windows 10 machine?

`logit_scale` in `CLIP`

Thanks for preparing this repo.
I was wondering how self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07)) was decided. Where does the value np.log(1 / 0.07) come from?
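For context, a quick arithmetic check of that constant (the 0.07 initial temperature is the value used in the CLIP paper; the parameter stores the log of its inverse):

import numpy as np

init = np.log(1 / 0.07)        # value the parameter is initialized to
print(round(init, 3))          # 2.659
print(round(np.exp(init), 2))  # 14.29 == 1 / 0.07, the initial scale applied to the logits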

[bug] Dataloader does not work when num_worker=2

I only changed lines 186-193 of get_wds_dataset because we need audio input. It always gets stuck when I set num_workers=2. https://github.com/mlfoundations/open_clip/blob/main/src/training/data.py#L150. Could you please take a look at this? @rwightman. Thank you! My modified version is as follows:

def preprocess(
    sample,
    audio_ext,
    samplerate,
    mono,
    max_len,
    dtype,
    res_type,
):
    for key, value in sample.items():
        if key == audio_ext:
            audio_data, orig_sr = sf.read(io.BytesIO(value))
            if samplerate is not None:
                audio_data = librosa.resample(
                    audio_data, orig_sr=orig_sr, target_sr=samplerate, res_type=res_type
                )
            if len(audio_data) > max_len:  # random clip if too long
                overflow = len(audio_data) - max_len
                idx = np.random.randint(0, overflow + 1)
                if np.random.rand() > 0.5:
                    audio_data = audio_data[idx : idx + max_len]
                else:
                    audio_data = audio_data[
                        len(audio_data) + 1 - idx - max_len : len(audio_data) + 1 - idx
                    ]
            else:  # padding if too short
                audio_data = np.pad(
                    audio_data,
                    (0, max_len - len(audio_data)),
                    mode="constant",
                    constant_values=0,
                )
            if mono:  # convert to mono
                audio_data = librosa.to_mono(audio_data)
            # sample["data"] = (audio_data, sample[text_ext], sample["__key__"])
            sample[audio_ext] = audio_data
    return sample


# def get_wds_dataset(args, preprocess_img, is_train):
def get_wds_dataset(
    args,
    is_train,
    file_path_type="local",
    audio_ext="flac",
    text_ext="json",
    samplerate=32000,
    mono=True,
    max_len=1000000,
    dtype="float64",
    res_type="kaiser_best",
):
    input_shards = args.train_data if is_train else args.val_data
    assert input_shards is not None

    num_samples, num_shards = get_dataset_size(input_shards)
    if not num_samples:
        if is_train:
            num_samples = args.train_num_samples
            if not num_samples:
                raise RuntimeError(
                    'Currently, number of dataset samples must be specified for training dataset. '
                    'Please specify via `--train-num-samples` if no dataset length info present.')
        else:
            num_samples = args.val_num_samples or 0  # eval will just exhaust the iterator if not specified

    pipeline = [wds.SimpleShardList(input_shards)]
    # at this point we have an iterator over all the shards
    if is_train:
        pipeline.extend([
            wds.detshuffle(bufsize=_SHARD_SHUFFLE_SIZE, initial=_SHARD_SHUFFLE_INITIAL, seed=args.seed),
            wds.split_by_node,
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker at each node
            wds.tarfile_to_samples(handler=log_and_continue),
            wds.shuffle(
                bufsize=_SAMPLE_SHUFFLE_SIZE,
                initial=_SAMPLE_SHUFFLE_INITIAL,
                rng=random.Random(args.seed)),
            #wds.repeatedly,  # FIXME determine if this is beneficial
        ])
    else:
        pipeline.extend([
            wds.split_by_worker,
            # at this point, we have an iterator over the shards assigned to each worker
            wds.tarfile_to_samples(handler=log_and_continue),
        ])
    pipeline.extend([
        wds.map(
            partial(
                preprocess,
                audio_ext=audio_ext,
                samplerate=samplerate,
                mono=mono,
                max_len=max_len,
                dtype=dtype,
                res_type=res_type,
            )
        ),
        wds.to_tuple("flac", "json"),
        wds.batched(args.batch_size, partial=not is_train),
    ])

    dataset = wds.DataPipeline(*pipeline)
    if is_train:
        # roll over and repeat a few samples to get same number of full batches on each node
        global_batch_size = args.batch_size * args.world_size
        num_batches = math.ceil(num_samples / global_batch_size)
        num_workers = max(1, args.workers)
        num_worker_batches = math.ceil(num_batches / num_workers)  # per dataloader worker
        num_batches = num_worker_batches * num_workers
        num_samples = num_batches * global_batch_size
        dataset = dataset.with_epoch(num_worker_batches)  # each worker is iterating over this
    else:
        # last batches are partial, eval is done on single (master) node
        num_batches = math.ceil(num_samples / args.batch_size)

    dataloader = wds.WebLoader(dataset, batch_size=None, shuffle=False, num_workers=args.workers)

    # FIXME not clear which approach is better, with_epoch before vs after dataloader?
    # hoping to resolve via https://github.com/webdataset/webdataset/issues/169
    # if is_train:
    #     # roll over and repeat a few samples to get same number of full batches on each node
    #     global_batch_size = args.batch_size * args.world_size
    #     num_batches = math.ceil(num_samples / global_batch_size)
    #     num_workers = max(1, args.workers)
    #     num_batches = math.ceil(num_batches / num_workers) * num_workers
    #     num_samples = num_batches * global_batch_size
    #     dataloader = dataloader.with_epoch(num_batches)
    # else:
    #     # last batches are partial, eval is done on single (master) node
    #     num_batches = math.ceil(num_samples / args.batch_size)

    # add meta-data to dataloader instance for convenience
    dataloader.num_batches = num_batches
    dataloader.num_samples = num_samples

    return DataInfo(dataloader, None)

Possible to finetune?

Is it possible to finetune from the existing Open AI checkpoints rather than train them from scratch with this codebase?

Not able to run inference in fp16 mode

Thank you in advance for this amazing project :-)

I'm trying to run inference in fp16 mode (like in the original CLIP repo), but I'm failing to achieve it. This is the error message I get:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

And this is the code I'm trying:

import torch
from PIL import Image
import open_clip
import requests

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu',
                                                             pretrained='laion400m_e32',
                                                             precision="fp16",
                                                             device=torch.device("cuda"))

url = "https://raw.githubusercontent.com/mlfoundations/open_clip/main/docs/CLIP.png"
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]

Note that I've also tried with model ViT-B-32 and pretrained openai and it doesn't work either. Am I doing something wrong?
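The error message indicates the inputs are still float32 CPU tensors while the model weights are fp16 CUDA tensors. A minimal sketch of one possible fix, continuing the snippet above (an assumption about the cause, not an official answer):

# Move inputs onto the same device and dtype as the fp16 model before encoding.
device = torch.device("cuda")
image = image.to(device=device, dtype=torch.float16)
text = text.to(device)   # token ids stay integer, only the device changes

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)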

Massive GPU memory usage during evaluation

Machine setup

Google cloud VM
Debian10
16 cores CPU, 60Gb of rams
4 nvidia T4

Error

Traceback (most recent call last):
  File "/opt/conda/envs/open_clip/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/jupyter/open_clip/src/training/main.py", line 192, in main_worker
    evaluate(model, data, 0, args, writer, 0)
  File "/home/jupyter/open_clip/src/training/train.py", line 197, in evaluate
    torch.cat(all_image_features), torch.cat(all_text_features)
  File "/home/jupyter/open_clip/src/training/train.py", line 228, in get_metrics
    logits_per_image = image_features @ text_features.t()
RuntimeError: CUDA out of memory. Tried to allocate 2269.88 GiB (GPU 0; 14.76 GiB total capacity; 7.11 GiB already allocated; 6.67 GiB free; 7.17 GiB reserved in total by PyTorch)

The script I use :

python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 3 \
    --train-data "src/df_openclip_train.csv"  \
    --val-data "src/df_openclip_val.csv"  \
    --openai-pretrained \
    --csv-separator "," \
    --csv-img-key image_path \
    --csv-caption-key product_name \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=4 \
    --model ViT-B/32

Full setting

2021-09-01,05:15:43 | INFO | Rank 0 | Params:
2021-09-01,05:15:43 | INFO | Rank 0 |   C: 3.16
2021-09-01,05:15:43 | INFO | Rank 0 |   aggregate: True
2021-09-01,05:15:43 | INFO | Rank 0 |   batch_size: 128
2021-09-01,05:15:43 | INFO | Rank 0 |   beta1: 0.9
2021-09-01,05:15:43 | INFO | Rank 0 |   beta2: 0.98
2021-09-01,05:15:43 | INFO | Rank 0 |   checkpoint_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/checkpoints
2021-09-01,05:15:43 | INFO | Rank 0 |   copy_codebase: False
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_caption_key: product_name
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_img_key: image_path
2021-09-01,05:15:43 | INFO | Rank 0 |   csv_separator: ,
2021-09-01,05:15:43 | INFO | Rank 0 |   dataset_type: auto
2021-09-01,05:15:43 | INFO | Rank 0 |   debug: False
2021-09-01,05:15:43 | INFO | Rank 0 |   dist_backend: nccl
2021-09-01,05:15:43 | INFO | Rank 0 |   dist_url: tcp://127.0.0.1:6100
2021-09-01,05:15:43 | INFO | Rank 0 |   distributed: True
2021-09-01,05:15:43 | INFO | Rank 0 |   dp: False
2021-09-01,05:15:43 | INFO | Rank 0 |   epochs: 30
2021-09-01,05:15:43 | INFO | Rank 0 |   eps: 1e-06
2021-09-01,05:15:43 | INFO | Rank 0 |   gpu: 0
2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_v2: None
2021-09-01,05:15:43 | INFO | Rank 0 |   imagenet_val: None
2021-09-01,05:15:43 | INFO | Rank 0 |   log_level: 20
2021-09-01,05:15:43 | INFO | Rank 0 |   log_path: ./logs/lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41/out.log
2021-09-01,05:15:43 | INFO | Rank 0 |   logs: ./logs/
2021-09-01,05:15:43 | INFO | Rank 0 |   lr: 0.001
2021-09-01,05:15:43 | INFO | Rank 0 |   model: ViT-B/32
2021-09-01,05:15:43 | INFO | Rank 0 |   multigpu: None
2021-09-01,05:15:43 | INFO | Rank 0 |   name: lr=0.001_wd=0.1_agg=True_model=ViT-B/32_batchsize=128_workers=4_date=2021-09-01-05-15-41
2021-09-01,05:15:43 | INFO | Rank 0 |   ngpus_per_node: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   openai_pretrained: True
2021-09-01,05:15:43 | INFO | Rank 0 |   precision: amp
2021-09-01,05:15:43 | INFO | Rank 0 |   rank: 0
2021-09-01,05:15:43 | INFO | Rank 0 |   regression_frequency: 2
2021-09-01,05:15:43 | INFO | Rank 0 |   report_to: 
2021-09-01,05:15:43 | INFO | Rank 0 |   resume: None
2021-09-01,05:15:43 | INFO | Rank 0 |   save_frequency: 1
2021-09-01,05:15:43 | INFO | Rank 0 |   skip_aggregate: False
2021-09-01,05:15:43 | INFO | Rank 0 |   skip_scheduler: False
2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard: False
2021-09-01,05:15:43 | INFO | Rank 0 |   tensorboard_path: 
2021-09-01,05:15:43 | INFO | Rank 0 |   train_data: src/df_openclip_train.csv
2021-09-01,05:15:43 | INFO | Rank 0 |   use_bn_sync: False
2021-09-01,05:15:43 | INFO | Rank 0 |   val_data: src/df_openclip_val.csv
2021-09-01,05:15:43 | INFO | Rank 0 |   wandb: False
2021-09-01,05:15:43 | INFO | Rank 0 |   wandb_notes: 
2021-09-01,05:15:43 | INFO | Rank 0 |   warmup: 10000
2021-09-01,05:15:43 | INFO | Rank 0 |   wd: 0.1
2021-09-01,05:15:43 | INFO | Rank 0 |   workers: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   world_size: 4
2021-09-01,05:15:43 | INFO | Rank 0 |   zeroshot_frequency: 3
2021-09-01,05:15:47 | INFO | Rank 0 | Use GPU: 0 for training
2021-09-01,05:15:47 | INFO | Rank 1 | Use GPU: 1 for training
2021-09-01,05:15:47 | INFO | Rank 2 | Use GPU: 2 for training
2021-09-01,05:15:47 | INFO | Rank 3 | Use GPU: 3 for training

Info about the data :

Training data consist of 2.9 million pairs of text-image
Validation data consist of 780k pairs of text-image

Potential cause of the error

The get_metrics function is called on the whole evaluation data embedding at once, which is massive. In my case, the matrix multiplication involves two matrices of size 780k x 512, which requires about 2000 GB of GPU memory.
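As a sanity check on those numbers, a quick sketch using the reported validation set size:

n_val = 780_000          # validation pairs reported above
bytes_per_float = 4      # fp32

# get_metrics builds an n_val x n_val logits matrix in one shot.
logits_gib = n_val * n_val * bytes_per_float / 1024**3
print(f"{logits_gib:.1f} GiB")   # ~2266 GiB, in line with the CUDA OOM message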

Error when training in DataParallel model.

Hi, after updating to your most recent code, I got an error when training on a single machine (8 GPUs) in DataParallel mode. I simply changed the flag args.dp = True and got the following error message:

miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:64: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '

2022-03-18,06:20:51 | INFO | Start epoch 0
Traceback (most recent call last):
File "CLIP_model/training/main.py", line 304, in
main()
File "CLIP_model/training/main.py", line 243, in main
train_one_epoch(model, data, epoch, optimizer, scaler, scheduler, args, writer)
File "CLIP_model/training/train.py", line 149, in train_one_epoch
total_loss = loss(image_features, text_features, logit_scale)
File "miniconda3/envs/env37_amp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "CLIP_model/training/train.py", line 97, in forward
logits_per_image = logit_scale * image_features @ text_features.T
RuntimeError: The size of tensor a (8) must match the size of tensor b (1024) at non-singleton dimension 1

The code works fine when setting args.dp = False and training on a single GPU.
Thanks!

Captions of YFCC dirty

Hello there,

Not really an issue, but something I am interested in: how did you clean the captions of YFCC?
I did the steps explained in the closed issue, but there are still a lot of captions with URLs, camera names and settings, dates, and so on. Compared to CC (where the captions are really clean) it looks really bad. Still, you get a big jump in performance on ImageNet, so before I start training I would like to know whether you cleaned the data.

If so, I would be very happy if you could provide the code or some snippets :)

Best regards

Performance of VIT-B/32 is worse than RN50 on YFCC15M

We are trying to re-implement CLIP ViT-B/32 pre-trained on YFCC15M provided by OpenAI. But our result is lower than RN50 reported by the paper and your repo (still under training, but almost finished, current ImageNet zero-shot accuracy is around 27% - 28%). So we wonder if you have tried to train a ViT-B/32 on YFCC? Do you have the same finding? Thanks.

Conceptual Captions Faster R-CNN features

Hi,
A sincere request.
Since it is very time-consuming, could you kindly provide the extracted Faster R-CNN features for the Conceptual Captions dataset via Drive or Dropbox?
Thanks :)

corrupted weights for `RN50--yfcc15m` and `RN50-quickgelu--yfcc15m`

[/usr/local/lib/python3.7/dist-packages/mmc/loaders/mlfcliploader.py](https://localhost:8080/#) in load(self, device)
     43         model, _, preprocess_image = open_clip.create_model_and_transforms(
     44             model_name=model_name,
---> 45             pretrained=dataset)
     46 
     47         model.requires_grad_(False)

[/usr/local/lib/python3.7/dist-packages/open_clip/factory.py](https://localhost:8080/#) in create_model_and_transforms(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
    134         model_name, pretrained, precision, device, jit,
    135         force_quick_gelu=force_quick_gelu,
--> 136         pretrained_image=pretrained_image)
    137     preprocess_train = image_transform(model.visual.image_size, is_train=True)
    138     preprocess_val = image_transform(model.visual.image_size, is_train=False)

[/usr/local/lib/python3.7/dist-packages/open_clip/factory.py](https://localhost:8080/#) in create_model(model_name, pretrained, precision, device, jit, force_quick_gelu, pretrained_image)
    106             if checkpoint_path:
    107                 logging.info(f'Loading pretrained {model_name} weights ({pretrained}).')
--> 108                 model.load_state_dict(load_state_dict(checkpoint_path))
    109             else:
    110                 logging.warning(f'Pretrained weights ({pretrained}) not found for model {model_name}.')

/usr/local/lib/python3.7/dist-packages/open_clip/factory.py in load_state_dict(checkpoint_path, map_location)
     48 
     49 def load_state_dict(checkpoint_path: str, map_location='cpu'):
---> 50     checkpoint = torch.load(checkpoint_path, map_location=map_location)
     51     if isinstance(checkpoint, dict) and 'state_dict' in checkpoint:
     52         state_dict = checkpoint['state_dict']

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    711                     return torch.jit.load(opened_file)
    712                 return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
--> 713         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    714 
    715 

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
    938         typed_storage._storage._set_from_file(
    939             f, offset, f_should_read_directly,
--> 940             torch._utils._element_size(typed_storage.dtype))
    941         if offset is not None:
    942             offset = f.tell()

RuntimeError: unexpected EOF, expected 832488 more bytes. The file might be corrupted.
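
In case it is useful: an "unexpected EOF" from torch.load usually means the checkpoint download was interrupted, so deleting the partially downloaded file and letting open_clip fetch it again is worth trying before anything else. A rough sketch, assuming the default cache location (which may differ across open_clip versions):

```python
import glob
import os

import open_clip

cache_dir = os.path.expanduser("~/.cache/clip")  # assumed default download dir; check your version
for path in glob.glob(os.path.join(cache_dir, "*yfcc15m*")):
    print("removing possibly truncated checkpoint:", path)
    os.remove(path)

# re-download into a now-empty cache
model, _, preprocess = open_clip.create_model_and_transforms("RN50-quickgelu", pretrained="yfcc15m")
```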

error when running training/main.py

I am getting a ModuleNotFoundError for training when running src/training/main.py. It points to line 19 in main.py, the import statement.

Edit: Fixed it. I had forgotten to set PYTHONPATH.

CLIP training in Jax.

It would be nice if we could add a jax_src folder that supports training CLIP models in JAX.

This would also help with #20.

Usage of title and/or description column in YFCC100M

Hello,

In your training of CLIP, did you use only the description column as text input, or both the title and description columns?

The reason I am asking is that in the GitHub folder where OpenAI provides info on their YFCC100M subset, there is a sentence I find quite ambiguous:

[...] which have been filtered to only keep those with natural language titles and/or descriptions in English

This seems to imply that it was enough for either the title or the description to be considered natural language for an observation (image) to be kept in the subset. However, it is not clear whether they then used that same filter to decide between title and description when only one of them qualified, or whether they simply concatenated both columns and used everything in training.

Anyway, what I'm interested in knowing here is what you guys decided to do in your training. Did you use both columns or just the description?

Also, did you clean the text in any manner (e.g. remove HTML tags present in the text)?
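
Purely to illustrate what I mean by the two options above (this is not code from this repo or from OpenAI), the policies would look something like:

```python
def caption_from_fields(title: str, description: str, concatenate: bool = True) -> str:
    """Either concatenate both fields or fall back to whichever one is non-empty."""
    title, description = title.strip(), description.strip()
    if concatenate:
        return " ".join(p for p in (title, description) if p)
    return description or title  # prefer the description, fall back to the title

print(caption_from_fields("A dog on the beach", ""))                      # -> "A dog on the beach"
print(caption_from_fields("IMG_1234", "Golden retriever playing fetch"))  # both fields kept
```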

training perf for single GPU is not good

Hi, I was training CLIP on a single GPU. After profiling, I noticed that training throughput was poor, as shown in the profiler screenshot below: GPU idle time is almost twice the GPU active time because the CPU is blocked in sem_timedwait. Any idea how to remove this unnecessary blocking? Thanks!
[profiler screenshot]
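
For context, this is only my own guess: CPU-side sem_timedwait stalls often mean the GPU is starved by the input pipeline, so more DataLoader workers, pinned memory, and persistent workers are the first knobs I would try. A generic sketch with a dummy dataset standing in for the real one:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for the real image/text dataset
train_dataset = TensorDataset(
    torch.randn(256, 3, 224, 224),
    torch.randint(0, 49408, (256, 77)),
)

loader = DataLoader(
    train_dataset,
    batch_size=128,
    num_workers=8,            # raise until the CPU-side waits shrink
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
    prefetch_factor=4,
)

for images, tokens in loader:
    pass  # training step would go here
```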

Performance of VIT-B/32 is worse than RN50 on CC3M

Here are my curves. RN50 roughly matches the one shown in the repo, but ViT-B/32 is worse. I am using the hyperparameters from the README. Could you also share the performance curves of ViT-B/32 on CC?
[screenshot: training curves]

Bug in gather_cc

Hi there,

First of all, thanks for the code, I appreciate your effort!

I think there is a bug in gather_cc.py:
In line 86 there is a hardcoded 'val', which should probably be the `split` variable.

braceexpand has unexpected (non-bash-like) behavior with multiple expansions

If you give bash an argument like "foo{1..5} bar{1..5}", it expands each brace expression separately, giving you a list of length 10. braceexpand, however, takes the cross product and gives a list of length 25. This isn't necessarily wrong in general, but given how braceexpand is used in this project, I think it's not what's expected.

In particular, if you provide --train-data="/dir1/files{1..10} /dir2/files{1..10}" it ends up trying to include 100 (!!) files. It would probably make more sense to do the bash-like expansion here.
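
A small demonstration of the difference, using the braceexpand package directly; the second variant splits on whitespace first to get the bash-like behaviour:

```python
from braceexpand import braceexpand

spec = "foo{1..5} bar{1..5}"

# braceexpand treats the whole string as one pattern: cross product -> 25 results
print(len(list(braceexpand(spec))))

# bash-like: expand each whitespace-separated piece on its own -> 10 results
expanded = [p for piece in spec.split() for p in braceexpand(piece)]
print(len(expanded))
```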

Generating prompts from an image

So, I've been looking into some code for VQGAN:
https://github.com/mehdidc/feed_forward_vqgan_clip
https://github.com/nerdyrodent/VQGAN-CLIP

and they let the user pass a prompt to style/generate an image.
Here are some examples using code from @nerdyrodent:
nerdyrodent/VQGAN-CLIP#13

Must see:
https://twitter.com/e08477/status/1418440857578098691?s=21
Here there are only 4 images, each generated from a single prompt,
e.g. mushroom, spaceship, volcano, old English house on a hill (might be wrong).
[screenshot of the generated images]
But then, as you look further down, these have additional predicate prompts that style/shape the image differently:

Mushroom + marble sculpture.

What I want is to give an image to CLIP and have it tell me what it thinks the words should be.
Is this feasible/achievable? Does this repo provide any way into this? Does it need dimensionality reduction? It feels like a t-SNE problem (showing word2vec in 2 dimensions), except that under the hood it's 512 dimensions. I have yet to look at the code; maybe it will become clearer.
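
As far as I can tell, the closest thing this codebase offers out of the box is ranking a candidate vocabulary against the image rather than generating free-form text. A rough sketch of that idea; the word list, image path, and model/pretrained tag are just placeholders I picked:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

vocabulary = ["mushroom", "spaceship", "volcano", "marble sculpture", "old english house on a hill"]
image = preprocess(Image.open("my_image.jpg")).unsqueeze(0)     # hypothetical image path
text = tokenizer([f"a photo of a {word}" for word in vocabulary])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).squeeze(0)  # cosine similarity per word

top = torch.topk(similarity, k=3)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  {vocabulary[idx]}")
```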

TPU support.

It would be nice if this repo supported training on TPUs.

scripts of training on multiple nodes

Hi, is there any easy-to-use script for training CLIP on multiple nodes? I can set up training on one node (8 GPUs) now, but I need to test the scaling efficiency. Thanks for any insight~

Generalizable Text Transformer Usage

I've been chatting with some others interested in training CLIP for different domain tasks. They expressed interest in a simple way to use a pre-trained text transformer.

Some basic support for Hugging Face or generic classes of transformers shouldn't be too crazy of an extension to what is already fleshed out.

LAION 5B?

Hi !

Just in case you missed it, there is a new 5.85B-sample dataset from LAION.
Do you have any plans to train a model on it?

Best.

Model name details

Hi,

Where can we find the details behind your model naming?

Best,
Theo

Loss slowly decreases

Hi, I am attempting to use open_clip on remote sensing images from xView. I'm finding that in the first 2-3 epochs the loss decreases from 3.5 to 2.7 and then stays around 2.7 with an lr of 8e-6 (see the training curve below). Would anyone have ideas on how I can encourage further learning?

[screenshot: training loss curve]

Some background on xView:
My images are derived from xView, an object detection dataset with images like this:

[example xView image]

To generate captions for xView, I make one caption per bounding box, so the same image may appear with several different captions. Each caption is valid, since there may be multiple objects in the image.

interpretation of debug output

Hi,

I'm running src/training/main.py in debug mode, and I'm getting the following message in the terminal:


2021-12-21,18:02:32 | DEBUG | Rank 1 | tag: Software (305) - type: string (2) Tag Location: 22 - Data Location: 26 - value: b'www.meitu.com\x00'

(the same line is repeated many times)

What does it mean? I'm using 1 GPU and running with the --debug flag.
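
If it helps: those lines look like PIL's TIFF/EXIF tag dump (the images carry a Software tag set to www.meitu.com), which only shows up once logging is switched to DEBUG by --debug. Assuming that is the source, muting the PIL logger should hide them while keeping debug output from the training code:

```python
import logging

# keep --debug for the training code but silence PIL's per-image EXIF tag dumps
logging.getLogger("PIL").setLevel(logging.INFO)
```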

Second question: how do I delete an experiment so I can reuse its name?

Error in demo of README.md?

Hi!

In the "Usage" part of the readme, we use model.encode_image() and model.encode_text() before computing the dot product of the features.

However, those methods, in contrast with what is done during training,

image_features = F.normalize(image_features, dim=-1)

don't normalize the feature vectors.

Therefore it could bias the results. Am I wrong?

Best,
Théo

Expected time/epoch for conceptual captions (R50)

How long is a reasonable time per epoch when using 8 workers? I'm seeing about 8 hours/epoch for ResNet-50. Launch command from the README:

    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="/path/to/train_data.csv"  \
    --val-data="/path/to/validation_data.csv"  \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50

Thank you!

Results of using different learning rates and more training epochs

Very nice code!

I'm able to reproduce the zero-shot results on ImageNet using CC3M (2,862,387 images in total for me) and the provided sample code.

I'd like to ask if you have tried different learning rates other than 1e-3 for batch=128? Would you be able to give more insights on how you ended up using lr=1e-3?

Also, I'd like to know if you have tried more training epochs, i.e. larger than 30. I'm curious if training with more epochs would help improve the zero-shot accuracy.

Add ROOT to files written in gather_cc

Hi again,

Would it make sense to prepend ROOT to the file paths in the CSV file? After running gather_cc.py the files are in the cc_data folder (e.g. cc_data/val/00/0123.jpg), but the path in the CSV file is only val/00/0123.jpg.
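
For anyone wanting to patch their existing CSVs rather than re-run the script, something like the following should work; I am assuming a tab-separated file with a "filepath" column, which may not match exactly what gather_cc.py writes:

```python
import os

import pandas as pd

ROOT = "cc_data"                       # wherever gather_cc.py wrote the images
df = pd.read_csv("val.csv", sep="\t")  # assumed separator and filename
df["filepath"] = df["filepath"].map(lambda p: os.path.join(ROOT, p))
df.to_csv("val_fixed.csv", sep="\t", index=False)
```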

BR Andreas

Inference for non-square images

Hello,

I would like to run the different CLIP models on high definition non-square images (e.g. 720p or 1080p).
Is there a simple way to do so without deforming the images into a smaller square resolution (336x336 or 224x224)?
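
Not an official feature as far as I know, but one workaround people use is to take a few square crops across the wide frame, encode each, and average the normalized embeddings. A rough sketch under those assumptions; the model/tag and crop policy are my own choices, and it assumes landscape input:

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")

def encode_wide_image(path: str, n_crops: int = 3) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    # evenly spaced square windows along the width (assumes width >= height)
    offsets = [int(i * (w - side) / max(n_crops - 1, 1)) for i in range(n_crops)]
    crops = torch.stack([preprocess(img.crop((x, 0, x + side, side))) for x in offsets])
    with torch.no_grad():
        feats = model.encode_image(crops)
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)  # one pooled embedding for the whole frame

embedding = encode_wide_image("frame_1080p.jpg")  # hypothetical 1920x1080 frame
```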

Thank you for your work on this repository, I found it very helpful,
Simon

Loss is constant

I'm using CLIP to train on my custom dataset with the following params:

Dataset size : 50k image-text pairs
Batch size : 128
Image Size : 224
Gpus : 1
Epochs : 500

It's been running for a while now; I'm on my 15th epoch and the loss hasn't really changed. It isn't exactly constant, but it hovers around 4.8. Should I be concerned? I'm not sure why this is happening.

[screenshot: loss curve]
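
One data point that may be relevant (my own observation, not part of the original post): with a batch of 128 pairs, the contrastive loss of a model that scores every image-text pair equally is ln(128), which is almost exactly the plateau described above:

```python
import math

# cross-entropy of a uniform prediction over a 128-sample contrastive batch
print(math.log(128))  # ≈ 4.852
```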
