big_vision's Issues

Negative rho values in GSAM training

Hi! I've been trying to reproduce the GSAM results. I noticed that in the code, the learning rate (LR) warmup starts from 0, which is lower than the minimum LR for the post-warmup decay. Because of this, the rho parameter, which is scheduled proportionally with the LR, has negative values early in training.

This does not seem intentional, since rho is never supposed to be negative according to the paper. I'm curious whether fixing this would make any difference to the paper's results. My guess is that it only affects a very small amount of training (about 1/3 of the first epoch) and wouldn't change anything.
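For concreteness, a minimal sketch of the proportional schedule as described above (hypothetical names, not the repo's exact code); it shows why a warmup that starts below the minimum LR can push rho below zero:

def rho_from_lr(lr, lr_min, lr_max, rho_min, rho_max):
  # rho follows the LR linearly between (lr_min, rho_min) and (lr_max, rho_max).
  frac = (lr - lr_min) / (lr_max - lr_min)
  # frac < 0 whenever lr < lr_min, e.g. early in a warmup that starts at 0,
  # so rho drops below rho_min and can become negative.
  return rho_min + (rho_max - rho_min) * frac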

@lucasb-eyer @juntang-zhuang

About RL fine-tuning code release

Hi!

First of all, thanks for sharing those amazing and helpful codebases. I wonder if there is a plan to release the full code of the ICML'23 paper "Tuning computer vision models with task rewards", including the instructions to reproduce the results in the paper.

Thank you :)

Running out of RAM on cloud TPU when reading data from Cloud Storage

Hi! I am trying to run the vit_s16_i1k.py script on a TPU v3-8 machine. I put the data in a Google Cloud Storage bucket and am running the following command:

TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'`

The training runs for a few iterations and then fails with a "Killed" message. When I look at the htop output, the memory used by the process grows all the way to the 335G available before the process crashes.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM and putting the data there. In that case the same process only uses 205G of RAM and runs normally.

Question about SigLIP

Hello, Google Research team!

Thanks a lot for your work! I came across your SigLIP paper and was curious to reproduce the results myself on another dataset. I checked the README and it says that the SigLiT code is in TODO status. However, in the codebase both sigmoid_loss and chunked_sigmoid_loss are implemented and integrated into the training script, and a config is defined in this .ipynb. So my question is the following: is there still something missing for SigLIP, or can I already try to run it with a command like this:

big_vision.trainers.proj.image_text.contrastive \
    --config ... \
    --workdir ...

I also have another question about the paper itself, or rather a request for a recommendation. You pre-trained some models with a frozen image encoder and achieved very competitive scores in 2 days. However, with a 20k batch size and 107k total steps, the model saw about 2.14B image-text examples, with the text model initialized from scratch and a huge ViT-g/14. What do you think about the inverse experiment: how long would it take to train a ViT from scratch given a good pre-trained text representation? The reason I'm asking is that in my research I deal with pretty non-conventional images, but regular texts.

Looking forward to more research, thank you!

Is FlexiViT also flexible with image resolution?

Hello, big_vision team!

Thanks for your work on the repository.
I trained FlexiViT-B on the fine-grained dataset CUB-200-2011 using pretrained weights from in21k at a fixed resolution, say 480r, but I found it also performs well when tested at a smaller resolution, 240r, and its accuracy nearly matches what it would achieve if trained at 240r. So my question is: is FlexiViT flexible with respect to image resolution as well?
Thank you and looking forward to your reply.

requirements issue

Hi, I am running pip install -r requirements.txt, but it keeps downloading new dev versions of tfds-nightly.

Memory Efficient Attention integration

Hello, big_vision team!

Thanks for your work on the repository. Looking through the code, I noticed that ViT uses classical attention (see line 91 of the ViT implementation). It seems like it should be relatively easy to replace the current attention implementation with a memory-efficient alternative from flaxformer (line 595 in flaxformer), simply passing dot_product_attention_multihead as attention_fn to nn.MultiHeadDotProductAttention (line 221 in flax). I think such an improvement is worth considering, since the Flash Attention authors reported up to a 2.4x speedup on long sequences (1k-4k tokens).

What do you think about it? Are there any limitations that make efficient attention integration harder than it seems? I'm not experienced in JAX, so your feedback would be much appreciated.
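A rough sketch of the kind of swap being discussed (the memory_efficient_attention below is a placeholder with a flax-compatible signature, not flaxformer's actual function):

import flax.linen as nn

def memory_efficient_attention(query, key, value, bias=None, mask=None, **kwargs):
  # Placeholder: a chunked / flash-style kernel would go here; this sketch just
  # falls back to flax's reference dot-product attention.
  return nn.dot_product_attention(query, key, value, bias=bias, mask=mask, **kwargs)

attn = nn.MultiHeadDotProductAttention(
    num_heads=12,
    attention_fn=memory_efficient_attention,  # the swap point described above
)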

Error with putting arrays on CPU in cloud TPUs

Hi, I've been setting up big_vision on a v4-32 TPU pod, and I run into this error whenever I call u.put_cpu:

jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: Cannot copy array to non-addressable device TFRT_CPU_0

I'm guessing the CPUs on the TPU pod aren't configured properly? Is there a way around this or a way to fix this issue?

I'm totally new to TPUs; let me know if you need more information.
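For reference, a minimal sketch (not big_vision's u.put_cpu) of putting an array on a CPU device that is addressable from the current process; whether this sidesteps the pod issue is an assumption:

import jax
import jax.numpy as jnp

x = jnp.ones((4,))
# On a multi-host pod, only this host's devices are addressable from this process.
local_cpu = jax.local_devices(backend="cpu")[0]
x_cpu = jax.device_put(x, local_cpu)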

Confusion on FlexiViT

Hi, thanks for bringing us such great work! I have two questions regarding the paper.

  1. Since the PI-resize method does not introduce any learnable parameters, it should be compatible with any ViT model. Can we therefore use PI-resize in a zero-shot manner? If so, what is the point of training FlexiViT? I understand that, since the patch size can be (almost) any number with PI-resize, we can transfer the knowledge of ViT-8 through distillation. But is there any difference between training a FlexiViT and using PI-resize directly on the ViT-8 model (without training)? In Figure 3, the authors mention that "Standard ViTs (ViT-16/ViT-30) are not flexible", but there the authors "simply resize the patch embedding weights ω and the position embeddings π with bilinear interpolation", not with PI-resize.

  2. Will the weight of FlexiCLIP be released someday?

Thanks, I am really looking forward to the answers!

Best,

Zilun

Any extra dataset prep needed?

I have followed the instructions from the README and have set up a TPU v3-8 machine.

I have hosted the ImageNet-1k (imagenet2012) dataset in a separate bucket, structured following the instructions from here.

While launching training, I am using the following command:

gcloud alpha compute tpus tpu-vm ssh $NAME --zone=$ZONE --worker=all --command "TFDS_DATA_DIR=gs://imagenet-1k/tensorflow_datasets bash big_vision/run_tpu.sh big_vision.train --config big_vision/configs/vit_s16_i1k.py  --workdir gs://$GS_BUCKET_NAME/big_vision/workdir/`date '+%m-%d_%H%M'`"

It results in the following:

SSH key found in project metadata; not updating instance.
SSH: Attempting to connect to worker 0...
2022-05-10 10:30:25.858388: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319919: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-10 10:30:27.319952: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
I0510 10:30:27.335715 140289404775488 xla_bridge.py:263] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0510 10:30:27.336199 140289404775488 xla_bridge.py:263] Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter TPU Host
I0510 10:30:30.058175 140289404775488 train.py:65] Hello from process 0 holding 8/8 devices and writing to workdir gs://big_vision_exp/big_vision/workdir/05-10_1030.
I0510 10:30:30.568850 140289404775488 train.py:95] NOTE: Global batch size 1024 on 1 hosts results in 1024 local batch size. With 8 dev per host (8 dev total), that's a 128 per-device batch size.
I0510 10:30:30.570343 140289404775488 train.py:95] NOTE: Initializing train dataset...
I0510 10:30:31.039579 140289404775488 dataset_info.py:522] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: imagenet2012/5.1.0
I0510 10:30:31.303886 140289404775488 dataset_info.py:439] Load dataset info from /tmp/tmpggpl8znitfds
I0510 10:30:31.308489 140289404775488 dataset_info.py:492] Field info.description from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308714 140289404775488 dataset_info.py:492] Field info.release_notes from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308900 140289404775488 dataset_info.py:492] Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.308959 140289404775488 dataset_info.py:492] Field info.module_name from disk and from code do not match. Keeping the one from code.
I0510 10:30:31.309248 140289404775488 logging_logger.py:44] Constructing tf.data.Dataset imagenet2012 for split _EvenSplit(split='train[:99%]', index=0, count=1, drop_remainder=False), from gs://imagenet-1k/tensorflow_datasets/imagenet2012/5.1.0
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/spsayakpaul/big_vision/train.py", line 372, in <module>
    app.run(main)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/spsayakpaul/big_vision/train.py", line 122, in main
    train_ds = input_pipeline.make_for_train(
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 69, in make_for_train
    data, _ = get_dataset_tfds(dataset=dataset, split=split,
  File "/home/spsayakpaul/big_vision/input_pipeline.py", line 53, in get_dataset_tfds
    return builder.as_dataset(
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/logging/__init__.py", line 81, in decorator
    return function(*args, **kwargs)
  File "/home/spsayakpaul/bv_venv/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 565, in as_dataset
    raise AssertionError(
AssertionError: Dataset imagenet2012: could not find data in gs://imagenet-1k/tensorflow_datasets. Please make sure to call dataset_builder.download_and_prepare(), or pass download=True to tfds.load() before trying to access the tf.data.Dataset object.

Is there anything I'm missing here?
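The assertion suggests the dataset was never prepared into the bucket. A hedged sketch of what that preparation might look like with TFDS (assuming the ImageNet archives were already downloaded manually to a hypothetical manual_dir, since imagenet2012 cannot be auto-downloaded):

import tensorflow_datasets as tfds

# Prepare imagenet2012 directly into the GCS data dir used by the training command.
builder = tfds.builder("imagenet2012", data_dir="gs://imagenet-1k/tensorflow_datasets")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(manual_dir="/path/to/imagenet_archives"))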

Is there a PyTorch version of CLIPPO?

Thank you for releasing code for these inspiring works!
In particular, I'm interested in CLIPPO.

Are there any plans to release a PyTorch version?

TPU utilization could be improved further?

Training details are in #2

I think the TPU utilization is a bit lower than expected.

Is this expected?

I understand that other factors, such as network access, might contribute to this, but I wanted to check.

Contrastive Input Pipeline

Hi, this is an amazing codebase for big-vision tasks!

I wanted to re-implement MOCO, but I was unsure how to modify the input pipeline + data augmentation to allow for applying independent random augmentations to the same image. Is there a simple way to do this?

My current implementation just applies the augmentations after the batch of images leaves the input pipeline (without any augmentation), but this requires me to write new data augmentation functions in Jax, which isn't ideal. Do you have any ideas on how to integrate this into the input pipeline?

Any help would be appreciated and let me know if you need more information.
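A rough sketch of the kind of two-view mapping in question, expressed directly in tf.data (placeholder tf.image augmentations; big_vision's own pp ops would presumably be used instead):

import tensorflow as tf

def augment(image):
  # Placeholder augmentations; each call draws fresh randomness.
  image = tf.image.random_flip_left_right(image)
  image = tf.image.random_brightness(image, max_delta=0.2)
  return image

def two_views(example):
  # Two independently augmented views of the same image, plus the label.
  return {"view1": augment(example["image"]),
          "view2": augment(example["image"]),
          "label": example["label"]}

# ds = ds.map(two_views, num_parallel_calls=tf.data.AUTOTUNE)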

Load ViT with CLIPPO Weights

Hi,
I am trying to load the ViT model in big_vision with the CLIPPO weights.

!git clone --branch=main https://github.com/google-research/big_vision
!cd big_vision && git checkout fd2d3bd2efc9d89ea959f16cd2f58ae8a495cd44  # this is the clippo commit

Now my script (imports included; ds is the tf.data dataset I load beforehand):

import numpy as np
import optax
import flax.linen as nn
from big_vision import utils
from big_vision.models import vit

checkpoint_path = '/home/ahmad/Desktop/projects/big_vision/big_vision/weights/clippo_b16_yfcc100m_i21k_init_75c4.npz'
init_params = utils.load_checkpoint(None, checkpoint_path)['params']
model = vit.Model(num_classes=768, variant=None, name="img")
classifier = nn.Dense(10)  # 10 classes
opt = optax.adam(1e-3)

# Training
for epoch in range(10):
  for batch in ds['train'].batch(2):
    # Get images and labels
    images = batch['image'].numpy()
    # Convert the images from (28, 28, 1) grayscale to (28, 28, 3) RGB
    images = np.repeat(images, 3, axis=-1)
    print(images.shape)
    labels = batch['label'].numpy()

    # Forward pass
    zimg, _ = model.apply({'params': init_params}, images)

This is the error I get

---------------------------------------------------------------------------
ScopeParamNotFoundError                   Traceback (most recent call last)
Cell In[12], line 48
     45 labels = batch['label'].numpy()
     47 # Forward pass
---> 48 zimg, _ = model.apply({'params': init_params}, images)
     50 # Get logits
     51 logits = classifier.apply({'params': classifier.params}, zimg)

    [... skipping hidden 6 frame]

File ~/Desktop/projects/big_vision/big_vision/models/vit.py:186, in _Model.__call__(self, image, train)
    183 out = {}
    185 # Patch extraction
--> 186 x = out["stem"] = nn.Conv(
    187     self.width,
    188     self.patch_size,
    189     strides=self.patch_size,
    190     padding="VALID",
    191     name="embedding",
    192 )(image)
    194 n, h, w, c = x.shape
    195 x = jnp.reshape(x, [n, h * w, c])

    [... skipping hidden 2 frame]

File ~/anaconda3/envs/robustness/lib/python3.8/site-packages/flax/linen/linear.py:480, in _Conv.__call__(self, inputs)
    474 if self.mask is not None and self.mask.shape != kernel_shape:
    475   raise ValueError(
    476       'Mask needs to have the same shape as weights. '
    477       f'Shapes are: {self.mask.shape}, {kernel_shape}'
    478   )
--> 480 kernel = self.param(
    481     'kernel', self.kernel_init, kernel_shape, self.param_dtype
    482 )
    484 if self.mask is not None:
    485   kernel *= self.mask

    [... skipping hidden 1 frame]

File ~/anaconda3/envs/robustness/lib/python3.8/site-packages/flax/core/scope.py:896, in Scope.param(self, name, init_fn, unbox, *init_args)
    894   if self.is_collection_empty('params'):
    895     raise errors.ScopeCollectionNotFound('params', name, self.path_text)
--> 896   raise errors.ScopeParamNotFoundError(name, self.path_text)
    897 value = init_fn(self.make_rng('params'), *init_args)
    898 self.put_variable('params', name, value)

ScopeParamNotFoundError: Could not find parameter named "kernel" in scope "/embedding". (https://flax.readthedocs.io/en/latest/api_reference/flax.errors.html#flax.errors.ScopeParamNotFoundError)

The npz file has the following keys:

dict_keys(['chrono', 'opt', 'params'])

and in the params parameter keys are:

dict_keys(['t', 'img'])

and in the img we have

init_params['params']['img'].keys()
>>>dict_keys(['pos_embedding', 'MAPHead_0', 'Transformer', 'embedding', 'head'])

My Jax version is:

'0.4.13'

and flax version is:

'0.7.2'

Question About Listed ViT Models in the configs/proj/flexivit/README.md

Hello,

First of all, thank you very much for releasing many helpful materials and code samples of the interesting work FlexiVit.

When I went through the paper, the models referred to as ViT-B-16 and ViT-B-30 seem to be the baseline ViT models trained with fixed patch sizes (16 and 30 respectively). Accordingly, their positional embedding grid sizes should be 15 and 8, if I am not wrong (image size divided by patch size).
However, when I downloaded and loaded the .npz files of these models from the README, I found that the patch size and the positional embedding size were 32 and 7, which matches the setup of the FlexiViT-B model mentioned in the paper, but not that of the baseline ViT models (given my understanding).

Thus, I was curious whether the links map to the wrong models or whether I misunderstood the setup mentioned in the paper regarding these models.

Could you please help me with this matter?
Thanks!

Clarification: SigLIP Image Transform

Thanks for open-sourcing the SigLIP models!

Clarification question: in the demo IPython notebook, the image transform function has the form pp_img = pp_builder.get_preprocess_fn(f'resize({RES})|value_range(-1, 1)').

Looking at the code here, this seems to be resizing an image to RES x RES (warping aspect ratio).

Is this the expected behavior? Were the SigLIP models trained with this transform (aspect ratio warping)?
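For comparison, an aspect-preserving variant of that preprocessing string might look like this (assuming big_vision's resize_small and central_crop pp ops; whether the released models expect it is exactly the question above):

pp_img = pp_builder.get_preprocess_fn(
    f'resize_small({RES})|central_crop({RES})|value_range(-1, 1)')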

Question regarding value range (-1,1)

Hi, thank you so much for releasing code for these inspiring works. I notice that the config file uses value_range(-1, 1) instead of vgg_value_range. Is (-1, 1) necessary for reproducing results on the standard ImageNet dataset?

Thank you very much for your time and help.

Errors in notebooks

Hi.

  1. When running the uvim_depth_task.ipynb notebook in Colab, the line
     oracle_params, oracle_state = vit.load(None, "depth_stageI_params.npz")
     raises:
     AttributeError: module 'big_vision.utils' has no attribute 'load_checkpoint'

  2. The same error occurs in clippo_colab.ipynb on the line
     params = utils.load_checkpoint(None, checkpoint_path)['params']

  3. When running lit.ipynb, the line
     params0 = model.init(jax.random.PRNGKey(42), *init_params)['params'].unfreeze()
     raises:
     AttributeError: 'dict' object has no attribute 'unfreeze'

Announcement: big_vision is transitioning from jax.pmap to jax.jit.

In the coming 1-2 weeks, big_vision is expected to transition from pmap-based parallelism to jit-based parallelism. This will enable more flexible parallelisation strategies, including, but not limited to, ZeRO and fully-sharded data-parallel (aka FSDP) training.

This transition may temporarily break project-specific code (or we will just remove such code). If you want to read/run the old code, please see the table at the end of the README for the project-specific commits to sync to.
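For readers unfamiliar with jit-based parallelism, a minimal toy sketch of the idea (not big_vision code): with jax.jit, parallelism is expressed through array shardings, and FSDP-style training falls out of also sharding the parameters.

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
shard = NamedSharding(mesh, P("data"))            # shard the leading dim over all devices

w = jax.device_put(jnp.zeros((1024, 10)), shard)  # FSDP-style: parameters are sharded too
x = jax.device_put(jnp.ones((256, 1024)), shard)  # data-parallel batch sharding
y = jax.device_put(jnp.ones((256, 10)), shard)

@jax.jit
def train_step(w, x, y):
  # A toy mean-squared-error step; the compiler inserts the needed collectives.
  loss, grads = jax.value_and_grad(lambda w: jnp.mean((x @ w - y) ** 2))(w)
  return w - 0.1 * grads, loss

w, loss = train_step(w, x, y)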

questions about t-SNE visualization in FlexiViT

Hi, FlexiViT is a very inspirational idea.

However, I'm kind of stuck at the t-SNE visualization in Fig. 6 of the paper.

Does the t-SNE use the arccosine-transformed CKA as the precomputed metric?

If so, how do we calculate the CKA similarity? Is it between a FlexiViT at different patch sizes and a standard ViT at a fixed patch size?
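For reference, standard linear CKA and the arccosine transform mentioned above can be computed roughly as follows (a sketch; the exact features compared in the paper are the open question here):

import numpy as np

def linear_cka(x, y):
  # x: (n, d1), y: (n, d2) feature matrices for the same n inputs.
  x = x - x.mean(axis=0)
  y = y - y.mean(axis=0)
  hsic = np.linalg.norm(x.T @ y, "fro") ** 2
  return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def arccos_cka_distance(x, y):
  # The arccosine transform turns the similarity into a distance for t-SNE.
  return np.arccos(np.clip(linear_cka(x, y), -1.0, 1.0))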

AttributeError pp_img in lit notebook

Hi.
An AttributeError is raised when running the big_vision/blob/main/big_vision/configs/proj/image_text/lit.ipynb notebook in Colab.

P.S.: it is raised here, at config.pp_img.
P.P.S.: an AttributeError will also be raised here, for config.pp_txt.

augmentation and regularization used in MLP-Mixer

Hello! Thank you for your work!
In the MLP-Mixer paper, when training Mixer-B/16 on ImageNet-1k from scratch, it is said that extra regularization is applied to reach an accuracy of 76%. I would like to know the detailed augmentation and regularization strategy used for this experiment. Is there a config file for it?
Thank you for your help! : )

Mixup Per Example?

Hi! I was wondering why the implementation of mixup uses a single sampled $a$ per batch, as opposed to using a different sample of $a$ per batch element. Intuitively, it seems that sharing a single sample should lead to higher variance in the optimization process.

def get_mixup(rng, p):
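For contrast, a per-example variant might look roughly like this (a hedged sketch, not the repo's get_mixup; it assumes one-hot labels and a Beta(p, p) coefficient per example):

import jax
import jax.numpy as jnp

def mixup_per_example(rng, images, labels, p=0.2):
  n = images.shape[0]
  a = jax.random.beta(rng, p, p, shape=(n,))   # one mixing coefficient per example
  a = jnp.maximum(a, 1.0 - a)                  # optional: keep the original image dominant
  perm = jax.random.permutation(jax.random.fold_in(rng, 1), n)
  a_img = a[:, None, None, None]
  mixed_images = a_img * images + (1 - a_img) * images[perm]
  mixed_labels = a[:, None] * labels + (1 - a[:, None]) * labels[perm]
  return mixed_images, mixed_labels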

bfloat16 Training

Thank you for releasing code for these inspiring works!

I tried to use bfloat16 for the model parameters and manually converted images and labels from float32 to bfloat16 before feeding them in for training, but I noticed that training slowed down by about 3x. The performance also became noticeably worse. Am I using bfloat16 the wrong way?

Thank you very much for your help.

question about FlexiViT

FlexiViT is a very imaginative work.
I have also been struggling with making the patch size flexible.
I want to know how PI-resize from Section 3.4 is implemented in the code, and how PI-resize is handled during training:

  • Does PI-resize need learnable parameters?
  • Does the loss function need an extra term to constrain and optimize it?
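A rough NumPy sketch of PI-resize as described in Section 3.4 (the repository's actual implementation is resample_patchemb in big_vision/models/proj/flexi/vit.py; this is only an illustration under a bilinear-resize assumption). Notably, in this sketch there are no learnable parameters and no extra loss term: the new kernel is obtained in closed form via a pseudo-inverse.

import numpy as np
import jax

def resize_patch(patch, new_hw):
  return np.asarray(jax.image.resize(patch, new_hw, method="bilinear"))

def pi_resize_patch_embed(w, new_hw):
  # w: (ph, pw, c, d) patch-embedding kernel; returns a (new_h, new_w, c, d) kernel.
  ph, pw, c, d = w.shape
  # Build the linear map B that resizes a flattened (ph, pw) patch to the new
  # size, by resizing each basis vector (resize is linear in its input).
  basis = np.eye(ph * pw).reshape(ph * pw, ph, pw)
  B = np.stack([resize_patch(b, new_hw).reshape(-1) for b in basis], axis=1)
  # We want <x, w> == <B x, w_new> for all patches x, i.e. B^T w_new = w,
  # solved in the least-squares / least-norm sense with the pseudo-inverse.
  w_flat = w.reshape(ph * pw, c * d)
  w_new = np.linalg.pinv(B.T) @ w_flat
  return w_new.reshape(*new_hw, c, d)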

[BUG] in big_vision.models.proj.flexi.vit

Hello, big_vision team!
Thanks for your work on the repository. I found two small typos in the FlexiViT code:

line 194
restored_params = utils.load_params(None, init_file)
==>
restored_params = utils.load_params(init_file)

line 205
restored_params["embedding"]["kernel"] = resample_patchemb(old=old_patchemb, new_hw=model_cfg.patch_size)
==>
restored_params["embedding"]["kernel"] = resample_patchemb(old=old_patchemb, new_hw=model_cfg.get("patch_size"))

Accuracy of vit-b-16 training

Hi, may I ask what top-1 accuracy ViT-B/16 reaches when trained on ImageNet-1k with the config file "vit_i1k.py"? I find that the related paper reports an accuracy of about 74.6.

Thank you very much!

Best
Lucas
