
eva's Introduction

EVA: Visual Representation Fantasies from BAAI


Contact

  • We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers, and interns. If you are interested in working with us on foundation models, self-supervised learning, and multimodal learning, please contact Xinlong Wang ([email protected]).

License

The content of this project is licensed under the terms described in LICENSE.


eva's People

Contributors

asden, camielk, caoyue10, encounter1997, ivysochyn, jiahuichen-github, quan-sun, robert-zwr, wxinlong, yuxin-cv

eva's Issues

The MIM target for EVA

Sorry to bother you. I notice that EVA uses the projected CLIP feature as the MIM target. I am wondering why you don't use the feature before projection, since that is the original representation of the ViT model. Was this choice determined by experimental performance?
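For context, a minimal sketch of the objective under discussion (illustrative names; EVA regresses CLIP vision features at the masked-out positions, written here with a cosine-similarity loss):

import torch
import torch.nn.functional as F

def mim_loss(student_feats, clip_feats, mask):
    """Toy MIM objective: regress the (projected) CLIP feature at masked patches.

    student_feats: (B, N, D) predictions from the student ViT
    clip_feats:    (B, N, D) targets from the frozen CLIP vision tower
    mask:          (B, N) bool, True where a patch was masked out
    """
    pred = F.normalize(student_feats[mask], dim=-1)  # (M, D), M = total masked patches
    target = F.normalize(clip_feats[mask], dim=-1)
    return -(pred * target).sum(dim=-1).mean()       # negative cosine similarity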

Have you tried Swin Transformer?

In the current EVA version, the backbone is a vanilla ViT, but I noticed that one of the authors is an inventor of Swin Transformer.
Have you tried replacing the ViT with a Swin Transformer, and how does EVA perform in that case?
Thank you!

COCO object detection image size and inference speed

First of all, thank you for publishing all this hard work!

After some detectron2 struggles (it has to be built from the EVA fork in /EVA/det/, not from the facebookresearch repo!) I managed to run the COCO evaluation example and reproduce your object detection and instance segmentation results. Amazing stuff!!

My first question is regarding the image size. The detectron2 default image size for COCO is 1024. In your paper you say that you fine-tuned on COCO and LVIS using 1280² inputs, but in the code example published here the default image size is 1536 (cascade_mask_rcnn_vitdet_eva_1536.py). Why did you use 1536 here?

I did some inference experiments and it seems 1280 is indeed an optimal image size for inference:

image_size   inference speed (3090)   bbox mAP (on 100 val2017 images only)
1536         ~2.35 s/iter             67.1
1280         ~1.05 s/iter             67.4
1024         ~1.23 s/iter             66.5
612          ~0.95 s/iter             59.8

Second question: I used AMP to autocast the torch model to FP16, which already reduced the 1280² inference time from 1.05 s/iter to 0.75 s/iter. I am interested in further optimizing the model for faster inference. Do you think it is possible to optimize your model using TensorRT or ONNX?
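For reference, a minimal sketch of the FP16 autocast inference described above (a torchvision classifier stands in for the detector; the same pattern applies to the detectron2 model):

import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()  # stand-in for the detector
x = torch.randn(1, 3, 1280, 1280, device='cuda')

# Autocast runs eligible ops in FP16 while keeping numerically sensitive ops in FP32.
with torch.no_grad(), torch.cuda.amp.autocast():
    y = model(x)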

Thanks again!

EVA 2.0-CLIP

Hi, nice work on EVA-2.0! Is there any plan to train EVA 2.0 on CLIP, or to scale EVA 2.0 to giant size?

Can't load instance segmentation model

Hello, thanks for the amazing work :)

I ran into a problem when trying to load the instance segmentation model from its .py config file, since I want to do single-image inference. Can you please help me? My code is:

from detectron2.config import LazyConfig
from detectron2.modeling import build_model

# I would like to use this model: EVA/det/projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_eva.py 
config_file = 'path/to/COCO/instance_segmentation model'
checkpoint_file = './drive/MyDrive/eva_coco_seg.pth'

cfg = LazyConfig.load(config_file)
model = build_model(cfg)

And I get the following error:

/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py in _get_node(self, key, validate_access, validate_key, throw_on_missing_value, throw_on_missing_key)
    478         if value is None:
    479             if throw_on_missing_key:
--> 480                 raise ConfigKeyError(f"Missing key {key!s}")
    481         elif throw_on_missing_value and value._is_missing():
    482             raise MissingMandatoryValue("Missing mandatory value: $KEY")

ConfigAttributeError: Missing key MODEL
    full_key: MODEL
    object_type=dict

I also tried other detectron2 methods for loading the model, but they still do not work.
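A possible explanation (hedged; untested against this exact config): LazyConfig-style .py configs expose the model under cfg.model rather than the yacs MODEL node that build_model expects, which matches the ConfigAttributeError above. A sketch along these lines should avoid it:

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate

cfg = LazyConfig.load('path/to/COCO/cascade_mask_rcnn_vitdet_eva.py')
model = instantiate(cfg.model)  # lazy configs keep the model under cfg.model, not MODEL
DetectionCheckpointer(model).load('./drive/MyDrive/eva_coco_seg.pth')
model.eval()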

The setting of the pt_hw_seq_len hyperparameter

Hi, thanks for the great work. I am confused about the meaning of pt_hw_seq_len. Is there a specific reason why it is set to 16, e.g., following the pre-training setting, where 224/14 = 16? Can it be set to 1, or be equal to ft_seq_len? In those cases, the position t would be normalized to 0-1 or left unnormalized.
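For context, a sketch of how such a rescaling is often implemented (illustrative only, not necessarily EVA's exact code): the fine-tuning grid is mapped into the coordinate range seen during pre-training, so the rotary frequencies match those the model was trained with.

import torch

def rope_positions(ft_seq_len: int, pt_hw_seq_len: int = 16) -> torch.Tensor:
    # Rescale fine-tuning grid positions into the pre-training coordinate range
    # (224 px / patch size 14 = 16 positions per axis during pre-training).
    return torch.arange(ft_seq_len) / ft_seq_len * pt_hw_seq_len

Under this reading, pt_hw_seq_len = 1 would normalize positions to [0, 1), and pt_hw_seq_len = ft_seq_len would leave them unnormalized, which matches the two cases the question describes.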

Some questions about EVA pretraining

  1. In your opinion, is EVA a method of both model scaling and data scaling? Does pretraining with more data (such as the data used in CLIP fine-tuning) yield better results than using only the 30M images described in the paper? What data is optimal for EVA?
  2. How does the teacher model influence the performance of the student model? Is it possible to replace the CLIP model with a supervised model trained on IN21k?
  3. In #19, you mentioned that ViT has many desirable properties, but some researchers have observed that scaling up ViT models can lead to instability during fp16 training (such as overflow in the forward pass and underflow in the backward pass). Why do you continue to choose ViT as the backbone over alternatives like Swin Transformer, and could you elaborate on ViT's "good properties"? Thank you.

detection inference error

Hi, thanks for the great work!

I tried detection inference with the following code

!python demo/demo.py \
    --config-file projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_eva.py \
    --input test.jpg \
    --output detect_test \
    --opts MODEL.WEIGHTS eva_coco_det.pth

And the following error occurred.

yaml.parser.ParserError: expected '<document start>', but found '<scalar>'
  in "projects\ViTDet\configs\COCO\cascade_mask_rcnn_vitdet_eva.py", line 6, column 5

Do you know why this error occurs?
Or can you show me sample detection inference code?

Thank you very much!
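A likely cause (an inference from the traceback, not confirmed by the maintainers here): demo.py builds its config with cfg.merge_from_file, which parses --config-file as YAML, so passing a LazyConfig-style .py file makes the YAML parser choke on Python source. Inference with these configs needs LazyConfig.load plus a small custom script, along the lines of the loading sketch above and the single-image sketch further below.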

warning while training

Hi there,
I am experiencing this warning while training:

[01/13 09:56:11 d2.engine.train_loop]: Starting training from iteration 0
/usr/local/lib/python3.8/dist-packages/shapely/set_operations.py:133: RuntimeWarning: invalid value encountered in intersection
  return lib.intersection(a, b, **kwargs)
(the shapely warning above repeats many times throughout training)
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[01/13 09:56:45 d2.utils.events]: eta: 20:18:34 iter: 19 total_loss: 14.61 loss_cls_stage0: 4.223 loss_box_reg_stage0: 0.002018 loss_cls_stage1: 4.189 loss_box_reg_stage1: 0.003367 loss_cls_stage2: 4.404 loss_box_reg_stage2: 0.00404 loss_mask: 0.693 loss_rpn_cls: 0.6935 loss_rpn_loc: 0.3204 time: 1.6258 data_time: 0.0255 lr: 9.7405e-07 max_mem: 28114M

.py file for single image inference

Thank you so much for releasing this, I can't wait to try it on my own data. Are you planning to release a .py script for single-image inference?
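Until an official script lands, here is a minimal single-image inference sketch under stated assumptions (the config path and checkpoint name are illustrative; the input follows detectron2's standard list-of-dicts format with a BGR, CHW, uint8 image):

import cv2
import torch
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate

cfg = LazyConfig.load('projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_eva.py')
model = instantiate(cfg.model)
DetectionCheckpointer(model).load('eva_coco_det.pth')  # hypothetical checkpoint path
model = model.eval().cuda()

img = cv2.imread('test.jpg')  # BGR, HWC, uint8 -- detectron2's default input format
inputs = [{
    'image': torch.as_tensor(img.transpose(2, 0, 1).copy()),  # HWC -> CHW tensor
    'height': img.shape[0],  # resolution at which predictions are returned
    'width': img.shape[1],
}]
with torch.no_grad():
    instances = model(inputs)[0]['instances']  # boxes, scores, classes, masks

Note that a real pipeline would also resize the input the way the eval dataloader does (e.g., ResizeShortestEdge, as in detectron2's DefaultPredictor); this sketch feeds the raw image for brevity.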

Installation on Colab

Please, I am having issues installing EVA.
Can anyone help me install and configure EVA in a Google Colab notebook?
If possible, please share a Colab notebook.

assert result.hostname is not None

Thanks for your excellent work!
When I use --num-gpus=2, I get the following error. I can't solve the problem and hope to get your help.

Traceback (most recent call last):
  File "tools/lazyconfig_train_net.py", line 125, in <module>
    launch(
  File "/public/home/1/code/EVA/det/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/public/home/1/code/EVA/det/detectron2/engine/launch.py", line 108, in _distributed_worker
    raise e
  File "/public/home/1/code/EVA/det/detectron2/engine/launch.py", line 98, in _distributed_worker
    dist.init_process_group(
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 520, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/public/home/1/Anaconda3/envs/EVA/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 141, in _tcp_rendezvous_handler
    assert result.hostname is not None
AssertionError

This is the command I run:

python tools/lazyconfig_train_net.py --num-gpus 2 \
    --num-machines 1 --machine-rank 0 --dist-url "tcp://$MASTER_ADDR:60900" \
    --config-file projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_eva.py
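A plausible cause (an assumption based on where the assertion lives, not something confirmed in this thread): if MASTER_ADDR is unset in the shell, --dist-url expands to "tcp://:60900", whose parsed hostname is None, which is exactly the assertion that fails:

from urllib.parse import urlparse

# With MASTER_ADDR unset, the rendezvous URL has an empty host:
print(urlparse('tcp://:60900').hostname)           # None -> the assert fires
print(urlparse('tcp://127.0.0.1:60900').hostname)  # '127.0.0.1' -> the assert passes

Exporting MASTER_ADDR (e.g., to 127.0.0.1 on a single machine) before launching would be the first thing to check.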

EVA trained on Objects365

Hello,

I am trying to work with EVA pretrained on Objects365, but I have a few questions about it. When I load the model I get this message:


Some model parameters or buffers are not found in the checkpoint:
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.norm.{bias, weight}
roi_heads.mask_head.mask_fcn1.weight
roi_heads.mask_head.mask_fcn2.norm.{bias, weight}
roi_heads.mask_head.mask_fcn2.weight
roi_heads.mask_head.mask_fcn3.norm.{bias, weight}
roi_heads.mask_head.mask_fcn3.weight
roi_heads.mask_head.mask_fcn4.norm.{bias, weight}
roi_heads.mask_head.mask_fcn4.weight
roi_heads.mask_head.predictor.{bias, weight}

Does that mean the model has not been trained with a Mask R-CNN architecture? So, only object detection pretraining?

Would it be possible to share the lazy config file of this checkpoint ?

Thanks!

Representation of Image in EVA

Hi, the EVA model is trained with image tokens, and you use average pooling for the image representation during fine-tuning, so I think the CLS token is not well learned. But EVA-CLIP is initialized from the EVA model, whose CLS token is not well learned, and uses the CLS token for pre-training. I am wondering why you do not use average pooling for the image representation to align with the text?
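To make the two representations being compared concrete (toy tensors standing in for real ViT outputs):

import torch

feats = torch.randn(2, 1 + 196, 768)  # toy ViT output: CLS token followed by 14x14 patch tokens
cls_repr = feats[:, 0]                # CLS-token representation (what the question says EVA-CLIP aligns)
avg_repr = feats[:, 1:].mean(dim=1)   # average-pooled patch representation (used during fine-tuning)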

How to handle position encoding at different resolutions?

Thanks to the authors for the great work!

In EVA-02, the model changes the position encoding from RPE to RoPE. I'd like to understand how this change behaves when the input resolution differs from the pre-training one. Is it possible to feed any resolution for zero-shot tasks, or does the model have to be fine-tuned for different resolutions, as in detection or segmentation?
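One property worth noting (a general fact about RoPE, not a claim about EVA-02's exact implementation): rotary embeddings keep no learned per-position table; they rotate query/key feature pairs by angles computed from the position, so a new resolution only requires a new position grid (optionally rescaled, as in the pt_hw_seq_len sketch above) rather than resizing a learned parameter tensor the way RPE tables must be. A minimal 1D sketch:

import torch

def apply_rope_1d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (..., N, D) by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[:, None] * inv_freq                # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

# Works for any sequence length: only the position grid changes with resolution.
q = torch.randn(8, 196, 64)  # (heads, tokens, head_dim)
q_rot = apply_rope_1d(q, torch.arange(196, dtype=torch.float32))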

The pos_emb used in EVA pretraining and EVA det

Hello, thanks for your great work! Sorry to bother you, but I have some doubts about the position embeddings in the code.
(1) I find that the pos_emb used in EVA pretraining is abs_pos_emb: line 68 sets parser.set_defaults(abs_pos_emb=True)
(https://github.com/baaivision/EVA/blob/master/eva/run_eva_pretraining.py).
As far as I know, the original BEiT code uses rel_pos. Why did you make this modification? Is this setting helpful for pre-training or for some downstream tasks?
(2) I found that when performing detection tasks (eva_det), the ViT's pos_emb uses rel_pos, which is inconsistent with the pre-training stage.
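For readers following the discussion, a schematic of the two variants being contrasted (toy shapes only, not EVA's actual modules):

import torch

B, N, D, H = 2, 196, 768, 12
tokens = torch.randn(B, N, D)

# Absolute position embedding: one learned vector per position, added to the tokens once.
abs_pos_emb = torch.zeros(1, N, D)
tokens = tokens + abs_pos_emb

# Relative position bias: a learned per-head (N, N) bias added to the attention logits instead,
# so the model encodes offsets between tokens rather than absolute locations.
rel_pos_bias = torch.zeros(H, N, N)
attn_logits = torch.randn(B, H, N, N) + rel_pos_bias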

About the pre-training models

Hello, could you please post the pre-training models, e.g., ViT-Base/16 or ViT-Large/16?

Thanks

Error during fine-tuning EVA object detector

Hello,

I am fine-tuning EVA on my custom dataset. I ran into the following error (it also happens when fine-tuning on COCO):

File "train.py", line 187, in <module>
  main(args)
File "train.py", line 169, in main
  trainer.train(0, cfg.train.max_iter)
File "/home/appuser/eva_repo/det/detectron2/engine/train_loop.py", line 149, in train
  self.run_step()
File "/home/appuser/eva_repo/det/detectron2/engine/train_loop.py", line 421, in run_step
  self.grad_scaler.scale(losses).backward()
File "/home/appuser/.local/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
  torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/appuser/.local/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
  allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
File "/home/appuser/.local/lib/python3.7/site-packages/torch/autograd/function.py", line 87, in apply
  return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
File "/home/appuser/.local/lib/python3.7/site-packages/fairscale/nn/checkpoint/checkpoint_activations.py", line 331, in backward
  outputs = ctx.run_function(*unpacked_args, **unpacked_kwargs)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/vit.py", line 289, in forward
  x = self.attn(x)
File "/home/appuser/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
  return forward_call(*input, **kwargs)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/vit.py", line 139, in forward
  attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/utils.py", line 133, in add_decomposed_rel_pos
  Rh = get_rel_pos(q_h, k_h, rel_pos_h)
File "/home/appuser/eva_repo/det/detectron2/modeling/backbone/utils.py", line 100, in get_rel_pos
  z = rel_pos[:, i].view(src_size).cpu().float().numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.

The error is caused by this line:

z = rel_pos[:, i].view(src_size).cpu().float().numpy()

I fixed it by changing it to:

z = rel_pos[:, i].view(src_size).cpu().float().detach().numpy()

After this change, the training runs without issue and the loss decreases steadily.

But I am not sure that I understand the full implications of this change.
Calling .detach() means that gradients do not flow through this tensor. Or is that not an issue for this call? Did you not get this error during training?
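For reference, a minimal demonstration of what .detach() does (generic PyTorch, not EVA code):

import torch

x = torch.randn(3, requires_grad=True)
# x.numpy()     # RuntimeError: can't call numpy() on a tensor that requires grad
y = x.detach()  # same storage, but cut out of the autograd graph
z = y.numpy()   # fine; z shares memory with x

# The caveats: no gradient flows back through y or z, and in-place edits to z would
# silently modify x's data -- both worth checking when detaching inside forward().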

I am running EVA inside Docker with CUDA 11.1, Python 3.7, torch 1.9.0, torchvision 0.10.0, and mmcv-full 1.6.1, but I doubt this is a versioning issue.

Distillation of large ViT

Hello,

In your opinion, what is the best way to distill a large vision transformer (e.g., ViT-g) into a small one (e.g., ViT-B)?

There seem to be many alternatives: MIM as in EVA, a distillation token as in DeiT, more classical response-based or feature-based methods, etc.

Thanks,
Simon

CUDA out of memory

Thanks for your excellent work. When I train on custom datasets and reduce the total batch size to 8 (one node, 8 A100s), the error still happens. I don't know why.

Inference with EVA

Can anyone please show me how to perform inference (which script or command to use) on a single image or multiple images after training the EVA model on my custom datasets?

bug: detectron2 build fails after cleanup commit

Hello,

The recent cleanup commit (73b3708) causes the detectron2 build to fail.

Steps to reproduce:

cd /path/to/EVA/det
python -m pip install -e .

Result:

Obtaining file:///home/appuser/eva_repo/det
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      running egg_info
      creating /tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info
      writing /tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info/PKG-INFO
      writing dependency_links to /tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info/dependency_links.txt
      writing requirements to /tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info/requires.txt
      writing top-level names to /tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info/top_level.txt
      writing manifest file '/tmp/pip-pip-egg-info-6d_v4u3e/detectron2.egg-info/SOURCES.txt'
      error: package directory 'projects/PointRend/point_rend' does not exist
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Expected result:
detectron2 is installed without error

Proposed fix:
Change from

EVA/det/setup.py, lines 141 to 145 (at 56ee4e7):

PROJECTS = {
    "detectron2.projects.point_rend": "projects/PointRend/point_rend",
    "detectron2.projects.deeplab": "projects/DeepLab/deeplab",
    "detectron2.projects.panoptic_deeplab": "projects/Panoptic-DeepLab/panoptic_deeplab",
}

To

PROJECTS = {}

There is no dependency on these other projects, so the references to them can be safely removed.

Performance with MAE style pretraining

Hi, I noticed that during EVA pretraining there are two settings: MAE style and BEiT style. I am wondering about the performance of the MAE style; is there any comparison between these two styles of pretraining?
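To make the distinction concrete, a schematic of the two masking styles (toy tensors; not the repository's actual data pipeline):

import torch

B, N, D = 2, 196, 768
patches = torch.randn(B, N, D)
mask = torch.rand(N) < 0.4   # True = masked patch (shared across the batch here for simplicity)
mask_token = torch.zeros(D)

# BEiT style: the encoder sees the full-length sequence; masked patches become a mask token.
x_beit = torch.where(mask[None, :, None], mask_token, patches)  # (B, N, D)

# MAE style: masked patches are dropped; the encoder only sees the visible patches.
x_mae = patches[:, ~mask]                                       # (B, N_visible, D)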

About the FPS

Thank you for your great work! By the way, I want to ask about your FPS, because I cannot find it in the paper. I got 0.7 s per task on my own A5000 (one card). That seems slower than other methods; or did I forget to change some config?

Looking forward to your reply! @Yuxin-CV

Typos in EVA/EVA-02/asuka/README.md

You always give the pretraining scripts as "python -m torch.distributed.launch --nproc_per_node=8 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=${MASTER_ADDR} --master_port=12345 --use_env run_beit_pretraining.py". I believe you mean run_eva_pretraining.py?

About the EVA-02 paper on arXiv

I sincerely suggest changing every mention of the title (EVA-02) in this paper back to black. So many words in red are extremely distracting and hurt the overall reading experience.

How long does it take to train EVA-02?

Great job on your solid work! I am truly impressed! :)

Although I could find GPU-time statistics in the EVA-01 paper, I couldn't locate the pre-training compute used for EVA-02.

The usage of BEiT_win

Hi, it is very kind of you to provide BEiT with window attention. Did you try this model in segmentation for more memory-efficient usage, and how is the accuracy? In addition, you seem to drop the cls embedding and change the number of relative position biases from ((2Wh-1) * (2Ww-1) + 3) to ((2Wh-1) * (2Ww-1)). So the relative position bias will be initialized randomly rather than loaded from the pre-trained weights?
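For concreteness, the two table sizes being discussed (following BEiT's convention, where the three extra entries handle cls-to-token, token-to-cls, and cls-to-cls attention):

# Relative position bias table sizes for a window of (Wh, Ww) patches:
Wh = Ww = 14
num_with_cls = (2 * Wh - 1) * (2 * Ww - 1) + 3  # BEiT: +3 entries for the cls token
num_without_cls = (2 * Wh - 1) * (2 * Ww - 1)   # windowed variant without the cls embedding
print(num_with_cls, num_without_cls)            # 732 729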
