delvm's Introduction

Training general-purpose vision models on purely sequential visual data, without linguistic inputs, has opened a new frontier in visual understanding. These models are intended not only to comprehend visual input but also to transfer seamlessly to out-of-domain tasks. However, current efforts are hamstrung by their reliance on colossal models, with upwards of 3B parameters, and on an extensive corpus of visual data, often comprising as many as 400B tokens. In this paper, we develop an efficient autoregression-based vision model designed to operate on a limited dataset. We demonstrate that this model achieves proficiency at test time across a spectrum of visual tasks spanning both high-level and low-level semantic understanding. Our empirical evaluations show that the model adapts readily to various tasks while significantly reducing both the parameter footprint and the training data requirements, paving the way for more sustainable and accessible advances in generalist vision models.

TODO List

  • Training code.
  • Inference code.
  • Hugging Face and InternLM checkpoints.
  • Data generation code.

Set up

This repo is based on InternLM-v0.2.1dev20231121.

Install: https://github.com/InternLM/InternLM/blob/v0.2.1dev20231121/doc/en/install.md

Put your training data in /path/to/data/vision.

Training command: torchrun --nproc_per_node 8 train.py --config ./configs/pretrain_300m.py --launcher torch

Training command with knowledge distillation (KD): torchrun --nproc_per_node 8 train.py --config ./configs/kd_1b_to_300m.py --launcher torch
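
The two commands differ only in the config file, so they can be assembled from one small helper. The sketch below is illustrative only: build_torchrun_command is a hypothetical name, not part of this repo, and it only reproduces the flags shown above. Whether the configs work with a different --nproc_per_node value is an assumption that should be checked against the config files.

```python
import shlex


def build_torchrun_command(config, nproc_per_node=8, launcher="torch"):
    """Return the torchrun argument list for train.py (hypothetical helper).

    Only the flags that appear in the commands above are used.
    """
    return [
        "torchrun",
        "--nproc_per_node", str(nproc_per_node),
        "train.py",
        "--config", config,
        "--launcher", launcher,
    ]


if __name__ == "__main__":
    # Print the plain pretraining command and the KD command.
    for cfg in ("./configs/pretrain_300m.py", "./configs/kd_1b_to_300m.py"):
        print(shlex.join(build_torchrun_command(cfg)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues if the config path contains spaces.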

Model conversion and inference examples: see ./tools

The corresponding Hugging Face checkpoints can be downloaded from LLaMA-1b-hf (OneDrive), LLaMA-1b-hf (Baidu Disk), and LLaMA-300m-hf.

Data generation

Please refer to data_generation/README.md.

Citation

If you find this project useful in your research, please consider citing:

@article{guo2024dataefficient,
  title={Data-efficient Large Vision Models through Sequential Autoregression},
  author={Guo, Jianyuan and Hao, Zhiwei and Wang, Chengcheng and Tang, Yehui and Wu, Han and Hu, Han and Han, Kai and Xu, Chang},
  journal={arXiv preprint arXiv:2402.04841},
  year={2024}
}

Acknowledgement

We mainly follow the direction of the LVM project. This repo is based on InternLM, huggingface.co/transformers, and huggingface.co/openMUSE.

License

License: MIT

delvm's People

Contributors

ggjy, lose4578

delvm's Issues

Problem when running the demo.ipynb

Hi,

Thank you for your great work. I am facing the same problem as #2 (comment). Using seg_1.png gives a black result, and my inputid / output.sequence is the same as [Sutongtong233]'s. When I take a screenshot of seg_1.png and use that instead, it works reasonably well. The other images in the data folder work fine as well.

I am using PyTorch 1.13.1 and transformers 4.38.1. Could you give any advice on how to debug this, or any insight into what could cause this strange issue? Thank you, and I look forward to your reply.

The performance on more tasks

Your profile photo looks just like you! Niubility! I have been waiting a long time for LVM to release its code.

This work performs very well on segmentation, pose estimation, and deraining. Did you test on more tasks, especially 3D tasks such as depth estimation, on which the original LVM performs well? In other words, I'm very curious about the multi-task capability of an LVM model trained with less data. Could you show more experimental results?

Muse codes and pre-trained model

Hi, thanks for the great work.

The paper claims that you use an off-the-shelf VQGAN from Muse.

Could you kindly share which specific Muse code and pretrained model you used for this project?

Additionally, it would be really helpful to know where these resources can be found.

Thanks again!

Code about foreground segmentation

Hi, thank you for releasing your great work!
I want to know the details of post-processing in foreground segmentation.
Could you provide some code for it, e.g., code showing how to run inference on PASCAL-5i?
Looking forward to your reply.

GPU

How many A800 80G GPUs were used for training?

Evaluation metric

Hi

When you evaluate the model on the image segmentation task, what post-processing do you use to align the predictions with categories before calculating the accuracy? Could you also please provide the evaluation code used to calculate the ACC?

All the best,

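Request for review comments

Congratulations to the authors on this nice work!
Hello, was this work submitted to ICML 2024? Has it been accepted? Could you share the review comments?
Thank you, and I look forward to your reply.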

Problem when running demo.ipynb (blank result)

Hi, I am interested in this great work! I have a problem when running the demo. I use LLaMA-1b-hf and vqgan-f16-8192-laion (as shown in data_preparition). The generated_img result is blank.

I find that generated_img.max() = 0.07. Is there any mistake?
Thanks
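
To put the reported maximum in context, here is a minimal pure-Python sketch of how a float image is mapped to 8-bit pixel values. It assumes generated_img holds floats in [0, 1]; that range is an assumption, not something the repo confirms, and to_uint8 is a hypothetical helper rather than code from demo.ipynb.

```python
def to_uint8(values, lo, hi):
    """Map floats from the assumed range [lo, hi] to integers in 0..255.

    Hypothetical helper for illustration; values outside the range are clamped.
    """
    scale = 255.0 / (hi - lo)
    return [max(0, min(255, round((v - lo) * scale))) for v in values]


if __name__ == "__main__":
    # 0.07 is the maximum reported in the issue; 1.0 is shown for comparison.
    print(to_uint8([0.0, 0.07, 1.0], lo=0.0, hi=1.0))
```

Under that assumption, a maximum of 0.07 maps to a brightest pixel of about 18 out of 255, which displays as almost black. If the output were instead expected in [-1, 1], the same call with lo=-1.0 and hi=1.0 would apply, so checking generated_img.min() alongside generated_img.max() helps tell a wrong scaling step apart from a genuinely dark prediction.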
