
vary's Introduction

Haoran Wei*, Lingyu Kong*, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Release

  • [2024/5/24] 🔥🔥🔥 We propose Fox, a multi-page document understanding work that supports 8-page PDF-image input!
  • [2024/4/21] 🔥🔥🔥 For OneChart, we have released the web demo on the project page. Have fun!
  • [2024/4/21] 🔥🔥🔥 We present a Vary-tiny LAVIS codebase (for training from scratch) and the Vary-600k dataset (300K English and 300K Chinese pages) here!
  • [2024/4/15] 🔥🔥🔥 We release a chart-parsing model, OneChart, here.
  • [2024/4/12] 🔥🔥🔥 We will release a chart-parsing model based on Vary-tiny next week. The model supports both English and Chinese charts.
  • [2024/3/16] 🔥🔥🔥 Many friends are interested in Vary-tiny (OPT-125M), so I have open-sourced it here as a PDF dense-OCR and object-detection version.
  • [2024/1/23] 🔥🔥🔥 We release Vary-toy here. We also showcase the strong Vary-family results here.
  • [2023/12/29] 🔥🔥🔥 We will release a new model (a small-size Vary, about 2B) at the beginning of next month and introduce a new feature (object detection). Our online demo will be temporarily closed to prepare for the deployment of the new model.
  • [2023/12/11] We released the online demo, have fun!
  • [2023/12/11] We released the codes of Vary (train and inference)!

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. Their use is also restricted to purposes that follow the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Install

  1. Clone this repository and navigate to the Vary folder
git clone https://github.com/Ucas-HaoranWei/Vary.git
cd Vary
  2. Install Package
conda create -n vary python=3.10 -y
conda activate vary
pip install -e .
  3. Install Flash-Attention
pip install ninja
pip install flash-attn --no-build-isolation
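
After installation, a quick sanity check confirms that PyTorch sees the GPU and that flash-attn built against your CUDA toolkit (a minimal sketch; it only assumes the packages installed above):

import torch
import flash_attn  # raises ImportError here if flash-attn failed to build

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("flash-attn", flash_attn.__version__)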

Vary Weights

  • If you urgently need the weights for your research, please contact me by email.
  • Download the CLIP-ViT-L weights from Hugging Face.

Demo

  1. Update the CLIP-ViT path in the code (/cache/vit-large-patch14/) to your local path.

python vary/demo/run_qwen_vary.py  --model-name  /vary/model/path/ --image-file /an/image/file.png
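
To run the demo over a folder of images, one option is a small wrapper that invokes the same script per image (a sketch; the glob pattern and model path are placeholders to adjust):

import glob
import subprocess

# Call the repo's demo script once per image; both paths below are placeholders.
for image_path in sorted(glob.glob("/path/to/images/*.png")):
    subprocess.run(
        [
            "python", "vary/demo/run_qwen_vary.py",
            "--model-name", "/vary/model/path/",
            "--image-file", image_path,
        ],
        check=True,
    )

Note that this reloads the model for every image; for large batches it is faster to adapt eval_model() in run_qwen_vary.py to loop over files after loading once.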

Train

  • We currently do not plan to open-source the intermediate model weights.
  • However, we release the training code, so you can train on your own dataset. To do so, try the following:
  1. For Vary-base (single machine; if you have multiple machines, you need to prepare a hostfile)
deepspeed   Vary/train/train_qwen_vary.py  --deepspeed /Vary/zero_config/zero2.json
            --model_name_or_path /Qwen-7B/path/
            --vision_tower /vit-large-patch14/path/
            --freeze_vision_tower True
            --freeze_lm_model False
            --vision_select_layer  -2
            --use_im_start_end True
            --bf16 True
            --per_device_eval_batch_size 4
            --gradient_accumulation_steps 1
            --evaluation_strategy "no"
            --save_strategy "steps"
            --save_steps 5000
            --save_total_limit 1
            --weight_decay 0.
            --warmup_ratio 0.03
            --lr_scheduler_type "cosine"
            --logging_steps 1 --tf32 True
            --model_max_length 4096
            --gradient_checkpointing True
            --dataloader_num_workers 4
            --report_to none
            --per_device_train_batch_size 4
            --num_train_epochs 1
            --learning_rate 5e-5
            --datasets  data_name1+data_name2+data_name3
            --output_dir /path/to/output/
  2. For Vary-tiny
deepspeed   Vary/train/train_opt.py  --deepspeed /Vary/zero_config/zero2.json
            --model_name_or_path /opt125m/path/
            --conversation_version opt
            --freeze_vision_tower False
            --freeze_lm_model False
            --use_im_start_end True
            --bf16 True
            --per_device_eval_batch_size 4
            --gradient_accumulation_steps 1
            --evaluation_strategy "no"
            --save_strategy "steps"
            --save_steps 5000
            --save_total_limit 1
            --weight_decay 0.
            --warmup_ratio 0.03
            --lr_scheduler_type "cosine"
            --logging_steps 1 --tf32 True
            --model_max_length 4096
            --gradient_checkpointing True
            --dataloader_num_workers 4
            --report_to none
            --per_device_train_batch_size 16
            --num_train_epochs 1
            --learning_rate 5e-5
            --datasets  data_name1+data_name2+data_name3
            --output_dir /path/to/output/
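
For reference, the global batch size implied by these commands is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; a quick calculation (the GPU count is an assumption, adjust it to your machine):

# Effective global batch size for the two commands above (GPU count is an assumption).
num_gpus = 8      # e.g. one node with 8 GPUs
grad_accum = 1    # --gradient_accumulation_steps
for name, per_device in [("Vary-base", 4), ("Vary-tiny", 16)]:
    print(name, per_device * grad_accum * num_gpus)  # 32 and 128 respectively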

Contact

If you have any questions related to the code or the paper, feel free to email ([email protected]).

Acknowledgement

  • LLaVA: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

If you find our work useful in your research, please consider citing Vary:

@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

@article{wei2024small,
  title={Small Language Model Meets with Reinforced Vision Vocabulary},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yu, En and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2401.12503},
  year={2024}
}

vary's Issues

Question about the training data

The paper says tables are also used in training. How is the data organized when training on tables?
The ground truth for chart data appears to be a Python dict in the format below; how are tables represented then, and how is the table layout (rows/columns/merged cells, etc.) preserved?
[
  {
    "image": "xxxx",
    "conversations": [
      {
        "from": "human",
        "value": "question/prompts"
      },
      {
        "from": "gpt",
        "value": {"title": "图:主粮及医疗药品是主要消费内容(2021)", "data": {"服务": "6.40%", "用品": "12.80%", "药品及医疗": "29.20%", "营养品": "1.80%", "零食": "13.90%", "主粮": "35.80%"}}
      }
    ]
  },
  ......
]
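
For illustration only (this is not the authors' released format): one way to fit a table into the same conversation schema sketched above is to serialize the table, layout included, as markdown or LaTeX text in the gpt value:

import json

# A hypothetical sample in the conversation format above; the markdown table keeps
# row/column structure, but the official training data may encode tables differently.
sample = {
    "image": "table_0001.png",
    "conversations": [
        {"from": "human", "value": "Convert the table in the image to markdown."},
        {"from": "gpt", "value": "| 项目 | 占比 |\n| --- | --- |\n| 主粮 | 35.80% |\n| 药品及医疗 | 29.20% |"},
    ],
}
print(json.dumps(sample, ensure_ascii=False, indent=2))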

Will the OCR test set be released?

Thanks for your solid work.

I plan to follow your work, so I wonder whether the document-level OCR test set (with the evaluation scripts) will be released.

Thanks.

Question about the text tokenizer

Great work! Is the text tokenizer you use the one paired with the LLM that later fuses the text and visual tokens?

How to implement conversational interaction

run_qwen_vary.py loads the model once and produces only a single output. How can it be turned into a question-answer (multi-turn) loop? Does it need extra dependencies, or is there an existing example?
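
A generic way to get a dialogue loop (a sketch, not code from the repo): load the model once, wrap the generation logic from run_qwen_vary.py in a helper, and call it repeatedly while keeping the history yourself:

# Hypothetical REPL around a user-supplied answer() helper that wraps the model.generate
# call from run_qwen_vary.py; multi-turn quality still depends on how the prompt template
# concatenates the history.
def chat_loop(answer):
    history = []
    while True:
        question = input("user> ").strip()
        if question in ("", "exit", "quit"):
            break
        reply = answer(question, history)
        history.append((question, reply))
        print("vary>", reply)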

Not able to run on multiple GPUs

Hi, I don't have a single GPU that can hold the model, so I am trying to distribute it across 2 GPUs (2 x RTX 3090 24 GB), but I got the following error:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.90s/it]
2 GPUs are available. Using DataParallel.
Traceback (most recent call last):
File "/works/ksjds/Vary/Vary-master/vary/demo/run_qwen_vary.py", line 130, in
eval_model(args)
File "/works/ksjds/Vary/Vary-master/vary/demo/run_qwen_vary.py", line 99, in eval_model
output_ids = model.generate(
File "/home/ksjds/miniconda3/envs/vary/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'DataParallel' object has no attribute 'generate'

Could you pls help?
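
A generic PyTorch note rather than a repo-specific fix: nn.DataParallel only forwards __call__, so generate() has to be called on the wrapped module:

import torch

def generate_unwrapped(model, *args, **kwargs):
    # DataParallel / DistributedDataParallel hide the underlying model behind .module.
    if isinstance(model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)):
        model = model.module
    return model.generate(*args, **kwargs)

Note that DataParallel replicates the whole model on each GPU, so it does not reduce per-GPU memory; for a model that does not fit on one card, sharding at load time with device_map="auto" (via accelerate) is the more usual route.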

Error when running run_qwen_vary.py, please advise!

I git-cloned the clip-vit-large-patch14 model from Hugging Face into /cache/vit-large-patch14/. Then the command below fails. Is something wrong with --model-name, or somewhere else?

Command:
/Vary/Vary-master/vary# python ./demo/run_qwen_vary.py --model-name /cache/vit-large-patch14/ --image-file /mnt/e/ocr.png

Error:
(vary) root@90bb63a226b2:/Vary/Vary-master/vary# python ./demo/run_qwen_vary.py --model-name /cache/vit-large-patch14/ --image-file /mnt/e/ocr.png
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type clip to instantiate a model of type vary. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/Vary/Vary-master/vary/./demo/run_qwen_vary.py", line 127, in
eval_model(args)
File "/Vary/Vary-master/vary/./demo/run_qwen_vary.py", line 43, in eval_model
model = varyQwenForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, device_map='cuda', trust_remote_code=True)
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2876, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/vary_qwen_vary.py", line 238, in init
self.transformer = varyQwenModel(config)
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/vary_qwen_vary.py", line 46, in init
super(varyQwenModel, self).init(config)
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/llm/qwen/modeling_qwen.py", line 496, in init
[
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/llm/qwen/modeling_qwen.py", line 497, in
QWenBlock(
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/llm/qwen/modeling_qwen.py", line 393, in init
self.attn = QWenAttention(config)
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/vary/model/llm/qwen/modeling_qwen.py", line 120, in init
self.seq_length = config.seq_length
File "/root/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in getattribute
return super().getattribute(key)
AttributeError: 'varyConfig' object has no attribute 'seq_length'. Did you mean: 'max_length'?
(vary) root@90bb63a226b2:/Vary/Vary-master/vary#
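
The "model of type clip to instantiate a model of type vary" warning suggests --model-name points at the CLIP checkpoint rather than the Vary weights. A small pre-flight check along these lines can catch that (a sketch; both paths are placeholders):

import json
from pathlib import Path

def model_type(model_dir):
    # Read the model_type declared in the checkpoint's config.json.
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f).get("model_type")

print(model_type("/cache/vit-large-patch14/"))  # "clip" -> this is the vision tower, not --model-name
print(model_type("/vary/model/path/"))          # should point at the downloaded Vary checkpoint instead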

About the training data format

In the training command, --datasets data_name1+data_name2+data_name3 — what format of data should each data_name refer to?
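
Conceptually, each name passed to --datasets has to resolve to an image folder plus a conversation-style annotation file; a purely illustrative registry (names and paths are made up, check the dataset definitions in the repo for the real mapping) might look like:

# Hypothetical mapping from the names used in --datasets data_name1+data_name2+... to data locations.
DATASETS = {
    "data_name1": {
        "images": "/path/to/data1/images/",
        "annotations": "/path/to/data1/conversations.json",
    },
    "data_name2": {
        "images": "/path/to/data2/images/",
        "annotations": "/path/to/data2/conversations.json",
    },
}

def resolve(datasets_arg):
    # "--datasets a+b+c" style string -> list of registry entries.
    return [DATASETS[name] for name in datasets_arg.split("+")]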

Keeps generating duplicate content

I deploy the model on an A100 with:

output_ids = model.generate(
            input_ids,
            images=[(image_tensor.unsqueeze(0).half().cuda(), image_tensor_1.unsqueeze(0).half().cuda())],
            do_sample=True,
            num_beams = 1,
            temperature=0.1,
            streamer=streamer,
            max_new_tokens=2048,
            repetition_penalty=1.05,
            stopping_criteria=[stopping_criteria]
            )

It always generates duplicate content; I have tried different generation configs, but it did not help.
By the way, when I use the following code to load the model:

disable_torch_init()
# model_name = os.path.expanduser(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = varyQwenForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, device_map='cuda', trust_remote_code=True)


model.to(device='cuda',  dtype=torch.bfloat16)

image_processor = CLIPImageProcessor.from_pretrained(clip_model, torch_dtype=torch.float16)

I get the following warning:

/opt/anaconda3/envs/qiu_chatglm3/lib/python3.10/site-packages/torch/nn/modules/module.py:2025: UserWarning: for vision_model.encoder.layers.23.self_attn.v_proj.bias: copying from a non-meta parameter in the checkpoint to a meta parameter in the current model, which is a no-op. (Did you mean
to pass assign=True to assign items in the state dictionary to their corresponding key in the module instead of copying them in place?)

Is there any problem with this warning?
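
As a general Hugging Face generation note (not a repo-specific fix), repetition like this is often reduced by greedy decoding with an n-gram repetition block or a slightly stronger repetition penalty; the values below are starting points, not tuned settings:

# Generation settings that commonly reduce verbatim repetition.
anti_repeat_kwargs = dict(
    do_sample=False,          # greedy decoding instead of temperature sampling
    num_beams=1,
    no_repeat_ngram_size=6,   # block exact 6-gram repeats
    repetition_penalty=1.2,   # slightly stronger than the 1.05 used above
    max_new_tokens=2048,
)
# output_ids = model.generate(input_ids, images=..., **anti_repeat_kwargs)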

Some questions about the Vary-tiny and Vary-base training procedures in the paper

Thank you for sharing such great work. I am very interested in the paper and would like to ask a few questions about how Vary-tiny and Vary-base are trained.

  1. Vary-tiny
    For Vary-tiny, the paper uses three kinds of data: documents, charts, and natural images. During training, natural images predict "It’s an image of nature", document images predict all of the text in the document, and chart images predict the corresponding Python dict. Is this understanding correct?

  2. Vary-base
    Vary-base training consists of a pretraining stage and a fine-tuning stage. Pretraining uses image-text pairs from the LAION-COCO dataset. The fine-tuning stage is said to use LLaVA-80k; are the document and chart data rendered in Sections 3.2.2 and 3.3.2 also part of fine-tuning, or is the model first fine-tuned with LLaVA-80k and then further fine-tuned on the rendered data for specific tasks?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.97 GiB. GPU 2 has a total capacty of 39.43 GiB of which 982.31 MiB is free. Including non-PyTorch memory, this process has 38.47 GiB memory in use. Of the allocated memory 28.76 GiB is allocated by PyTorch, and 7.31 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Hello, I run into a CUDA out of memory error when running deepspeed vary/train/train_qwen_vary.py. After adjusting some parameters I still cannot solve it. Here are my settings:

deepspeed vary/train/train_qwen_vary.py
--deepspeed /home/mayilin/workspace/code/Vary/Vary-master/zero_config/zero2.json
--model_name_or_path /data01/mayilin_data/llava/vary-llava80k
--vision_tower /home/mayilin/workspace/code/LLaVA/openai/clip-vit-large-patch14-336
--freeze_vision_tower True
--freeze_lm_model False
--vision_select_layer -2
--use_im_start_end True
--bf16 True
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 5000
--save_total_limit 1
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1 --tf32 True
--model_max_length 4096
--gradient_checkpointing True
--dataloader_num_workers 4
--report_to none
--num_train_epochs 1
--learning_rate 5e-5
--datasets llava_merge_mix665k_latex_wollavapre+llava_v1_5_mix665k_14669
--output_dir /data01/mayilin_data/vary_checkpoints/ \

How should this problem be solved?
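
One of the error message's own suggestions, max_split_size_mb, can be tried before larger changes such as a ZeRO-3 config or a shorter model_max_length (the value below is only a starting point):

import os

# Must be set before CUDA is initialized, i.e. before the model is constructed.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

When launching through the deepspeed CLI, the equivalent is exporting PYTORCH_CUDA_ALLOC_CONF in the shell before running the command.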

Questions about training

Thank you very much for sharing such a nice approach. After reading the paper and the code, I have a few questions and hope the authors can help clarify them. Thanks a lot.

The paper says training has two stages: stage one trains Vary-tiny to obtain the new vocabulary, and stage two trains Vary-base. I am currently hitting OOM when training with the official Qwen-7B on a 40 GB A100 (not yet resolved; probably needs some parameter tuning), so I would like to ask which of the following understandings is correct:

  1. From the paper, it seems Vary-tiny has to be trained first and then Vary-base. But judging from the training commands, the two seem unrelated: the Vary-base command only depends on CLIP-ViT and Qwen-7B and does not take any output from Vary-tiny?

  2. From the code, train_qwen_vary seems to already include the train_opt logic (not entirely sure), so to train Vary-base one can just prepare the dataset and run the Vary-base command directly, without first running the Vary-tiny command?

Question about the evaluation

Thanks for your solid work.
I would like to ask if your model is individually fine-tuned on each specific dataset?
As far as I know, models like UReader and DocPedia are evaluated across various datasets using a single model.

Training loss is 0

The training loss is 0. After investigation, the cause is that the pretrained weights of vision_tower and vision_tower_high were not loaded, so the image_features coming out of the vision tower are all NaN.
However, reloading vision_tower and vision_tower_high in initialize_vision_modules causes the GPU memory to exceed the limit (OOM). The device is an 80 GB A100.
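
A generic sanity check (not from the repo) that can be placed right after the vision towers run, to confirm the loaded weights produce finite features before any loss is computed:

import torch

def assert_finite(name, features):
    # NaN/Inf features here usually mean the pretrained vision-tower weights were not loaded.
    if not torch.isfinite(features).all():
        raise RuntimeError(f"{name} produced non-finite values; check the pretrained weight loading")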

torch.cuda.OutOfMemoryError when running the demo

1. Downloaded the required packages following the demo instructions.
2. Contacted the authors by email and downloaded the Vary weights to /cache/vit-large-patch14/ and /home/itouchtv/zhao/Vary/Vary-master/vary/model/path/vary-llava80k/.
3. Ran the demo: python vary/demo/run_qwen_vary.py --model-name /home/itouchtv/zhao/Vary/Vary-master/vary/model/path/vary-llava80k/ --image-file OCR_PIC.png
The error is below. What is the cause?
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type mmgpt to instantiate a model of type vary. This is not supported for all configurations of models and can yield errors.
You are using a model of type mmgpt to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.44s/it]
Some weights of CLIPVisionModel were not initialized from the model checkpoint at /cache/vit-large-patch14/ and are newly initialized: ['vision_model.encoder.layers.10.mlp.fc2.bias', 'vision_model.encoder.layers.14.mlp.fc1.bias', 'vision_model.encoder.layers.24.layer_norm1.bias', ... (long list of vision_model.* parameter names truncated)]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading checkpoint shards: 0%| | 0/2 [00:11<?, ?it/s]
Traceback (most recent call last):
File "/home/itouchtv/zhao/Vary/Vary-master/vary/demo/run_qwen_vary.py", line 127, in
eval_model(args)
File "/home/itouchtv/zhao/Vary/Vary-master/vary/demo/run_qwen_vary.py", line 43, in eval_model
model = varyQwenForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, device_map='cuda', trust_remote_code=True)
File "/home/itouchtv/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3091, in from_pretrained
) = cls._load_pretrained_model(
File "/home/itouchtv/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3471, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/itouchtv/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/modeling_utils.py", line 736, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/home/itouchtv/anaconda3/envs/vary/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 317, in set_module_tensor_to_device
new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 7.78 GiB of which 53.19 MiB is free. Process 2063158 has 1.38 GiB memory in use. Process 1027252 has 94.00 MiB memory in use. Process 1027451 has 94.00 MiB memory in use. Process 1027562 has 94.00 MiB memory in use. Including non-PyTorch memory, this process has 6.07 GiB memory in use. Of the allocated memory 5.58 GiB is allocated by PyTorch, and 2.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
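
For scale, a back-of-the-envelope estimate (weights only, ignoring activations and the vision towers): a roughly 7B-parameter Qwen-based model in fp16/bf16 already needs about 13 GiB, more than the 7.78 GiB card in this report:

params = 7e9         # assumption: ~7B parameters for the Qwen-7B-based Vary
bytes_per_param = 2  # fp16 / bf16
print(params * bytes_per_param / 2**30)  # ≈ 13 GiB for the weights alone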

About the model's OCR output

I used two different prompts on the same image, but the inference results are identical:

1.Convert the image to markdown/latex format.
2.Provide the OCR results of this image.

The model did not output markdown and plain OCR results respectively, as in the paper's examples. Where did it go wrong? Is the model unable to output results with bounding boxes?

Is multi-GPU inference supported?

The code has device_map = "auto", but in practice only one GPU is used. How should it be modified to load the model across multiple GPUs for inference?

Special character recognition

For special characters, is it still necessary to extend the vocabulary and fine-tune?

About the OPT call

Hello, while debugging I found that the model in vary_opt.py does not seem to call OPT to decode and generate text?

Error during Package Install

INFO: pip is looking at multiple versions of vary to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install vary==0.1.0 because these package versions have conflicting dependencies.

The conflict is caused by:
vary 0.1.0 depends on accelerate==0.24.1
vary 0.1.0 depends on accelerate==0.21.0

Some questions about model training and data

  1. Is the generated data mentioned in Section 3.3.2 of the paper also used for pretraining?
  2. Will the document and chart data generated in Section 3.3.2 be open-sourced later?
  3. Will the model weights be uploaded to Hugging Face or Google Drive? Downloading from Baidu Cloud is too slow 🤕

Local inference often produces repeated fields

[image]
Result:
英语(RJ) 年级(下册) 年级(上册) 年级(下册) 年级(下册) 年级(上册) 年级(下册) …(the same fields repeat many more times)

Another example:
“我倒想要知道,我们之中谁会走得最远!”最小的一 粒豌豆说。 “是的,事情马上就要揭晓了。”最大的那粒豌豆说。 啦!豆荚裂开来了。那五粒豌豆全都躺在一个孩子的 手中。这个孩子紧紧地捏着它们,说可以当作玩具使用。 八颗豌豆都是绿的, “现在我要飞到广阔的世界里去了!如果你能握住一粒 和豌豆一起坠落, 我就请你一起吧!”第一粒豌豆说完就飞走了。 这才爆开一粒 豆子,豆子飞到了外面,豆子飞到了外面,…(the phrase repeats many more times)

patch_embedding failed with RuntimeError: GET was unable to find an engine to execute this computation

Hi, could you please give some advice for this issue?

python vary/demo/run_qwen_vary.py --model-name /vary/model/path/ --image-file /an/image/file.png

My env:

Linux OS
Driver Version: 525.105.17   CUDA Version: 12.0
A40 GPU

Python env:

conda create -n vary python=3.10 -y
conda activate vary
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
File "/workspace/anaconda3/envs/vary/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 196, in forward
    patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
  File "/workspace/anaconda3/envs/vary/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/anaconda3/envs/vary/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/anaconda3/envs/vary/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/workspace/anaconda3/envs/vary/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation
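
"Unable to find an engine" from a cuDNN convolution is usually an environment mismatch (PyTorch build vs. driver/CUDA/cuDNN) rather than a code bug; a quick probe of the installed versions (generic, not repo-specific) helps narrow it down:

import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none visible")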
