Comments (19)
Could you try again? We just updated the code.
from qwen.
With the new code, GPU memory usage: 17150MiB / 32510MiB
from qwen.
The thing is, the official team hasn't released a quantized model either... I opened #18 separately and hope the team sees it...
from qwen.
I tried your demo and it won't run; it blows up GPU memory. Is quantization the only option? torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.85 GiB already allocated; 1.26 GiB free; 20.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Hi, the OOM is probably because the model loads in fp32 precision by default. Could you pull our latest code and load the model in fp16? Like this:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
from qwen.
> The thing is, the official team hasn't released a quantized model either... I opened #18 separately and hope the team sees it...
Wang Peng just addressed the precision issue; enabling fp16 is one option. Quantization is covered in the README; see the quantization section, you only need to add a quantization_config.
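For reference, a minimal 8-bit loading sketch using bitsandbytes (my assumption; the README's quantization section is authoritative for the exact configuration):

# Sketch: 8-bit quantized loading via bitsandbytes (requires `pip install bitsandbytes`).
# The exact arguments may differ from the README; treat this as illustrative only.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
).eval()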
from qwen.
I re-downloaded the model and it still doesn't work. Is quantization really the only way?...
Traceback (most recent call last):
File "/home/johnzyx/working/pythonprojects/LLaMA-Efficient-Tuning/src/zyx_QwenDemo.py", line 14, in
response, history = model.chat(tokenizer, "你好", history=None)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 905, in chat
outputs = self.generate(
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 951, in generate
return super().generate(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 1615, in generate
return self.sample(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 2737, in sample
outputs = self(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 842, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 160, in new_forward
args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/hooks.py", line 286, in pre_forward
set_module_tensor_to_device(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 298, in set_module_tensor_to_device
new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB (GPU 0; 23.65 GiB total capacity; 20.83 GiB already allocated; 1.18 GiB free; 20.85 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
from qwen.
Try adding this when initializing the model: torch_dtype=torch.float16
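Spelled out as a full loading call, that suggestion would look something like this (a sketch; torch_dtype is the generic transformers argument, as opposed to the model-specific fp16 flag above):

# Sketch: load the model in half precision via the standard transformers argument.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).eval()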
from qwen.
If I add the parameter fp16=True, as above:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
it also throws an error:
Warning: import flash_attn fail, please install FlashAttention https://github.com/Dao-AILab/flash-attention
Traceback (most recent call last):
File "/home/johnzyx/working/pythonprojects/LLaMA-Efficient-Tuning/src/zyx_QwenDemo.py", line 14, in
response, history = model.chat(tokenizer, "你好", history=None)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 905, in chat
outputs = self.generate(
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/modeling_qwen.py", line 951, in generate
return super().generate(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 1615, in generate
return self.sample(
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/utils.py", line 2750, in sample
next_token_scores = logits_processor(input_ids, next_token_logits)
File "/home/johnzyx/environment/anaconda-env/python3.10/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 97, in call
scores = processor(input_ids, scores)
File "/home/johnzyx/.cache/huggingface/modules/transformers_modules/Qwen-7B-Chat/qwen_generation_utils.py", line 349, in call
scores[i, self.eos_token_id] = float(230)
RuntimeError: value cannot be converted to type at::Half without overflow
from qwen.
Did you modify config.json? I can reproduce your error by modifying config.json. Download a fresh copy of the code from HF.
from qwen.
17250MiB / 32510MiB
from qwen.
Same error here. I already downloaded the latest config.json from huggingface,
and still get RuntimeError: value cannot be converted to type at::Half without overflow
from qwen.
Same error as above, also on 24G of VRAM. I've tried every suggested method; it's either OOM or overflow.
from qwen.
Here is my config.json,
please take a look:
{
"activation": "swiglu",
"apply_residual_connection_post_layernorm": false,
"architectures": [
"QWenLMHeadModel"
],
"auto_map": {
"AutoConfig": "configuration_qwen.QWenConfig",
"AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
},
"attn_pdrop": 0.0,
"bf16": false,
"bias_dropout_fusion": true,
"bos_token_id": 151643,
"embd_pdrop": 0.1,
"eos_token_id": 151643,
"ffn_hidden_size": 22016,
"fp16": false,
"initializer_range": 0.02,
"kv_channels": 128,
"layer_norm_epsilon": 1e-05,
"model_type": "qwen",
"n_embd": 4096,
"n_head": 32,
"n_layer": 32,
"n_positions": 6144,
"no_bias": true,
"onnx_safe": null,
"padded_vocab_size": 151936,
"params_dtype": "torch.bfloat16",
"pos_emb": "rotary",
"resid_pdrop": 0.1,
"rotary_emb_base": 10000,
"rotary_pct": 1.0,
"scale_attn_weights": true,
"seq_length": 2048,
"tie_word_embeddings": false,
"tokenizer_type": "QWenTokenizer",
"transformers_version": "4.31.0",
"use_cache": true,
"use_flash_attn": true,
"vocab_size": 151936,
"use_dynamic_ntk": false,
"use_logn_attn": false
}
from qwen.
Changing bf16 to true will probably make it run. I just tested: specifying torch_dtype=torch.float16 has no effect, the loaded weights are still bf16. Oddly, the V100 doesn't support bf16, so I'm not sure how it runs on my machine.
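To check whether a GPU has native bf16 support, a quick sanity check (assuming a reasonably recent PyTorch):

# Sketch: query native bf16 support; True on Ampere and newer (e.g. A100, 4090).
# Pre-Ampere cards like the V100 may still execute bf16 ops via slower software
# conversion, which could explain bf16 models running there at all.
import torch
print(torch.cuda.is_bf16_supported())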
from qwen.
> Did you modify config.json? I can reproduce your error by modifying config.json. Download a fresh copy of the code from HF.
Could you share your working config.json?
from qwen.
This model seems to only run in bf16. So either set fp16 to false and bf16 to true in the config and pass nothing extra at initialization, or set both to false and pass torch_dtype=torch.bfloat16 at initialization (see the sketch after the config below). I tried both approaches; both run, and memory usage stays under 20G either way.
{
"activation": "swiglu",
"apply_residual_connection_post_layernorm": false,
"architectures": [
"QWenLMHeadModel"
],
"auto_map": {
"AutoConfig": "configuration_qwen.QWenConfig",
"AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
},
"attn_pdrop": 0.0,
"bf16": true,
"bias_dropout_fusion": true,
"bos_token_id": 151643,
"embd_pdrop": 0.1,
"eos_token_id": 151643,
"ffn_hidden_size": 22016,
"fp16": false,
"initializer_range": 0.02,
"kv_channels": 128,
"layer_norm_epsilon": 1e-05,
"model_type": "qwen",
"n_embd": 4096,
"n_head": 32,
"n_layer": 32,
"n_positions": 6144,
"no_bias": true,
"onnx_safe": null,
"padded_vocab_size": 151936,
"params_dtype": "torch.bfloat16",
"pos_emb": "rotary",
"resid_pdrop": 0.1,
"rotary_emb_base": 10000,
"rotary_pct": 1.0,
"scale_attn_weights": true,
"seq_length": 2048,
"tie_word_embeddings": false,
"tokenizer_type": "QWenTokenizer",
"transformers_version": "4.31.0",
"use_cache": true,
"use_flash_attn": true,
"vocab_size": 151936,
"use_dynamic_ntk": false,
"use_logn_attn": false
}
This config should run.
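And the second approach, for a config with both "bf16": false and "fp16": false, would look something like this (a sketch):

# Sketch: both precision flags false in config.json, dtype passed at load time instead.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()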
from qwen.
Pulled the latest repo.
GPU: 4090 24G
use fp32: OOM
use fp16:
'''
scores[i, self.eos_token_id] = float(2**30)
RuntimeError: value cannot be converted to type at::Half without overflow
'''
use bf16: works fine, 17031MiB / 23.99GiB
from qwen.
Confirmed: only bf16=True or quantization works; fp32 is out.
from qwen.
> If I add the parameter fp16=True, as above ... RuntimeError: value cannot be converted to type at::Half without overflow
Thanks everyone for the feedback. This bug was caused by float(2**30) exceeding the fp16 range; the latest code fixes it. Please try again.
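The overflow is easy to reproduce in isolation, and clamping to the dtype's own maximum is one plausible shape for the fix (a sketch, not necessarily the actual patch):

# Sketch: reproduce the overflow, then a hypothetical fix.
# fp16's largest finite value is 65504, so assigning 2**30 into a fp16 tensor fails.
import torch

scores = torch.zeros(4, dtype=torch.float16)
try:
    scores[0] = float(2**30)
except RuntimeError as e:
    print(e)  # value cannot be converted to type at::Half without overflow

# Hypothetical fix: use the dtype's maximum instead of a hard-coded constant.
scores[0] = torch.finfo(scores.dtype).max  # 65504.0 for fp16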
from qwen.