Comments (11)
We found the root cause: chatglm uses skip_init to initialize the model, but the device defaulted to CPU. The subsequent quantization steps then treated the model as a weight-initialized model and allocated buffers for the linear layers, so chatglm ended up with two sets of linear weights, which explains the observation.
The reason it looks normal on Linux is that Linux appears to release the memory automatically, while Windows does not.
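A minimal sketch of the mechanism described above (illustrative only; the layer size is made up and this is not bigdl's actual code):

```python
import torch
from torch.nn.utils import skip_init

# skip_init constructs the module on the meta device, then transfers an
# *uninitialized* copy to the target device, which defaults to cpu.
layer = skip_init(torch.nn.Linear, 1024, 1024)

# The weights live on cpu and occupy real memory even though their values
# were never initialized. If quantization code later allocates fresh
# buffers for the same linear layers, both allocations coexist: two sets
# of weights, matching the doubled memory observed here.
print(layer.weight.device)  # cpu
```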
from bigdl.
Hi, this is the result on our machine (Windows 11 + Arc A770):
- CPU memory occupied before running bigdl: 10G
- While loading the chatglm3-6b int4 model from disk to CPU: peak memory 18G
- After putting the loaded int4 model on the XPU: back to 10G
Regarding "the CPU memory suddenly rises to 18 GB and then drops back to 11 GB", we want to ask:
- Is any memory already occupied by other applications beforehand?
- Is the drop back to 11G observed while loading to the CPU, or when putting the loaded model on the GPU?
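For reference, one way to take such memory readings is to sample the process's resident set size around each step. This is a sketch (Linux-only, reading /proc; on Windows a third-party package such as psutil would be used instead), and the loader call names are placeholders, not bigdl's API:

```python
import os

def current_rss_gb() -> float:
    """Current resident set size of this process in GB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1024 ** 3

# Hypothetical measurement points around the steps described above:
# base = current_rss_gb()              # before loading anything
# model = load_int4_model(path)        # placeholder for the real loader
# after_load = current_rss_gb()        # the ~18G peak would show up here
# model = model.to("xpu")              # move to the Arc GPU
# after_move = current_rss_gb()        # expect a drop back toward base
```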
The memory increase when loading the model is not reasonable; we are looking into this.
> The memory increase when loading the model is not reasonable; we are looking into this.
Thanks! The excessive CPU memory increase occurs with chatglm3-6B on Arc 750 / Windows 10, but the CPU memory increase is fine for baichuan2-7B. So it is due to the chatglm3-6B model.
> The memory increase when loading the model is not reasonable; we are looking into this.
> Thanks! The excessive CPU memory increase occurs with chatglm3-6B on Arc 750 / Windows 10, but the CPU memory increase is fine for baichuan2-7B. So it is due to the chatglm3-6B model.
Sure, we also observe that llama is fine while chatglm3 is abnormal (seemingly only on Windows; Ubuntu is fine). We are looking into this and will find the reason as soon as possible.
@KiwiHana please verify whether the latest build fixes your issue. I suppose it would be the 20240123 version.
> @KiwiHana please verify whether the latest build fixes your issue. I suppose it would be the 20240123 version.
OK, I will check the day after tomorrow, when the A750 Windows device comes back.
Hi @zhentaocc, 2.5.0b20240123 does not solve the problem. It also occurs on MTL iGPU.
bigdl-core-xe-21 2.5.0b20240123
bigdl-llm 2.5.0b20240123
intel-extension-for-pytorch 2.1.10+git8ff85d6 with oneapi 2024.0
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
finish to xpu
stream_chat-----input_ids: {'input_ids': tensor([[64790, 64792, 4155, 2488, 260, 622, 30932, 627, 13519, 260,
1332, 2689, 554, 7364, 289, 431, 15672, 30930, 1165, 2456,
289, 490, 289, 3727, 293, 1630, 623, 705, 30932, 293,
431, 817]], device='xpu:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]], device='xpu:0'), 'position_ids': tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
device='xpu:0')}
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\site-packages\transformers\generation\utils.py", line 1335, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError: Allocation is out of device memory on current platform.
I have tested with the 20240118 version and a loading method like:
model = AutoModel.load_low_bit(load_path, optimize_model=True, device="meta",
                               trust_remote_code=True, use_cache=True, cpu_embedding=cpu_embedding)
which works. But the 20240122 version does not work, so it seems other changes caused this issue.
0118 is confirmed to be good with the previous fix.
Now there is a new memory issue starting from 0122:
memory status is normal before loading weights into the meta model, but it increases a lot once loading is finished.
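For context, a sketch of the meta-model loading pattern referred to above (illustrative only; the layer size is made up and this is not bigdl's internal code):

```python
import torch

# Build the module skeleton on the meta device: no storage is allocated,
# so memory stays flat at this point.
with torch.device("meta"):
    model = torch.nn.Linear(1024, 1024)
print(model.weight.is_meta)   # True: no storage allocated yet

# to_empty() materializes the parameters in place with uninitialized
# storage on the target device; a loader would then copy the saved
# low-bit weights into them. Memory should grow by roughly one copy of
# the weights here, not more.
model.to_empty(device="cpu")
print(model.weight.is_meta)   # False: real cpu storage now exists
```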
This issue will be resolved in today's new wheel.