Comments (11)
We found the root cause: chatglm uses skip_init to initialize the model, but the device defaulted to CPU. The subsequent quantization steps then treated the model as a weight-initialized model and allocated buffers for the linear layers, so chatglm ended up with two sets of linear weights, which explains the observation.
The reason it looks normal on Linux is that Linux appears to release the memory automatically, while Windows does not.
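A minimal sketch of the mechanism described above (illustrative only; the layer size is made up and this is not bigdl's actual code):

```python
import torch
from torch.nn.utils import skip_init

# skip_init constructs the module on the meta device, then transfers an
# *uninitialized* copy to the target device, which defaults to cpu.
layer = skip_init(torch.nn.Linear, 1024, 1024)

# The weights live on cpu and occupy real memory even though their values
# were never initialized. If quantization code later allocates fresh
# buffers for the same linear layers, both allocations coexist: two sets
# of weights, matching the doubled memory observed here.
print(layer.weight.device)  # cpu
```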
from bigdl.
Hi, this is the result on our machine (Windows 11 + Arc A770):
- CPU memory occupied before running bigdl: 10G
- While loading the chatglm3-6b int4 model from disk to CPU: peak memory 18G
- After putting the loaded int4 model on the XPU: back to 10G
Regarding "the CPU memory suddenly rises to 18 GB and then drops back to 11 GB", we want to ask:
- Is any memory already occupied by other applications beforehand?
- Is the drop back to 11G observed while loading to the CPU, or when putting the loaded model on the GPU?
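For reference, one way to take such memory readings is to sample the process's resident set size around each step. This is a sketch (Linux-only, reading /proc; on Windows a third-party package such as psutil would be used instead), and the loader call names are placeholders, not bigdl's API:

```python
import os

def current_rss_gb() -> float:
    """Current resident set size of this process in GB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1024 ** 3

# Hypothetical measurement points around the steps described above:
# base = current_rss_gb()              # before loading anything
# model = load_int4_model(path)        # placeholder for the real loader
# after_load = current_rss_gb()        # the ~18G peak would show up here
# model = model.to("xpu")              # move to the Arc GPU
# after_move = current_rss_gb()        # expect a drop back toward base
```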
The memory increase when loading the model is not reasonable; we are looking into this.
> The memory increase when loading the model is not reasonable; we are looking into this.
Thanks! The excessive CPU memory increase occurs with chatglm3-6B on Arc 750 / Windows 10, but the CPU memory increase is fine for baichuan2-7B. So it is due to the chatglm3-6B model.
> The memory increase when loading the model is not reasonable; we are looking into this.
> Thanks! The excessive CPU memory increase occurs with chatglm3-6B on Arc 750 / Windows 10, but the CPU memory increase is fine for baichuan2-7B. So it is due to the chatglm3-6B model.
Sure, we also observe that llama is fine while chatglm3 is abnormal (seemingly only on Windows; Ubuntu is fine). We are looking into this and will find the reason as soon as possible.
@KiwiHana please verify whether the latest build fixes your issue. I suppose it would be the 20240123 version.
> @KiwiHana please verify whether the latest build fixes your issue. I suppose it would be the 20240123 version.
OK, I will check the day after tomorrow, when the A750 Windows device comes back.
Hi @zhentaocc, 2.5.0b20240123 does not solve the problem. It also occurs on MTL iGPU.
bigdl-core-xe-21 2.5.0b20240123
bigdl-llm 2.5.0b20240123
intel-extension-for-pytorch 2.1.10+git8ff85d6 with oneapi 2024.0
torch 2.1.0a0+cxx11.abi
torchvision 0.16.0a0+cxx11.abi
finish to xpu
stream_chat-----input_ids: {'input_ids': tensor([[64790, 64792, 4155, 2488, 260, 622, 30932, 627, 13519, 260,
1332, 2689, 554, 7364, 289, 431, 15672, 30930, 1165, 2456,
289, 490, 289, 3727, 293, 1630, 623, 705, 30932, 293,
431, 817]], device='xpu:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]], device='xpu:0'), 'position_ids': tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
device='xpu:0')}
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Administrator\AppData\Local\Programs\AIGC Assistant\resources\audiollm\llmsd_env_asr\lib\site-packages\transformers\generation\utils.py", line 1335, in generate
and torch.sum(inputs_tensor[:, -1] == generation_config.pad_token_id) > 0
RuntimeError: Allocation is out of device memory on current platform.
I have tested with the 20240118 version and a loading method like:
model = AutoModel.load_low_bit(load_path, optimize_model=True, device="meta",
                               trust_remote_code=True, use_cache=True, cpu_embedding=cpu_embedding)
which works. But the 20240122 version does not work, so it seems other changes caused this issue.
0118 is confirmed to be good with the previous fix.
Now there is a new memory issue starting from 0122:
memory status is normal before loading weights into the meta model, but it increases a lot once loading is finished.
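For context, a sketch of the meta-model loading pattern referred to above (illustrative only; the layer size is made up and this is not bigdl's internal code):

```python
import torch

# Build the module skeleton on the meta device: no storage is allocated,
# so memory stays flat at this point.
with torch.device("meta"):
    model = torch.nn.Linear(1024, 1024)
print(model.weight.is_meta)   # True: no storage allocated yet

# to_empty() materializes the parameters in place with uninitialized
# storage on the target device; a loader would then copy the saved
# low-bit weights into them. Memory should grow by roughly one copy of
# the weights here, not more.
model.to_empty(device="cpu")
print(model.weight.is_meta)   # False: real cpu storage now exists
```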
This issue will be resolved in today's new wheel.