Hi, below screenshot captured the error encountered while running this sample, <a href

Failed to run Llama 2 inference on Flex 140 [CLOSED]

HLneoh commented on September 26, 2024

Failed to run Llama 2 inference on Flex 140

from bigdl.

Comments (4)

qiuxin2012 commented on September 26, 2024

This error is caused by running out of GPU memory. See our FAQ: https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/FAQ/faq.html#native-api-failed-native-api-returns-5-pi-error-out-of-resources-5-pi-error-out-of-resources
Could you check your Memory Physical Size (in the output of sudo xpu-smi discovery -d 0)? See https://dgpu-docs.intel.com/driver/installation.html#xpu-smi-device-information-and-telemetry. We recently found that one customer's Flex 140 had only 5GB of memory, when it should have 12GB.

HLneoh commented on September 26, 2024

Hi @qiuxin2012 & @rnwang04, from what I understand, the Flex 140 has two GPUs per card, with a total memory capacity of 12GB (6GB per GPU). See the attachment for my output of sudo xpu-smi discovery -d 0. Currently, inference is running only on GPU device 0.
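As a rough sanity check on the 6GB-per-GPU figure, the sketch below estimates the size of the model weights alone at a given bit width. The 7-billion-parameter count is an assumption (the thread does not say which Llama 2 size is being run), and the estimate ignores the KV cache, activations, and runtime overhead, so actual usage will be higher.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a 7B-parameter model; the thread does not state the model size.
ASSUMED_PARAMS = 7e9
PER_GPU_GB = 6.0  # per-GPU memory reported above for the Flex 140

for bits in (16, 4):
    size = weight_footprint_gb(ASSUMED_PARAMS, bits)
    verdict = "below" if size < PER_GPU_GB else "above"
    print(f"{bits}-bit weights: {size:.1f} GB ({verdict} the {PER_GPU_GB:.0f} GB per-GPU limit)")
```

Under these assumptions, 4-bit weights leave only a few GB of headroom on a single GPU for everything else, which is consistent with an out-of-resources error during inference.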

jason-dai commented on September 26, 2024

In this case, we may need to run the model inference on two cards using tensor parallelism (TP) or pipeline parallelism (PP).
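To illustrate the pipeline-parallel idea only, here is a minimal sketch that assigns contiguous blocks of decoder layers to devices. It is not BigDL's API: split_layers is a hypothetical helper, the layer count of 32 is an assumption, and the device names "xpu:0" and "xpu:1" are used purely as labels.

```python
def split_layers(num_layers: int, devices: list[str]) -> dict[str, range]:
    """Assign contiguous blocks of layers to devices, as evenly as possible."""
    base, extra = divmod(num_layers, len(devices))
    plan, start = {}, 0
    for i, dev in enumerate(devices):
        # The first `extra` devices each take one additional layer.
        count = base + (1 if i < extra else 0)
        plan[dev] = range(start, start + count)
        start += count
    return plan


# Assumption: 32 decoder layers split across the two GPUs.
plan = split_layers(32, ["xpu:0", "xpu:1"])
for dev, layers in plan.items():
    print(f"{dev}: layers {layers.start}..{layers.stop - 1}")
```

With such a split, each GPU holds roughly half of the layer weights, and activations are passed from the first device to the second between the two blocks.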

rnwang04 commented on September 26, 2024

Since the Flex 140's memory is so limited, you could add cpu_embedding=True in https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2/generate.py to reduce the memory used by the Embedding layer, and see whether it works:

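    from bigdl.llm.transformers import AutoModelForCausalLM

    # model_path is defined earlier in generate.py.
    # The only new argument is cpu_embedding=True (last line of the call).
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True,
                                                 cpu_embedding=True)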
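For a sense of how much GPU memory the Embedding layer might account for, the sketch below computes the size of an embedding table. The numbers are assumptions: a vocabulary of 32000 and a hidden size of 4096 (the Llama 2 7B configuration, which the thread does not confirm is the model in use), with the table stored in 16-bit precision.

```python
def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_value: int) -> int:
    """Size of an embedding table: one hidden-size vector per vocabulary entry."""
    return vocab_size * hidden_size * bytes_per_value


# Assumptions: Llama 2 7B shapes (vocab 32000, hidden 4096), 2 bytes per value.
size = embedding_bytes(vocab_size=32000, hidden_size=4096, bytes_per_value=2)
print(f"Embedding table: {size / 2**20:.0f} MiB")
```

Under these assumptions the saving is a few hundred MiB, which can matter when a 6GB device is already close to its limit.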