Comments (5)
I think these may be helpful. I added some debug prints around the KV cache capacity estimate in engine_base.py:
```python
max_total_sequence_length = int(
    (
        int(gpu_size_bytes) * 0.90
        - params_bytes
        - temp_func_bytes
        - kv_aux_workspace_bytes
        - model_workspace_bytes
        - logit_processor_workspace_bytes
    )
    / kv_bytes_per_token
)
print("gpu_size_bytes: {}".format(gpu_size_bytes))
print("params_bytes: {}".format(params_bytes))
print("temp_func_bytes: {}".format(temp_func_bytes))
print("kv_aux_workspace_bytes: {}".format(kv_aux_workspace_bytes))
print("model_workspace_bytes: {}".format(model_workspace_bytes))
print("logit_processor_workspace_bytes: {}".format(logit_processor_workspace_bytes))
print("kv_bytes_per_token: {}".format(kv_bytes_per_token))
assert max_total_sequence_length > 0, (
    "Cannot estimate KV cache capacity. "
    f"The model weight size {params_bytes} may be larger than GPU memory size {gpu_size_bytes}"
)
```
and here are the outputs:
```
gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 591666136
model_workspace_bytes: 268894528
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
```
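Note that temp_func_bytes alone (about 20.4 GiB) is larger than the entire reported GPU memory (about 15.6 GiB), so the numerator of the estimate goes negative before the KV cache gets any budget. A standalone sanity check with the reported values (a hypothetical script, not part of engine_base.py):

```python
# Plug the reported byte counts into the capacity formula from engine_base.py.
gpu_size_bytes = 16739180544
params_bytes = 1033572352.0
temp_func_bytes = 21852323840  # alone larger than total GPU memory
kv_aux_workspace_bytes = 591666136
model_workspace_bytes = 268894528
logit_processor_workspace_bytes = 195999040.0
kv_bytes_per_token = 196609.25

max_total_sequence_length = int(
    (
        int(gpu_size_bytes) * 0.90
        - params_bytes
        - temp_func_bytes
        - kv_aux_workspace_bytes
        - model_workspace_bytes
        - logit_processor_workspace_bytes
    )
    / kv_bytes_per_token
)
print(max_total_sequence_length)  # -45151, so the assertion fires
```

So the assert fires, but its message blames the model weights rather than the temporary function workspace that actually exhausts the budget.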
Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by automatically limiting this number when we run gen_config.
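A hypothetical invocation of that workaround (this assumes mlc_llm gen_config exposes a --prefill-chunk-size flag, and the paths and conv template are placeholders):

```shell
# Hypothetical sketch: regenerate mlc-chat-config.json with a smaller prefill
# chunk size. The model library must then be recompiled for it to take effect.
mlc_llm gen_config ./path/to/original-model \
    --quantization q4f16_1 \
    --conv-template chatml \
    --prefill-chunk-size 1024 \
    -o ./dist/my-model-MLC
```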
cc @MasterJH5574 to see if we can enhance the error message
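One possible shape for a more informative message (a sketch only, not actual mlc-llm code; the function name and budget dict are hypothetical):

```python
def check_kv_capacity(max_total_sequence_length: int, budget: dict) -> None:
    """Hypothetical sketch of a more informative check than the current assert.

    `budget` maps each component printed above (gpu_size_bytes, params_bytes,
    temp_func_bytes, ...) to its byte count, so the error shows which term
    exhausted the memory instead of only blaming the model weights.
    """
    if max_total_sequence_length > 0:
        return
    detail = "\n".join(f"  {name}: {value}" for name, value in budget.items())
    raise ValueError(
        "Cannot estimate KV cache capacity; the memory budget is exhausted "
        f"before any KV cache can be allocated:\n{detail}\n"
        "Consider reducing prefill_chunk_size and recompiling the model library."
    )
```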
> Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by automatically limiting this number when we run gen_config.
Thanks for the suggestion. But after reducing prefill_chunk_size to 4096, mlc_llm serve still fails with the same error when running Qwen1.5-1.8B-Chat-q4f16_1-MLC:
```
(base) root@orangepi5pro:/home/Allin/mlc-llm-new# mlc_llm serve ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/ --device opencl --model-lib-path /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
[2024-04-28 23:11:54] INFO auto_device.py:76: Found device: opencl:0
[2024-04-28 23:11:54] INFO chat_module.py:379: Using model folder: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC
[2024-04-28 23:11:54] INFO chat_module.py:380: Using mlc chat config: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json
[2024-04-28 23:11:54] INFO chat_module.py:529: Using library model: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 118004696
model_workspace_bytes: 33898816
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
Traceback (most recent call last):
  File "/root/miniforge3/bin/mlc_llm", line 33, in <module>
    sys.exit(load_entry_point('mlc-llm', 'console_scripts', 'mlc_llm')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/__main__.py", line 41, in main
    cli.main(sys.argv[2:])
  File "/home/Allin/mlc-llm-new/python/mlc_llm/cli/serve.py", line 75, in main
    serve(
  File "/home/Allin/mlc-llm-new/python/mlc_llm/interface/serve.py", line 43, in serve
    async_engine = engine.AsyncEngine(model_info, kv_cache_config, enable_tracing=enable_tracing)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine.py", line 780, in __init__
    super().__init__("async", models, kv_cache_config, engine_mode, enable_tracing)
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 565, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 230, in _estimate_max_total_sequence_length
    assert max_total_sequence_length > 0, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot estimate KV cache capacity. The model weight size 1033572352.0 may be larger than GPU memory size 16739180544
```
Here's my mlc-chat-config.json after changing prefill_chunk_size and context_window_size to 4096:
```json
{
  "model_type": "qwen2",
  "quantization": "q4f16_1",
  "model_config": {
    "hidden_act": "silu",
    "hidden_size": 2048,
    "intermediate_size": 5504,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 16,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "vocab_size": 151936,
    "context_window_size": 4096,
    "prefill_chunk_size": 4096,
    "tensor_parallel_shards": 1,
    "head_dim": 128,
    "dtype": "float32"
  },
  "vocab_size": 151936,
  "context_window_size": 4096,
  "sliding_window_size": -1,
  "prefill_chunk_size": 4096,
  "attention_sink_size": -1,
  "tensor_parallel_shards": 1,
  "mean_gen_len": 128,
  "max_gen_len": 512,
  "shift_fill_factor": 0.3,
  "temperature": 0.7,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.1,
  "top_p": 0.8,
  "conv_template": {
    "name": "mistral_default",
    "system_template": "[INST] {system_message}",
    "system_message": "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.",
    "system_prefix_token_ids": [
      1
    ],
    "add_role_after_system_message": true,
    "roles": {
      "user": "<|im_start|>user",
      "assistant": "<|im_start|>assistant"
    },
    "role_templates": {
      "user": "{user_message}",
      "assistant": "{assistant_message}",
      "tool": "{tool_message}"
    },
    "messages": [],
    "seps": [
      "<|im_end|>\n"
    ],
    "role_content_sep": "\n",
    "role_empty_sep": "\n",
    "stop_str": [
      "<|im_end|>"
    ],
    "stop_token_ids": [
      151643
    ],
    "function_string": "",
    "use_function_calling": false
  },
  "pad_token_id": 151643,
  "bos_token_id": 151643,
  "eos_token_id": [
    151645,
    151643
  ],
  "tokenizer_files": [
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
    "added_tokens.json"
  ],
  "version": "0.1.0"
}
```
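As a sanity check, the reported kv_bytes_per_token lines up with what this config implies for float16 K/V caches (q4f16_1 keeps the KV cache in float16); the extra 1.25 bytes per token is presumably per-token bookkeeping overhead:

```python
# KV cache bytes per token implied by the model config above.
num_hidden_layers = 24
num_key_value_heads = 16
head_dim = 128
bytes_per_element = 2  # float16

# 2x for the K and V tensors per layer.
kv_bytes = num_hidden_layers * 2 * num_key_value_heads * head_dim * bytes_per_element
print(kv_bytes)  # 196608, vs. the reported kv_bytes_per_token of 196609.25
```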
I also tried Qwen-7B-Chat-q4f16_1-MLC (https://huggingface.co/mlc-ai/Qwen-7B-Chat-q4f16_1-MLC); that model works fine with both mlc_llm chat and mlc_llm serve.
@ahz-r3v you might need to cross-check whether you recompiled the model library after changing the config. Notably, temp_func_bytes in your output is unchanged (21852323840), which suggests the compiled .so still embeds the old prefill chunk size; editing mlc-chat-config.json alone does not update the library.
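For reference, a recompile-and-serve sequence might look roughly like this (a sketch, not verified on this setup; the paths are copied from the serve command above):

```shell
# Rebuild the model library so the new prefill_chunk_size in
# mlc-chat-config.json is actually baked into the compiled .so.
mlc_llm compile \
    ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json \
    --device opencl \
    -o ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so

# Serve with the freshly built library.
mlc_llm serve ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/ \
    --device opencl \
    --model-lib-path ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
```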