
Comments (5)

ahz-r3v commented on June 9, 2024

I think these may be helpful. I added some debug prints in engine_base.py:

# Excerpt from _estimate_max_total_sequence_length in engine_base.py,
# with debug prints added before the assertion:
max_total_sequence_length = int(
    (
        int(gpu_size_bytes) * 0.90
        - params_bytes
        - temp_func_bytes
        - kv_aux_workspace_bytes
        - model_workspace_bytes
        - logit_processor_workspace_bytes
    )
    / kv_bytes_per_token
)
print("gpu_size_bytes: {}".format(gpu_size_bytes))
print("params_bytes: {}".format(params_bytes))
print("temp_func_bytes: {}".format(temp_func_bytes))
print("kv_aux_workspace_bytes: {}".format(kv_aux_workspace_bytes))
print("model_workspace_bytes: {}".format(model_workspace_bytes))
print("logit_processor_workspace_bytes: {}".format(logit_processor_workspace_bytes))
print("kv_bytes_per_token: {}".format(kv_bytes_per_token))
assert max_total_sequence_length > 0, (
    "Cannot estimate KV cache capacity. "
    f"The model weight size {params_bytes} may be larger than GPU memory size {gpu_size_bytes}"
)

and here are the outputs:

gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 591666136
model_workspace_bytes: 268894528
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
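
Plugging these numbers back into the formula above makes the failure mode clear: temp_func_bytes (~21.9 GB) alone exceeds the entire GPU budget (~16.7 GB), so the estimate goes negative before the KV cache is even considered. A minimal sketch with the reported values:

# Reported values from the debug prints above.
gpu_size_bytes = 16739180544
params_bytes = 1033572352.0
temp_func_bytes = 21852323840
kv_aux_workspace_bytes = 591666136
model_workspace_bytes = 268894528
logit_processor_workspace_bytes = 195999040.0
kv_bytes_per_token = 196609.25

budget = int(gpu_size_bytes) * 0.90  # ~15.1 GB usable after the 10% safety margin
remaining = (
    budget
    - params_bytes
    - temp_func_bytes  # ~21.9 GB: larger than the whole budget by itself
    - kv_aux_workspace_bytes
    - model_workspace_bytes
    - logit_processor_workspace_bytes
)
print(int(remaining / kv_bytes_per_token))  # about -45151, so the assert fires

So the assertion triggers, but its message blaming the model weight size is misleading; the dominant term is the temporary-function workspace, which the reply below suggests is driven by the prefill chunk size.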


tqchen commented on June 9, 2024

Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by auto-limiting this number when we run gen config.
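
For illustration only, a hypothetical sketch of such a clamp (the helper name is invented, and it assumes the temporary workspace grows roughly linearly with the chunk size, which this thread implies but does not confirm):

def clamp_prefill_chunk_size(prefill_chunk_size: int,
                             gpu_size_bytes: int,
                             params_bytes: float,
                             temp_bytes_per_chunk_token: float) -> int:
    # Hypothetical: halve the chunk size until the (assumed linear)
    # temporary-function workspace fits in 90% of GPU memory after
    # the model weights are loaded.
    budget = gpu_size_bytes * 0.90 - params_bytes
    while (prefill_chunk_size > 1
           and prefill_chunk_size * temp_bytes_per_chunk_token > budget):
        prefill_chunk_size //= 2
    return prefill_chunk_size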


Hzfengsy commented on June 9, 2024

cc @MasterJH5574 to see if we can enhance the error message
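
One possible enhancement, sketched against the snippet quoted above (not a committed patch): report the full budget breakdown so the dominant term is visible, instead of blaming the weights.

assert max_total_sequence_length > 0, (
    "Cannot estimate KV cache capacity. Memory budget breakdown (bytes): "
    f"gpu_size={gpu_size_bytes}, params={params_bytes}, "
    f"temp_func={temp_func_bytes}, kv_aux_workspace={kv_aux_workspace_bytes}, "
    f"model_workspace={model_workspace_bytes}, "
    f"logit_processor_workspace={logit_processor_workspace_bytes}. "
    "The non-KV components already exceed 90% of GPU memory; "
    "consider reducing prefill_chunk_size or using a smaller model."
)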


ahz-r3v commented on June 9, 2024

Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by auto-limiting this number when we run gen config.

Thanks for the suggestion. But after reducing the prefill chunk size to 4096, mlc_llm serve still fails with the same error when running Qwen1.5-1.8B-Chat-q4f16_1-MLC:

(base) root@orangepi5pro:/home/Allin/mlc-llm-new# mlc_llm serve ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/ --device opencl --model-lib-path /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so 
[2024-04-28 23:11:54] INFO auto_device.py:76: Found device: opencl:0
[2024-04-28 23:11:54] INFO chat_module.py:379: Using model folder: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC
[2024-04-28 23:11:54] INFO chat_module.py:380: Using mlc chat config: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json
[2024-04-28 23:11:54] INFO chat_module.py:529: Using library model: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 118004696
model_workspace_bytes: 33898816
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
Traceback (most recent call last):
  File "/root/miniforge3/bin/mlc_llm", line 33, in <module>
    sys.exit(load_entry_point('mlc-llm', 'console_scripts', 'mlc_llm')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/__main__.py", line 41, in main
    cli.main(sys.argv[2:])
  File "/home/Allin/mlc-llm-new/python/mlc_llm/cli/serve.py", line 75, in main
    serve(
  File "/home/Allin/mlc-llm-new/python/mlc_llm/interface/serve.py", line 43, in serve
    async_engine = engine.AsyncEngine(model_info, kv_cache_config, enable_tracing=enable_tracing)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine.py", line 780, in __init__
    super().__init__("async", models, kv_cache_config, engine_mode, enable_tracing)
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 565, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 230, in _estimate_max_total_sequence_length
    assert max_total_sequence_length > 0, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot estimate KV cache capacity. The model weight size 1033572352.0 may be larger than GPU memory size 16739180544

Here's my mlc-chat-config.json after changing prefill_chunk_size and context_window_size to 4096:

{
  "model_type": "qwen2",
  "quantization": "q4f16_1",
  "model_config": {
    "hidden_act": "silu",
    "hidden_size": 2048,
    "intermediate_size": 5504,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 16,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "vocab_size": 151936,
    "context_window_size": 4096,
    "prefill_chunk_size": 4096,
    "tensor_parallel_shards": 1,
    "head_dim": 128,
    "dtype": "float32"
  },
  "vocab_size": 151936,
  "context_window_size": 4096,
  "sliding_window_size": -1,
  "prefill_chunk_size": 4096,
  "attention_sink_size": -1,
  "tensor_parallel_shards": 1,
  "mean_gen_len": 128,
  "max_gen_len": 512,
  "shift_fill_factor": 0.3,
  "temperature": 0.7,
  "presence_penalty": 0.0,
  "freq_uency_penalty": 0.0,
  "repetition_penalty": 1.1,
  "top_p": 0.8,
  "conv_template": {
    "name": "mistral_default",
    "system_template": "[INST] {system_message}",
    "system_message": "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.",
    "system_prefix_token_ids": [
      1
    ],
    "add_role_after_system_message": true,
    "roles": {
      "user": "<|im_start|>user",
      "assistant": "<|im_start|>assistant"
    },
    "role_templates": {
      "user": "{user_message}",
      "assistant": "{assistant_message}",
      "tool": "{tool_message}"
    },
    "messages": [],
    "seps": [
      "<|im_end|>\n"
    ],
    "role_content_sep": "\n",
    "role_empty_sep": "\n",
    "stop_str": [
      "<|im_end|>"
    ],
    "stop_token_ids": [
      151643
    ],
    "function_string": "",
    "use_function_calling": false
  },
  "pad_token_id": 151643,
  "bos_token_id": 151643,
  "eos_token_id": [
    151645,
    151643
  ],
  "tokenizer_files": [
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
    "added_tokens.json"
  ],
  "version": "0.1.0"
}

I also tried Qwen-7B-Chat-q4f16_1-MLC (https://huggingface.co/mlc-ai/Qwen-7B-Chat-q4f16_1-MLC); that model works fine with both mlc_llm chat and mlc_llm serve.
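
Comparing the two sets of debug prints above is telling: kv_aux_workspace_bytes and model_workspace_bytes both shrank after the config change, but temp_func_bytes stayed at exactly 21852323840, suggesting the compiled library still reflects the old settings. A quick check with the reported numbers:

# Debug prints from the first run vs. the run after editing mlc-chat-config.json.
before = {"temp_func_bytes": 21852323840,
          "kv_aux_workspace_bytes": 591666136,
          "model_workspace_bytes": 268894528}
after = {"temp_func_bytes": 21852323840,
         "kv_aux_workspace_bytes": 118004696,
         "model_workspace_bytes": 33898816}
for key in before:
    status = "changed" if before[key] != after[key] else "UNCHANGED"
    print(f"{key}: {before[key]} -> {after[key]} ({status})")

This points the same way as the reply below: the .so likely needs to be rebuilt for the prefill_chunk_size change to take effect.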


tqchen commented on June 9, 2024

@ahz-r3v you might need to cross-check whether you have recompiled the model lib after changing the config.
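
A recompile along these lines should regenerate the library with the new prefill_chunk_size (paths copied from the session above; exact flags may vary across mlc-llm versions):

mlc_llm compile ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json \
    --device opencl \
    -o ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so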

