Comments (5)
I think these may be helpful. I added some debug prints around the KV cache capacity estimate in engine_base.py:
```python
max_total_sequence_length = int(
    (
        int(gpu_size_bytes) * 0.90
        - params_bytes
        - temp_func_bytes
        - kv_aux_workspace_bytes
        - model_workspace_bytes
        - logit_processor_workspace_bytes
    )
    / kv_bytes_per_token
)
print("gpu_size_bytes: {}".format(gpu_size_bytes))
print("params_bytes: {}".format(params_bytes))
print("temp_func_bytes: {}".format(temp_func_bytes))
print("kv_aux_workspace_bytes: {}".format(kv_aux_workspace_bytes))
print("model_workspace_bytes: {}".format(model_workspace_bytes))
print("logit_processor_workspace_bytes: {}".format(logit_processor_workspace_bytes))
print("kv_bytes_per_token: {}".format(kv_bytes_per_token))
assert max_total_sequence_length > 0, (
    "Cannot estimate KV cache capacity. "
    f"The model weight size {params_bytes} may be larger than GPU memory size {gpu_size_bytes}"
)
```
and here are the outputs:
```
gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 591666136
model_workspace_bytes: 268894528
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
```
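Note that temp_func_bytes alone (about 20.4 GiB) is larger than the entire reported GPU memory (about 15.6 GiB), so the numerator of the estimate goes negative before the KV cache gets any budget. A standalone sanity check with the reported values (a hypothetical script, not part of engine_base.py):

```python
# Plug the reported byte counts into the capacity formula from engine_base.py.
gpu_size_bytes = 16739180544
params_bytes = 1033572352.0
temp_func_bytes = 21852323840  # alone larger than total GPU memory
kv_aux_workspace_bytes = 591666136
model_workspace_bytes = 268894528
logit_processor_workspace_bytes = 195999040.0
kv_bytes_per_token = 196609.25

max_total_sequence_length = int(
    (
        int(gpu_size_bytes) * 0.90
        - params_bytes
        - temp_func_bytes
        - kv_aux_workspace_bytes
        - model_workspace_bytes
        - logit_processor_workspace_bytes
    )
    / kv_bytes_per_token
)
print(max_total_sequence_length)  # -45151, so the assertion fires
```

So the assert fires, but its message blames the model weights rather than the temporary function workspace that actually exhausts the budget.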
Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by automatically limiting this number when we run gen_config.
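A hypothetical invocation of that workaround (this assumes mlc_llm gen_config exposes a --prefill-chunk-size flag, and the paths and conv template are placeholders):

```shell
# Hypothetical sketch: regenerate mlc-chat-config.json with a smaller prefill
# chunk size. The model library must then be recompiled for it to take effect.
mlc_llm gen_config ./path/to/original-model \
    --quantization q4f16_1 \
    --conv-template chatml \
    --prefill-chunk-size 1024 \
    -o ./dist/my-model-MLC
```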
cc @MasterJH5574 to see if we can enhance the error message
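One possible shape for a more informative message (a sketch only, not actual mlc-llm code; the function name and budget dict are hypothetical):

```python
def check_kv_capacity(max_total_sequence_length: int, budget: dict) -> None:
    """Hypothetical sketch of a more informative check than the current assert.

    `budget` maps each component printed above (gpu_size_bytes, params_bytes,
    temp_func_bytes, ...) to its byte count, so the error shows which term
    exhausted the memory instead of only blaming the model weights.
    """
    if max_total_sequence_length > 0:
        return
    detail = "\n".join(f"  {name}: {value}" for name, value in budget.items())
    raise ValueError(
        "Cannot estimate KV cache capacity; the memory budget is exhausted "
        f"before any KV cache can be allocated:\n{detail}\n"
        "Consider reducing prefill_chunk_size and recompiling the model library."
    )
```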
> Thanks for reporting. As a temporary measure, reducing the prefill chunk size might help. We should follow up by automatically limiting this number when we run gen_config.
Thanks for the suggestion. But after reducing prefill_chunk_size to 4096, mlc_llm serve still fails with the same error when running Qwen1.5-1.8B-Chat-q4f16_1-MLC:
```
(base) root@orangepi5pro:/home/Allin/mlc-llm-new# mlc_llm serve ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/ --device opencl --model-lib-path /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
[2024-04-28 23:11:54] INFO auto_device.py:76: Found device: opencl:0
[2024-04-28 23:11:54] INFO chat_module.py:379: Using model folder: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC
[2024-04-28 23:11:54] INFO chat_module.py:380: Using mlc chat config: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json
[2024-04-28 23:11:54] INFO chat_module.py:529: Using library model: /home/Allin/mlc-llm-new/dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '7'.
gpu_size_bytes: 16739180544
params_bytes: 1033572352.0
temp_func_bytes: 21852323840
kv_aux_workspace_bytes: 118004696
model_workspace_bytes: 33898816
logit_processor_workspace_bytes: 195999040.0
kv_bytes_per_token: 196609.25
Traceback (most recent call last):
  File "/root/miniforge3/bin/mlc_llm", line 33, in <module>
    sys.exit(load_entry_point('mlc-llm', 'console_scripts', 'mlc_llm')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/__main__.py", line 41, in main
    cli.main(sys.argv[2:])
  File "/home/Allin/mlc-llm-new/python/mlc_llm/cli/serve.py", line 75, in main
    serve(
  File "/home/Allin/mlc-llm-new/python/mlc_llm/interface/serve.py", line 43, in serve
    async_engine = engine.AsyncEngine(model_info, kv_cache_config, enable_tracing=enable_tracing)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine.py", line 780, in __init__
    super().__init__("async", models, kv_cache_config, engine_mode, enable_tracing)
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 565, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Allin/mlc-llm-new/python/mlc_llm/serve/engine_base.py", line 230, in _estimate_max_total_sequence_length
    assert max_total_sequence_length > 0, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot estimate KV cache capacity. The model weight size 1033572352.0 may be larger than GPU memory size 16739180544
```
Here's my mlc-chat-config.json after changing prefill_chunk_size and context_window_size to 4096:
```json
{
  "model_type": "qwen2",
  "quantization": "q4f16_1",
  "model_config": {
    "hidden_act": "silu",
    "hidden_size": 2048,
    "intermediate_size": 5504,
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 16,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "vocab_size": 151936,
    "context_window_size": 4096,
    "prefill_chunk_size": 4096,
    "tensor_parallel_shards": 1,
    "head_dim": 128,
    "dtype": "float32"
  },
  "vocab_size": 151936,
  "context_window_size": 4096,
  "sliding_window_size": -1,
  "prefill_chunk_size": 4096,
  "attention_sink_size": -1,
  "tensor_parallel_shards": 1,
  "mean_gen_len": 128,
  "max_gen_len": 512,
  "shift_fill_factor": 0.3,
  "temperature": 0.7,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.1,
  "top_p": 0.8,
  "conv_template": {
    "name": "mistral_default",
    "system_template": "[INST] {system_message}",
    "system_message": "Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.",
    "system_prefix_token_ids": [
      1
    ],
    "add_role_after_system_message": true,
    "roles": {
      "user": "<|im_start|>user",
      "assistant": "<|im_start|>assistant"
    },
    "role_templates": {
      "user": "{user_message}",
      "assistant": "{assistant_message}",
      "tool": "{tool_message}"
    },
    "messages": [],
    "seps": [
      "<|im_end|>\n"
    ],
    "role_content_sep": "\n",
    "role_empty_sep": "\n",
    "stop_str": [
      "<|im_end|>"
    ],
    "stop_token_ids": [
      151643
    ],
    "function_string": "",
    "use_function_calling": false
  },
  "pad_token_id": 151643,
  "bos_token_id": 151643,
  "eos_token_id": [
    151645,
    151643
  ],
  "tokenizer_files": [
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
    "added_tokens.json"
  ],
  "version": "0.1.0"
}
```
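As a sanity check, the reported kv_bytes_per_token lines up with what this config implies for float16 K/V caches (q4f16_1 keeps the KV cache in float16); the extra 1.25 bytes per token is presumably per-token bookkeeping overhead:

```python
# KV cache bytes per token implied by the model config above.
num_hidden_layers = 24
num_key_value_heads = 16
head_dim = 128
bytes_per_element = 2  # float16

# 2x for the K and V tensors per layer.
kv_bytes = num_hidden_layers * 2 * num_key_value_heads * head_dim * bytes_per_element
print(kv_bytes)  # 196608, vs. the reported kv_bytes_per_token of 196609.25
```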
I also tried Qwen-7B-Chat-q4f16_1-MLC (https://huggingface.co/mlc-ai/Qwen-7B-Chat-q4f16_1-MLC); that model works fine with both mlc_llm chat and mlc_llm serve.
@ahz-r3v you might need to cross-check whether you recompiled the model library after changing the config. Notably, temp_func_bytes in your output is unchanged (21852323840), which suggests the compiled .so still embeds the old prefill chunk size; editing mlc-chat-config.json alone does not update the library.
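For reference, a recompile-and-serve sequence might look roughly like this (a sketch, not verified on this setup; the paths are copied from the serve command above):

```shell
# Rebuild the model library so the new prefill_chunk_size in
# mlc-chat-config.json is actually baked into the compiled .so.
mlc_llm compile \
    ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/mlc-chat-config.json \
    --device opencl \
    -o ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so

# Serve with the freshly built library.
mlc_llm serve ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/ \
    --device opencl \
    --model-lib-path ./dist/Qwen1.5-1.8B-Chat-q4f16_1-MLC/Qwen1.5-1.8B-Chat-q4f16_1-opencl.so
```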