internlm / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Home Page: https://lmdeploy.readthedocs.io/en/latest/

License: Apache License 2.0

Python 42.51% Shell 0.33% CMake 1.50% Cuda 19.30% C++ 36.28% C 0.04% Dockerfile 0.03% PowerShell 0.01%
cuda-kernels deepspeed fastertransformer llm-inference turbomind internlm llama llm codellama llama2


lmdeploy's Issues

【P0】support vicuna 7B

lmdeploy does not support vicuna 7B well, because its preprocessor cannot tokenize <s> and </s> into the BOS and EOS tokens respectively.

I think we'd better replace the tokenizer (lmdeploy/fastertransformer/triton_models/preprocessing/1/model.py) with huggingface's AutoTokenizer.
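
For reference, a minimal standalone sketch of that idea (this is not the Triton preprocessing model.py itself, and the checkpoint path is a placeholder): with huggingface's AutoTokenizer, literal <s>/</s> markers in the prompt are mapped to the BOS/EOS token ids instead of being split into sub-pieces.

from transformers import AutoTokenizer

# /path/to/vicuna-7b-v1.1 is a placeholder for a local vicuna checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/path/to/vicuna-7b-v1.1", use_fast=True)

text = "<s>A chat between a curious user and an assistant.</s>"
ids = tokenizer.encode(text, add_special_tokens=False)

# The special tokens should survive as single ids rather than character pieces.
print(ids[0] == tokenizer.bos_token_id)   # expected: True
print(ids[-1] == tokenizer.eos_token_id)  # expected: True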

FYI, here is an introduction to downloading and serving the vicuna-7B v1.1 model.

Question about persistent Batch Inference

Hi, thank you for open-sourcing the LMDeploy project!

There is an image in the documentation that gives a good description of the dynamic batching inference process, but I couldn't find more details about how LMDeploy implements this feature.

Is there any document, or could you tell me where this part is implemented in the code?

Trouble with 'Quantization' in '/README.md'

Here is the log:

(lmdeploy_test) [xxx@xxxxxxxxxxxx internlm_test]$ python -m lmdeploy.lite.apis.kv_qparams --model internlm-chat-7b --output_dir internlm-chat-7b-deploy --symmetry True --offload  False --num_tp 1
Traceback (most recent call last):
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nvme/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 199, in <module>
    fire.Fire(main)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 112, in main
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class InternLMTokenizer does not exist or is not currently imported.

It looks like it cannot load 'InternLMTokenizer' properly.
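
This usually happens because InternLM's tokenizer class lives in remote code (tokenization_internlm.py) shipped with the checkpoint, which transformers only imports when trust_remote_code=True is passed. A minimal standalone check, assuming ./internlm-chat-7b is the local checkpoint directory; whether kv_qparams exposes an equivalent option is not confirmed here:

from transformers import AutoTokenizer

# trust_remote_code lets transformers import tokenization_internlm.py from the
# checkpoint directory instead of looking for InternLMTokenizer in the library.
tokenizer = AutoTokenizer.from_pretrained(
    "./internlm-chat-7b",
    use_fast=False,
    trust_remote_code=True,
)
print(type(tokenizer).__name__)  # expected: InternLMTokenizer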

Error on startup

After converting llama-7b, starting it in docker reports an error.
Conversion command:
python3 -m lmdeploy.serve.turbomind.deploy llama-7b /home/nlp/lwp/pre_models/llama-7b-hf hf
Launch command:
docker run --gpus "device=8" --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat llama /workspace
Error message:
[WARNING] gemm_config.in is not found; using default GEMM algo
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 96, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 35, in main
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path,
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

Running with the script under workspace also fails, with: error: creating server: Internal - failed to load all models

What is going wrong here?
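
A commonly reported workaround for the LLaMATokenizer error, offered here as an assumption about this particular checkpoint: older llama-7b-hf conversions record the class name as LLaMATokenizer in tokenizer_config.json, while recent transformers versions expect LlamaTokenizer. A sketch that patches the file in place (apply it to the source checkpoint and to any copy made into the workspace):

import json

# Path taken from the conversion command above.
config_path = "/home/nlp/lwp/pre_models/llama-7b-hf/tokenizer_config.json"

with open(config_path) as f:
    cfg = json.load(f)

if cfg.get("tokenizer_class") == "LLaMATokenizer":
    cfg["tokenizer_class"] = "LlamaTokenizer"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    print("patched tokenizer_class to LlamaTokenizer")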

How do I do batch inference?

With a batch size of 2, I ran python3 -m lmdeploy.turbomind.chat llama /workspace [0,1] and filled input_ids with one entry per batch element, but got RuntimeError: output with shape [1] doesn't match the broadcast shape [2]

[WIP] Support InternLM on 3rd-party inference toolboxes

This issue tracks progress on third-party inference toolboxes related to InternLM.

VLLM

https://github.com/wangruohui/vllm/tree/internlm

  • Inference with single GPU
    • There seems to be a bug; I'm not sure whether it comes from my implementation or from upstream
  • Tensor parallel

DeepSpeed

InternLM-7B is supported in DeepSpeed inference and has been merged into the main branch: microsoft/DeepSpeed#4137 (a usage sketch follows below)

  • Single GPU with kernel injection policy
  • Tensor parallel

Meta tensor for faster model loading: watching microsoft/DeepSpeed#3608
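
For reference, a rough sketch of what DeepSpeed inference with kernel injection looks like; the model id, dtype, and whether injection fully covers InternLM depend on the DeepSpeed version tracked in microsoft/DeepSpeed#4137, so treat this as an assumption rather than a verified recipe:

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm-7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True)

# replace_with_kernel_inject enables the kernel injection policy; mp_size > 1
# plus a `deepspeed --num_gpus N` launch gives tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Hello, InternLM!", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))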

Triton server support

I looked at the code, and the Triton backend seems to be supported too, but there is no int8_mode option. Is that not supported?

'InternLMTokenizer' object has no attribute 'backend_tokenizer'

When running Inference by TurboMind with the command
python3 -m lmdeploy.turbomind.chat internlm ./workspace/
I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 109, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 43, in main
tokenizer = Tokenizer(tokenizer_model_path)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 152, in init
self.model = HuggingFaceTokenizer(model_folder)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 84, in init
self.model.backend_tokenizer.save(backend_tokenizer_file)
AttributeError: 'InternLMTokenizer' object has no attribute 'backend_tokenizer'

I looked at the source code, and backend_tokenizer doesn't seem to be used much. Is it actually needed?
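
For context, backend_tokenizer exists only on the "fast" (Rust-backed) tokenizers in transformers; InternLMTokenizer is a slow SentencePiece tokenizer, so the attribute is simply absent. A minimal standalone check of that distinction (not lmdeploy's tokenizer.py; the checkpoint path is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./internlm-7b", trust_remote_code=True)

if tok.is_fast:
    # Only PreTrainedTokenizerFast instances expose backend_tokenizer.
    tok.backend_tokenizer.save("tokenizer.json")
else:
    print(f"{type(tok).__name__} is a slow tokenizer; no backend_tokenizer to save")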

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.serve.turbomind.deploy internlm-7b /mnt/internlm-7b hf
create workspace in directory ./workspace
copy triton model templates from "/mnt/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models" successfully
['pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']

### copying layers.31.attention.wo.bias, shape=torch.Size([4096])
layers.31.attention.wo.0.bias torch.Size([4096])
*** splitting layers.31.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w1.0.weight torch.Size([4096, 11008])
*** splitting layers.31.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0
layers.31.feed_forward.w2.0.weight torch.Size([11008, 4096])
*** splitting layers.31.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w3.0.weight torch.Size([4096, 11008])
layers.31.attention_norm.weight torch.Size([4096])
layers.31.ffn_norm.weight torch.Size([4096])
tok_embeddings.weight torch.Size([103168, 4096])
norm.weight torch.Size([4096])
output.weight torch.Size([103168, 4096])
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 549, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 522, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 462, in deploy_hf
    return export(model_name, num_layer, norm_eps, model_params,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 167, in export
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 87, in tokenizer_info
    sp_model = SentencePieceProcessor(model_file=model_path)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 


You need all the LFS files.
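
A frequent cause of this SentencePiece parse failure is that tokenizer.model in the checkout is still a Git LFS pointer (a small text stub) rather than the real binary model, which matches the remark above. A quick check, with the path assumed from the command at the top of the issue:

# A Git LFS pointer is a tiny text file beginning with this header; the real
# SentencePiece model is a binary file of several hundred KB or more.
path = "/mnt/internlm-7b/tokenizer.model"  # assumed location of the tokenizer

with open(path, "rb") as f:
    head = f.read(64)

if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
    print("LFS pointer only; run `git lfs pull` inside the model repo")
else:
    print("looks like a real SentencePiece model")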

Using deepspeed TP to load InternLM, but memory usage is not reduced.

My hardware is four 16 GB V100 GPUs. Following the README.md, I run PyTorch-based inference:

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With single-GPU inference, GPU memory usage after loading the model is 15 GB.

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With two-GPU tensor-parallel inference, each GPU also uses 15 GB after the model is loaded.

The expected reduction in GPU memory did not happen; I don't understand where the problem is.

Hugging Face safetensors support

I'm trying to deploy the LLaMA2 70B chat model locally and found that LMDeploy does not seem to support Hugging Face safetensors checkpoints. It just raises a confusing exception:

Traceback (most recent call last):
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 592, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 562, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 482, in deploy_hf
    assert num_layer == i, f'miss matched layers: {num_layer} vs {i}'
AssertionError: miss matched layers: 80 vs 0

because it only reads *.bin files:

_files = [file for file in os.listdir(model_path) if file.endswith('.bin')]
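
For illustration only (not the project's actual fix), a sketch of how that weight-gathering step could also accept safetensors shards; safetensors.torch.load_file returns a plain dict of tensors, so the rest of the conversion could stay unchanged. The checkpoint path is a placeholder.

import os
import torch
from safetensors.torch import load_file

model_path = "/path/to/llama2-70b-chat-hf"  # placeholder checkpoint directory

# Accept both .bin and .safetensors shards instead of only .bin.
_files = sorted(f for f in os.listdir(model_path)
                if f.endswith(('.bin', '.safetensors')))

model_params = {}
for name in _files:
    full = os.path.join(model_path, name)
    if name.endswith('.safetensors'):
        model_params.update(load_file(full))
    else:
        model_params.update(torch.load(full, map_location='cpu'))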

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.pytorch.chat /mnt/internlm-7b \ 
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 190, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 120, in main
    tokenizer, model = init_model(
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 62, in init_model
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 693, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-7b/tokenization_internlm.py", line 81, in __init__
    self.sp_model.Load(vocab_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

pip list

(lmdeploy) ➜  lmdeploy git:(main) pip list
Package            Version  Editable project location
------------------ -------- -------------------------
addict             2.4.0
brotlipy           0.7.0
certifi            2023.5.7
cffi               1.15.1
charset-normalizer 2.0.4
contourpy          1.1.0
cryptography       39.0.1
cycler             0.11.0
filelock           3.9.0
fire               0.5.0
fonttools          4.41.0
fsspec             2023.6.0
gmpy2              2.1.2
grpcio             1.56.0
huggingface-hub    0.16.4
idna               3.4
importlib-metadata 6.8.0
Jinja2             3.1.2
kiwisolver         1.4.4
lmdeploy           0.0.1    /mnt/lmdeploy
markdown-it-py     3.0.0
MarkupSafe         2.1.1
matplotlib         3.7.2
mdurl              0.1.2
mkl-fft            1.3.6
mkl-random         1.2.2
mkl-service        2.4.0
mmengine           0.8.2
mpmath             1.2.1
networkx           2.8.4
numpy              1.25.0
opencv-python      4.8.0.74
packaging          23.1
Pillow             9.4.0
pip                23.1.2
platformdirs       3.9.1
protobuf           4.23.4
pybind11           2.11.0
pycparser          2.21
Pygments           2.15.1
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
python-rapidjson   1.10
PyYAML             6.0
regex              2023.6.3
requests           2.29.0
rich               13.4.2
safetensors        0.3.1
sentencepiece      0.1.99
setuptools         67.8.0
six                1.16.0
sympy              1.11.1
termcolor          2.3.0
tokenizers         0.13.3
tomli              2.0.1
torch              2.0.0
torchaudio         2.0.0
torchvision        0.15.0
tqdm               4.65.0
transformers       4.29.2
triton             2.0.0
tritonclient       2.33.0
typing_extensions  4.6.3
urllib3            1.26.16
wheel              0.38.4
yapf               0.40.1
zipp               3.16.2

Test

PB10.mp4 (video attachment)

PB

PB11.mp4 (video attachment)
PersistentBatchInference.mp4 (video attachment)

PersistentBatchInference

Can llama 65B KV cache quantization and context FMHA not be enabled at the same time?

During deployment testing, llama 7B runs fine with use_context_fmha = 1 and quant_policy = 4, but llama 65B does not; it requires use_context_fmha = 0. Is this a problem on my side, or can the two currently not be enabled together?

The error is:
what(): [TM][ERROR] CUDA runtime error: an illegal memory access was encountered lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:843

HTTP client question

Is there a plain HTTP client that does not require installing the full lmdeploy package or making gRPC calls, and that does not need streaming, i.e. one that returns the whole answer in a single response?

a bug

I tried to deploy on my server with 2x3090, CUDA 11.7.
It deploys normally with the command: docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat internlm /workspace
However, it cannot be started with bash workspace/service_docker_up.sh because of a segmentation fault.

Also, when I try python3 lmdeploy.app {server_ip_addresss}:33337 internlm to start a client, it reports that torch has no module cuda. This is because lmdeploy/lmdeploy/torch is added to sys.path, and that torch does not have a cuda module. I fixed it by adding sys.path.remove("lmdeploy/lmdeploy/torch").
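
A sketch of the workaround described above; the path fragment is exactly as reported and may need adjusting to the actual checkout location:

import sys

# Drop the stray lmdeploy/lmdeploy/torch entry so the real torch package wins.
sys.path = [p for p in sys.path
            if not p.rstrip("/").endswith("lmdeploy/lmdeploy/torch")]

import torch
print(torch.cuda.is_available())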

Question about internlm-chat-7b-8k

I converted the internlm-chat-7b-8k model to turbomind format with tp=2, and the generated weight/config.ini is shown below. Isn't the 8k model supposed to support 8000+ tokens? Where is this configured? Right now, any call longer than 2048 tokens fails. (A sketch for adjusting the config follows after the listing below.)

[llama]
model_name = internlm-chat-7b
head_num = 32
size_per_head = 128
vocab_size = 103168
num_layer = 32
rotary_embedding = 128
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 1
end_id = 2
weight_type = fp16
max_batch_size = 32
max_context_token_num = 4
session_len = 2056
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 2
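
A minimal sketch of adjusting the generated config, under the assumption that session_len is the knob that caps the context length here; other fields such as max_context_token_num or the cache sizes may also need tuning, and GPU memory use grows with longer sessions. The path follows the usual workspace layout and may differ from yours:

import configparser

cfg_path = "./workspace/triton_models/weights/config.ini"  # assumed location

cfg = configparser.ConfigParser()
cfg.read(cfg_path)
print("current session_len:", cfg["llama"]["session_len"])

cfg["llama"]["session_len"] = "8200"  # illustrative value for an 8k model
with open(cfg_path, "w") as f:
    cfg.write(f)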

Ziya fails to start

(lmdeploy) ➜ lmdeploy sudo bash workspace/service_docker_up.sh

=============================
== Triton Inference Server ==

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0721 03:34:28.851237 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6d00000000' with size 268435456
I0721 03:34:28.851696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0721 03:34:28.857984 1 model_lifecycle.cc:459] loading: postprocessing:1
I0721 03:34:28.858019 1 model_lifecycle.cc:459] loading: preprocessing:1
I0721 03:34:28.858036 1 model_lifecycle.cc:459] loading: turbomind:1
I0721 03:34:29.000309 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0721 03:34:29.000337 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0721 03:34:29.000340 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0721 03:34:29.002218 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0721 03:34:29.002902 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
num_nodes=1
tp_pp_size=1
gpu_size=1
world_size=1
model_instance_size=1
I0721 03:34:29.002929 1 libfastertransformer.cc:346] Sequence Batching: disabled
I0721 03:34:29.002934 1 libfastertransformer.cc:357] Dynamic Batching: disabled
[ERROR] Does not find the section llama with name model_name.

Error in ModelLifeCycle::CreateModel() when creating the model

========== step 1 ============
On one machine I ran the lmdeploy.serve.turbomind.deploy command with tp=2, because I want to run the model on two GPUs.
This step succeeded.

========== step 2 ============
On another machine I ran service_docker_up.sh. tritonserver started the turbomind backend, but the model failed to load and reported an error.

Here is the error output:

I0712 08:36:20.719380 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0c44000000' with size 268435456
I0712 08:36:20.720671 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0712 08:36:20.720696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
W0712 08:36:20.859316 1 server.cc:218] failed to enable peer access for some device pairs
I0712 08:36:21.439338 1 model_lifecycle.cc:459] loading: turbomind:1
I0712 08:36:21.441099 1 model_lifecycle.cc:459] loading: postprocessing:1
I0712 08:36:21.442823 1 model_lifecycle.cc:459] loading: preprocessing:1
I0712 08:36:21.582434 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0712 08:36:21.582484 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0712 08:36:21.582501 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0712 08:36:21.585413 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0712 08:36:21.586543 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
E0712 08:36:21.586654 1 libfastertransformer.cc:226] Invalid configuration argument 'tensor_para_size': stoi
[3b379e147757:1    :0:86] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:     86) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x0000000000018313 triton::backend::turbomind_backend::ModelState::ModelState()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:323
 2 0x0000000000024554 triton::backend::turbomind_backend::ModelState::Create()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:182
 3 0x0000000000024b81 TRITONBACKEND_ModelInitialize()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:1791
 4 0x000000000010689b triton::core::TritonModel::Create()  :0
 5 0x00000000001c4f5d triton::core::ModelLifeCycle::CreateModel()  :0
 6 0x00000000001caccd std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke()  model_lifecycle.cc:0
 7 0x00000000003083a0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
 8 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 9 0x0000000000008609 start_thread()  ???:0
10 0x000000000011f133 clone()  ???:0
=================================
[3b379e147757:00001] *** Process received signal ***
[3b379e147757:00001] Signal: Floating point exception (8)
[3b379e147757:00001] Signal code:  (-6)
[3b379e147757:00001] Failing at address: 0x1
[3b379e147757:00001] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f0c8d1a9420]
[3b379e147757:00001] [ 1] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x18313)[0x7f0c80652313]
[3b379e147757:00001] [ 2] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x24554)[0x7f0c8065e554]
[3b379e147757:00001] [ 3] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(TRITONBACKEND_ModelInitialize+0x341)[0x7f0c8065eb81]
[3b379e147757:00001] [ 4] /opt/tritonserver/lib/libtritonserver.so(+0x10689b)[0x7f0c8c2de89b]
[3b379e147757:00001] [ 5] /opt/tritonserver/lib/libtritonserver.so(+0x1c4f5d)[0x7f0c8c39cf5d]
[3b379e147757:00001] [ 6] /opt/tritonserver/lib/libtritonserver.so(+0x1caccd)[0x7f0c8c3a2ccd]
[3b379e147757:00001] [ 7] /opt/tritonserver/lib/libtritonserver.so(+0x3083a0)[0x7f0c8c4e03a0]
[3b379e147757:00001] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0c8be25de4]
[3b379e147757:00001] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f0c8d19d609]
[3b379e147757:00001] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0c8bb10133]
[3b379e147757:00001] *** End of error message ***

Any suggestions for debugging this?

[Bug] TurboMind execute failure: 1

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When I communicate with the inference server for more than one round, there is an error. I have to reset the session.

TurboMind execute failure:  1
07/31 12:28:43 - service.ft - ERROR - /usr/local/lib/python3.8/dist-packages/lmdeploy/serve/turbomind/chatbot.py - stream_consumer - 553 - got error from turbomind, code StatusCode.TRITON_SERVER_ERR, TurboMind execute failure:  1, token 677

Reproduction

Communicate with the inference server for more than one round.

Error traceback

No response

Inference gets stuck, possibly before the invocation of the 'forward' method.

Hello, I have run the service successfully. However, when I use the app 'lmdeploy.app.py' and send a message to the server, I notice that the inference gets stuck.

These are the logs of Tritonserver.

[TM][INFO] [forward][rank=0] INPUT: step [1]
[TM][INFO] [forward][rank=0] INPUT: repetition_penalty [1]
[TM][INFO] [forward][rank=0] INPUT: temperature [1]
[TM][INFO] [forward][rank=0] INPUT: STOP [1]
[TM][INFO] [forward][rank=0] INPUT: START [1]
[TM][INFO] [forward][rank=0] INPUT: random_seed [1]
[TM][INFO] [forward][rank=0] INPUT: input_ids [1, 15]
[TM][INFO] [forward][rank=0] INPUT: stop_words_list [1, 2, 2]
[TM][INFO] [forward][rank=0] INPUT: runtime_top_p [1]
[TM][INFO] [forward][rank=0] INPUT: END [1]
[TM][INFO] [forward][rank=0] INPUT: input_lengths [1]
[TM][INFO] [forward][rank=0] INPUT: CORRID [1]
[TM][INFO] [forward][rank=0] INPUT: request_output_len [1, 1]
[TM][INFO] [forward][rank=0] INPUT: session_len [1]
[TM][INFO] [forward][rank=0] OUTPUT: sequence_length [1, 1]
[TM][INFO] [forward][rank=0] OUTPUT: output_ids [1, 1, 2056]
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [synchronize] batch_size = 0
[TM][INFO] [LlamaCacheManager][create] 140002462050048
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] free = 0
[TM][INFO] [init] infer_request_count = 1
[TM][INFO] [init] batch_size = 1
[TM][INFO] [init] session_len = 2056
[TM][INFO] [init] max_input_length = 15
[TM][INFO] [init] max_context_len = 15
[TM][INFO] [init] slot  sequence_id  history_len  input_len  context_len  tmp_input_len  token_ids.size  cache_len
[TM][INFO] [init]    0   3708069632            0         15           15             15               0          0
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 14, max_input_len = 14, max_context_len = 14
[TM][INFO] context decoding start

Based on the source code, I believe it might be stuck before the 'forward' method.

TM_LOG_INFO("context decoding start");

Could you give me some advice? Thanks.

Deploying according to readme.md fails with [FT][ERROR] CUDA runtime error: invalid argument /opt/trito

Step 1: download the internlm-chat-7b model.

Step 2: run the docker image: docker run -itd --net=host --name internlm_server --gpus all -v ./workspace/:/workspace -v /data/models/:/models -it openmmlab/lmdeploy:latest bash

Step 3: convert the model to turbomind format (I have two T4 GPUs):
root@:/opt/tritonserver/lmdeploy# python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /models/internlm-chat-7b hf --tp 2

Step 4: change into the directory and run:
root@/opt/tritonserver/lmdeploy# python3 -m lmdeploy.turbomind.chat internlm ./workspace/

The error is as follows:

[WARNING] gemm_config.in is not found; using default GEMM algo
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:252

Aborted (core dumped)

[QA] How to convert saved ckpt contents into pytorch_model files

How can I convert the ckpt contents saved during training into pytorch_model files? Thanks.

For example, after pretraining for 100 steps with zero=4/tensor=2 on our own data, the saved ckpt folder contains:
context.pt gpus-8_pp-0_tp-0_zo-3.pt gpus-8_pp-0_tp-0_zo-7.pt optimizer_tp0_pp0_zo1.pt optimizer_tp0_pp0_zo5.pt schedulder.pt
gpus-8_pp-0_tp-0_zo-0.pt gpus-8_pp-0_tp-0_zo-4.pt model_config.pt optimizer_tp0_pp0_zo2.pt optimizer_tp0_pp0_zo6.pt topo_tp0_pp0.json
gpus-8_pp-0_tp-0_zo-1.pt gpus-8_pp-0_tp-0_zo-5.pt model_tp0_pp0.pt optimizer_tp0_pp0_zo3.pt optimizer_tp0_pp0_zo7.pt
gpus-8_pp-0_tp-0_zo-2.pt gpus-8_pp-0_tp-0_zo-6.pt optimizer_tp0_pp0_zo0.pt optimizer_tp0_pp0_zo4.pt sampler.pt

Goal: convert them into model files that can be loaded directly by lmdeploy.serve.turbomind.deploy:
config.json modeling_internlm.py pytorch_model.bin.index.json tokenization_internlm.py
configuration_internlm.py pytorch_model-00001-of-00002.bin README.md tokenizer_config.json
generation_config.json pytorch_model-00002-of-00002.bin special_tokens_map.json

[documentation] Run fails when following the readme docs

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

Users need to install the requirements for InternLM before they can run python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf successfully. Could these tips be added to the documentation?

Comparison with vLLM

vLLM claims up to a 24x throughput boost compared with the vanilla LLaMA implementation; does lmdeploy have any speed tests comparing against it?

Question about benchmark

Hi, I tested LMDeploy with the following steps:

    1. Get the model from https://huggingface.co/internlm/internlm-chat-7b/
    2. Convert it to triton models: python -m lmdeploy.serve.turbomind.deploy interlm-7b interlm-7b hf
    3. Run python3 profile_generation.py --model_path /workspace/ --model_name internlm --concurrency 8 --input_seqlen 1 --output_seqlen 2048 --test_round 8 in the provided docker image openmmlab/lmdeploy:latest on an A100 80G

The result I get is throughput: 70.98455828512093 token/s, while the documentation shows it reaching almost 640 token/s with batch=8.

Are there any configurations I need to modify? Thanks.

an error about llama-65b

65B
python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh

Is this correct? I found this in the docs.

ModuleNotFoundError: No module named '_turbomind'

I installed with pip install -e . and tried to run python3 -m lmdeploy.turbomind.chat llama ... but got:

  File "/mnt//lmdeploy/lmdeploy/turbomind/__init__.py", line 3, in <module>
    from .turbomind import TurboMind
  File "/mnt//work/lmdeploy/lmdeploy/turbomind/turbomind.py", line 17, in <module>
    import _turbomind as _tm  # noqa: E402
ModuleNotFoundError: No module named '_turbomind'
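
For what it's worth, _turbomind is the compiled C++/CUDA extension, so installing only the Python sources typically does not provide it until the native library is built. A small diagnostic sketch; the build output directory used below is purely an assumption and varies by setup:

import importlib.util
import sys

if importlib.util.find_spec("_turbomind") is None:
    # Hypothetical location of a locally built extension; adjust to your build.
    sys.path.append("/mnt/lmdeploy/build/lib")
    found = importlib.util.find_spec("_turbomind") is not None
    print("found after adding build dir:", found)
else:
    print("_turbomind is importable")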
