internlm / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Home Page: https://lmdeploy.readthedocs.io/en/latest/

License: Apache License 2.0

Python 42.51% Shell 0.33% CMake 1.50% Cuda 19.30% C++ 36.28% C 0.04% Dockerfile 0.03% PowerShell 0.01%
cuda-kernels deepspeed fastertransformer llm-inference turbomind internlm llama llm codellama llama2


lmdeploy's Issues

【P0】support vicuna 7B

lmdeploy does not support vicuna 7B well, because its preprocessor cannot tokenize <s> and </s> into the BOS and EOS tokens respectively.

I think we'd better replace the tokenizer (lmdeploy/fastertransformer/triton_models/preprocessing/1/model.py) with huggingface's AutoTokenizer.
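
For reference, a minimal standalone sketch of that idea (this is not the Triton preprocessing model.py itself, and the checkpoint path is a placeholder): with huggingface's AutoTokenizer, literal <s>/</s> markers in the prompt are mapped to the BOS/EOS token ids instead of being split into sub-pieces.

from transformers import AutoTokenizer

# /path/to/vicuna-7b-v1.1 is a placeholder for a local vicuna checkpoint.
tokenizer = AutoTokenizer.from_pretrained("/path/to/vicuna-7b-v1.1", use_fast=True)

text = "<s>A chat between a curious user and an assistant.</s>"
ids = tokenizer.encode(text, add_special_tokens=False)

# The special tokens should survive as single ids rather than character pieces.
print(ids[0] == tokenizer.bos_token_id)   # expected: True
print(ids[-1] == tokenizer.eos_token_id)  # expected: True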

FYI, here is an introduction to downloading and serving the vicuna-7B v1.1 model.

Question about persistent Batch Inference

Hi, thank you for open-sourcing the LMDeploy project!

There is an image in the documentation that gives a good description of the dynamic batching inference process, but I couldn't find more details about how LMDeploy implements this feature.

Is there any document, or could you tell me where this part is implemented in the code?

Trouble with 'Quantization' in '/README.md'

Here is the log:

(lmdeploy_test) [xxx@xxxxxxxxxxxx internlm_test]$ python -m lmdeploy.lite.apis.kv_qparams --model internlm-chat-7b --output_dir internlm-chat-7b-deploy --symmetry True --offload  False --num_tp 1
Traceback (most recent call last):
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nvme/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 199, in <module>
    fire.Fire(main)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/lmdeploy/lite/apis/kv_qparams.py", line 112, in main
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
  File "/xxx/xxx/miniconda3/envs/lmdeploy_test/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 688, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class InternLMTokenizer does not exist or is not currently imported.

It looks like it cannot load 'InternLMTokenizer' properly.
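
This usually happens because InternLM's tokenizer class lives in remote code (tokenization_internlm.py) shipped with the checkpoint, which transformers only imports when trust_remote_code=True is passed. A minimal standalone check, assuming ./internlm-chat-7b is the local checkpoint directory; whether kv_qparams exposes an equivalent option is not confirmed here:

from transformers import AutoTokenizer

# trust_remote_code lets transformers import tokenization_internlm.py from the
# checkpoint directory instead of looking for InternLMTokenizer in the library.
tokenizer = AutoTokenizer.from_pretrained(
    "./internlm-chat-7b",
    use_fast=False,
    trust_remote_code=True,
)
print(type(tokenizer).__name__)  # expected: InternLMTokenizer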

Error on startup

After converting llama-7b, starting it in docker reports an error.
Conversion command:
python3 -m lmdeploy.serve.turbomind.deploy llama-7b /home/nlp/lwp/pre_models/llama-7b-hf hf
Launch command:
docker run --gpus "device=8" --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat llama /workspace
Error message:
[WARNING] gemm_config.in is not found; using default GEMM algo
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 96, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/lmdeploy/turbomind/chat.py", line 35, in main
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path,
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.

Running with the script under workspace also fails, with: error: creating server: Internal - failed to load all models

What is going wrong here?
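
A commonly reported workaround for the LLaMATokenizer error, offered here as an assumption about this particular checkpoint: older llama-7b-hf conversions record the class name as LLaMATokenizer in tokenizer_config.json, while recent transformers versions expect LlamaTokenizer. A sketch that patches the file in place (apply it to the source checkpoint and to any copy made into the workspace):

import json

# Path taken from the conversion command above.
config_path = "/home/nlp/lwp/pre_models/llama-7b-hf/tokenizer_config.json"

with open(config_path) as f:
    cfg = json.load(f)

if cfg.get("tokenizer_class") == "LLaMATokenizer":
    cfg["tokenizer_class"] = "LlamaTokenizer"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
    print("patched tokenizer_class to LlamaTokenizer")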

How do I do batch inference?

With a batch size of 2, I ran python3 -m lmdeploy.turbomind.chat llama /workspace [0,1] and filled input_ids with one entry per batch element, but got RuntimeError: output with shape [1] doesn't match the broadcast shape [2]

[WIP] Support InternLM on 3rd-party inference toolboxes

This issue tracks progress on third-party inference toolboxes related to InternLM.

VLLM

https://github.com/wangruohui/vllm/tree/internlm

  • Inference with single GPU
    • There seems to be a bug; I'm not sure whether it comes from my implementation or from upstream
  • Tensor parallel

DeepSpeed

InternLM-7B is supported in DeepSpeed inference and has been merged into the main branch: microsoft/DeepSpeed#4137 (a usage sketch follows below)

  • Single GPU with kernel injection policy
  • Tensor parallel

Meta tensor for faster model loading: watching microsoft/DeepSpeed#3608
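
For reference, a rough sketch of what DeepSpeed inference with kernel injection looks like; the model id, dtype, and whether injection fully covers InternLM depend on the DeepSpeed version tracked in microsoft/DeepSpeed#4137, so treat this as an assumption rather than a verified recipe:

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm-7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True)

# replace_with_kernel_inject enables the kernel injection policy; mp_size > 1
# plus a `deepspeed --num_gpus N` launch gives tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Hello, InternLM!", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))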

Triton server support

I looked at the code, and the Triton backend seems to be supported too, but there is no int8_mode option. Is that not supported?

'InternLMTokenizer' object has no attribute 'backend_tokenizer'

When running Inference by TurboMind with the command
python3 -m lmdeploy.turbomind.chat internlm ./workspace/
I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 109, in
fire.Fire(main)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/chat.py", line 43, in main
tokenizer = Tokenizer(tokenizer_model_path)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 152, in init
self.model = HuggingFaceTokenizer(model_folder)
File "/opt/tritonserver/lmdeploy/lmdeploy/turbomind/tokenizer.py", line 84, in init
self.model.backend_tokenizer.save(backend_tokenizer_file)
AttributeError: 'InternLMTokenizer' object has no attribute 'backend_tokenizer'

I looked at the source code, and backend_tokenizer doesn't seem to be used much. Is it actually needed?
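
For context, backend_tokenizer exists only on the "fast" (Rust-backed) tokenizers in transformers; InternLMTokenizer is a slow SentencePiece tokenizer, so the attribute is simply absent. A minimal standalone check of that distinction (not lmdeploy's tokenizer.py; the checkpoint path is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./internlm-7b", trust_remote_code=True)

if tok.is_fast:
    # Only PreTrainedTokenizerFast instances expose backend_tokenizer.
    tok.backend_tokenizer.save("tokenizer.json")
else:
    print(f"{type(tok).__name__} is a slow tokenizer; no backend_tokenizer to save")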

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.serve.turbomind.deploy internlm-7b /mnt/internlm-7b hf
create workspace in directory ./workspace
copy triton model templates from "/mnt/lmdeploy/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models" successfully
['pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']

### copying layers.31.attention.wo.bias, shape=torch.Size([4096])
layers.31.attention.wo.0.bias torch.Size([4096])
*** splitting layers.31.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w1.0.weight torch.Size([4096, 11008])
*** splitting layers.31.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0
layers.31.feed_forward.w2.0.weight torch.Size([11008, 4096])
*** splitting layers.31.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1
layers.31.feed_forward.w3.0.weight torch.Size([4096, 11008])
layers.31.attention_norm.weight torch.Size([4096])
layers.31.ffn_norm.weight torch.Size([4096])
tok_embeddings.weight torch.Size([103168, 4096])
norm.weight torch.Size([4096])
output.weight torch.Size([103168, 4096])
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 549, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 522, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 462, in deploy_hf
    return export(model_name, num_layer, norm_eps, model_params,
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 167, in export
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
  File "/mnt/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 87, in tokenizer_info
    sp_model = SentencePieceProcessor(model_file=model_path)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 447, in Init
    self.Load(model_file=model_file, model_proto=model_proto)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 


You need all the LFS files.
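
A frequent cause of this SentencePiece parse failure is that tokenizer.model in the checkout is still a Git LFS pointer (a small text stub) rather than the real binary model, which matches the remark above. A quick check, with the path assumed from the command at the top of the issue:

# A Git LFS pointer is a tiny text file beginning with this header; the real
# SentencePiece model is a binary file of several hundred KB or more.
path = "/mnt/internlm-7b/tokenizer.model"  # assumed location of the tokenizer

with open(path, "rb") as f:
    head = f.read(64)

if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
    print("LFS pointer only; run `git lfs pull` inside the model repo")
else:
    print("looks like a real SentencePiece model")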

Using deepspeed TP to load InternLM, but memory usage is not reduced.

My hardware is four 16 GB V100 GPUs. Following the README.md, I run PyTorch-based inference:

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL\
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With single-GPU inference, GPU memory usage after loading the model is 15 GB.

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0

With two-GPU tensor-parallel inference, each GPU also uses 15 GB after the model is loaded.

The expected reduction in GPU memory did not happen; I don't understand where the problem is.

Hugging Face safetensors support

I'm trying to deploy the LLaMA2 70B chat model locally and found that LMDeploy does not seem to support Hugging Face safetensors checkpoints. It just raises a confusing exception:

Traceback (most recent call last):
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 592, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 562, in main
    res = deploy_hf(model_name, model_path, tokenizer_path,
  File "/workspace/lmdeploy/lmdeploy/serve/turbomind/deploy.py", line 482, in deploy_hf
    assert num_layer == i, f'miss matched layers: {num_layer} vs {i}'
AssertionError: miss matched layers: 80 vs 0

because it only reads *.bin files:

_files = [file for file in os.listdir(model_path) if file.endswith('.bin')]
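
For illustration only (not the project's actual fix), a sketch of how that weight-gathering step could also accept safetensors shards; safetensors.torch.load_file returns a plain dict of tensors, so the rest of the conversion could stay unchanged. The checkpoint path is a placeholder.

import os
import torch
from safetensors.torch import load_file

model_path = "/path/to/llama2-70b-chat-hf"  # placeholder checkpoint directory

# Accept both .bin and .safetensors shards instead of only .bin.
_files = sorted(f for f in os.listdir(model_path)
                if f.endswith(('.bin', '.safetensors')))

model_params = {}
for name in _files:
    full = os.path.join(model_path, name)
    if name.endswith('.safetensors'):
        model_params.update(load_file(full))
    else:
        model_params.update(torch.load(full, map_location='cpu'))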

RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

(lmdeploy) ➜  lmdeploy git:(main) python -m lmdeploy.pytorch.chat /mnt/internlm-7b \ 
    --max_new_tokens 64 \
    --temperture 0.8 \
    --top_p 0.95 \
    --seed 0
Traceback (most recent call last):
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 190, in <module>
    fire.Fire(main)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 120, in main
    tokenizer, model = init_model(
  File "/mnt/lmdeploy/lmdeploy/pytorch/chat.py", line 62, in init_model
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path,
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 693, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/internlm-7b/tokenization_internlm.py", line 81, in __init__
    self.sp_model.Load(vocab_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/miniconda/envs/lmdeploy/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] 

pip list

(lmdeploy) ➜  lmdeploy git:(main) pip list
Package            Version  Editable project location
------------------ -------- -------------------------
addict             2.4.0
brotlipy           0.7.0
certifi            2023.5.7
cffi               1.15.1
charset-normalizer 2.0.4
contourpy          1.1.0
cryptography       39.0.1
cycler             0.11.0
filelock           3.9.0
fire               0.5.0
fonttools          4.41.0
fsspec             2023.6.0
gmpy2              2.1.2
grpcio             1.56.0
huggingface-hub    0.16.4
idna               3.4
importlib-metadata 6.8.0
Jinja2             3.1.2
kiwisolver         1.4.4
lmdeploy           0.0.1    /mnt/lmdeploy
markdown-it-py     3.0.0
MarkupSafe         2.1.1
matplotlib         3.7.2
mdurl              0.1.2
mkl-fft            1.3.6
mkl-random         1.2.2
mkl-service        2.4.0
mmengine           0.8.2
mpmath             1.2.1
networkx           2.8.4
numpy              1.25.0
opencv-python      4.8.0.74
packaging          23.1
Pillow             9.4.0
pip                23.1.2
platformdirs       3.9.1
protobuf           4.23.4
pybind11           2.11.0
pycparser          2.21
Pygments           2.15.1
pyOpenSSL          23.0.0
pyparsing          3.0.9
PySocks            1.7.1
python-dateutil    2.8.2
python-rapidjson   1.10
PyYAML             6.0
regex              2023.6.3
requests           2.29.0
rich               13.4.2
safetensors        0.3.1
sentencepiece      0.1.99
setuptools         67.8.0
six                1.16.0
sympy              1.11.1
termcolor          2.3.0
tokenizers         0.13.3
tomli              2.0.1
torch              2.0.0
torchaudio         2.0.0
torchvision        0.15.0
tqdm               4.65.0
transformers       4.29.2
triton             2.0.0
tritonclient       2.33.0
typing_extensions  4.6.3
urllib3            1.26.16
wheel              0.38.4
yapf               0.40.1
zipp               3.16.2

Test

PB10.mp4 (video attachment)

PB

PB11.mp4 (video attachment)
PersistentBatchInference.mp4 (video attachment)

PersistentBatchInference

Can llama 65B KV cache quantization and context FMHA not be enabled at the same time?

During deployment testing, llama 7B runs fine with use_context_fmha = 1 and quant_policy = 4, but llama 65B does not; it requires use_context_fmha = 0. Is this a problem on my side, or can the two currently not be enabled together?

The error is:
what(): [TM][ERROR] CUDA runtime error: an illegal memory access was encountered lmdeploy/src/turbomind/models/llama/LlamaBatch.cc:843

HTTP client question

Is there a plain HTTP client that does not require installing the full lmdeploy package or making gRPC calls, and that does not need streaming, i.e. one that returns the whole answer in a single response?

a bug

I tried to deploy on my server with 2x3090, CUDA 11.7.
It deploys normally with the command: docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest python3 -m lmdeploy.turbomind.chat internlm /workspace
However, it cannot be started with bash workspace/service_docker_up.sh because of a segmentation fault.

Also, when I try python3 lmdeploy.app {server_ip_addresss}:33337 internlm to start a client, it reports that torch has no module cuda. This is because lmdeploy/lmdeploy/torch is added to sys.path, and that torch does not have a cuda module. I fixed it by adding sys.path.remove("lmdeploy/lmdeploy/torch").
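
A sketch of the workaround described above; the path fragment is exactly as reported and may need adjusting to the actual checkout location:

import sys

# Drop the stray lmdeploy/lmdeploy/torch entry so the real torch package wins.
sys.path = [p for p in sys.path
            if not p.rstrip("/").endswith("lmdeploy/lmdeploy/torch")]

import torch
print(torch.cuda.is_available())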

Question about internlm-chat-7b-8k

I converted the internlm-chat-7b-8k model to turbomind format with tp=2, and the generated weight/config.ini is shown below. Isn't the 8k model supposed to support 8000+ tokens? Where is this configured? Right now, any call longer than 2048 tokens fails. (A sketch for adjusting the config follows after the listing below.)

[llama]
model_name = internlm-chat-7b
head_num = 32
size_per_head = 128
vocab_size = 103168
num_layer = 32
rotary_embedding = 128
inter_size = 11008
norm_eps = 1e-06
attn_bias = 1
start_id = 1
end_id = 2
weight_type = fp16
max_batch_size = 32
max_context_token_num = 4
session_len = 2056
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
tensor_para_size = 2
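
A minimal sketch of adjusting the generated config, under the assumption that session_len is the knob that caps the context length here; other fields such as max_context_token_num or the cache sizes may also need tuning, and GPU memory use grows with longer sessions. The path follows the usual workspace layout and may differ from yours:

import configparser

cfg_path = "./workspace/triton_models/weights/config.ini"  # assumed location

cfg = configparser.ConfigParser()
cfg.read(cfg_path)
print("current session_len:", cfg["llama"]["session_len"])

cfg["llama"]["session_len"] = "8200"  # illustrative value for an 8k model
with open(cfg_path, "w") as f:
    cfg.write(f)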

Ziya fails to start

(lmdeploy) ➜ lmdeploy sudo bash workspace/service_docker_up.sh

=============================
== Triton Inference Server ==

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0721 03:34:28.851237 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6d00000000' with size 268435456
I0721 03:34:28.851696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0721 03:34:28.857984 1 model_lifecycle.cc:459] loading: postprocessing:1
I0721 03:34:28.858019 1 model_lifecycle.cc:459] loading: preprocessing:1
I0721 03:34:28.858036 1 model_lifecycle.cc:459] loading: turbomind:1
I0721 03:34:29.000309 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0721 03:34:29.000337 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0721 03:34:29.000340 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0721 03:34:29.002218 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0721 03:34:29.002902 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
num_nodes=1
tp_pp_size=1
gpu_size=1
world_size=1
model_instance_size=1
I0721 03:34:29.002929 1 libfastertransformer.cc:346] Sequence Batching: disabled
I0721 03:34:29.002934 1 libfastertransformer.cc:357] Dynamic Batching: disabled
[ERROR] Does not find the section llama with name model_name.

Error in ModelLifeCycle::CreateModel() when creating the model

========== step 1 ============
On one machine I ran the lmdeploy.serve.turbomind.deploy command with tp=2, because I want to run the model on two GPUs.
This step succeeded.

========== step 2 ============
On another machine I ran service_docker_up.sh. tritonserver started the turbomind backend, but the model failed to load and reported an error.

Here is the error output:

I0712 08:36:20.719380 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0c44000000' with size 268435456
I0712 08:36:20.720671 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0712 08:36:20.720696 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
W0712 08:36:20.859316 1 server.cc:218] failed to enable peer access for some device pairs
I0712 08:36:21.439338 1 model_lifecycle.cc:459] loading: turbomind:1
I0712 08:36:21.441099 1 model_lifecycle.cc:459] loading: postprocessing:1
I0712 08:36:21.442823 1 model_lifecycle.cc:459] loading: preprocessing:1
I0712 08:36:21.582434 1 libfastertransformer.cc:1746] TRITONBACKEND_Initialize: turbomind
I0712 08:36:21.582484 1 libfastertransformer.cc:1753] Triton TRITONBACKEND API version: 1.10
I0712 08:36:21.582501 1 libfastertransformer.cc:1757] 'turbomind' TRITONBACKEND API version: 1.10
I0712 08:36:21.585413 1 libfastertransformer.cc:1784] TRITONBACKEND_ModelInitialize: turbomind (version 1)
I0712 08:36:21.586543 1 libfastertransformer.cc:307] Instance group type: KIND_CPU count: 48
E0712 08:36:21.586654 1 libfastertransformer.cc:226] Invalid configuration argument 'tensor_para_size': stoi
[3b379e147757:1    :0:86] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:     86) ====
 0 0x0000000000014420 __funlockfile()  ???:0
 1 0x0000000000018313 triton::backend::turbomind_backend::ModelState::ModelState()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:323
 2 0x0000000000024554 triton::backend::turbomind_backend::ModelState::Create()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:182
 3 0x0000000000024b81 TRITONBACKEND_ModelInitialize()  /opt/tritonserver/lmdeploy/src/turbomind/triton_backend/libfastertransformer.cc:1791
 4 0x000000000010689b triton::core::TritonModel::Create()  :0
 5 0x00000000001c4f5d triton::core::ModelLifeCycle::CreateModel()  :0
 6 0x00000000001caccd std::_Function_handler<void (), triton::core::ModelLifeCycle::AsyncLoad(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, inference::ModelConfig const&, bool, std::shared_ptr<triton::core::TritonRepoAgentModelList> const&, std::function<void (triton::core::Status)>&&)::{lambda()#1}>::_M_invoke()  model_lifecycle.cc:0
 7 0x00000000003083a0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<triton::common::ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()  thread_pool.cc:0
 8 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
 9 0x0000000000008609 start_thread()  ???:0
10 0x000000000011f133 clone()  ???:0
=================================
[3b379e147757:00001] *** Process received signal ***
[3b379e147757:00001] Signal: Floating point exception (8)
[3b379e147757:00001] Signal code:  (-6)
[3b379e147757:00001] Failing at address: 0x1
[3b379e147757:00001] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f0c8d1a9420]
[3b379e147757:00001] [ 1] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x18313)[0x7f0c80652313]
[3b379e147757:00001] [ 2] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(+0x24554)[0x7f0c8065e554]
[3b379e147757:00001] [ 3] /opt/tritonserver/backends/turbomind/libtriton_turbomind.so(TRITONBACKEND_ModelInitialize+0x341)[0x7f0c8065eb81]
[3b379e147757:00001] [ 4] /opt/tritonserver/lib/libtritonserver.so(+0x10689b)[0x7f0c8c2de89b]
[3b379e147757:00001] [ 5] /opt/tritonserver/lib/libtritonserver.so(+0x1c4f5d)[0x7f0c8c39cf5d]
[3b379e147757:00001] [ 6] /opt/tritonserver/lib/libtritonserver.so(+0x1caccd)[0x7f0c8c3a2ccd]
[3b379e147757:00001] [ 7] /opt/tritonserver/lib/libtritonserver.so(+0x3083a0)[0x7f0c8c4e03a0]
[3b379e147757:00001] [ 8] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f0c8be25de4]
[3b379e147757:00001] [ 9] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f0c8d19d609]
[3b379e147757:00001] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f0c8bb10133]
[3b379e147757:00001] *** End of error message ***

Any suggestions for debugging this?

[Bug] TurboMind execute failure: 1

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When I communicate with the inference server for more than one round, there is an error. I have to reset the session.

TurboMind execute failure:  1
07/31 12:28:43 - service.ft - ERROR - /usr/local/lib/python3.8/dist-packages/lmdeploy/serve/turbomind/chatbot.py - stream_consumer - 553 - got error from turbomind, code StatusCode.TRITON_SERVER_ERR, TurboMind execute failure:  1, token 677

Reproduction

Communicate with the inference server for more than one round.

Error traceback

No response

Inference gets stuck, possibly before the invocation of the 'forward' method.

Hello, I have run the service successfully. However, when I use the app 'lmdeploy.app.py' and send a message to the server, I notice that the inference gets stuck.

These are the logs of Tritonserver.

[TM][INFO] [forward][rank=0] INPUT: step [1]
[TM][INFO] [forward][rank=0] INPUT: repetition_penalty [1]
[TM][INFO] [forward][rank=0] INPUT: temperature [1]
[TM][INFO] [forward][rank=0] INPUT: STOP [1]
[TM][INFO] [forward][rank=0] INPUT: START [1]
[TM][INFO] [forward][rank=0] INPUT: random_seed [1]
[TM][INFO] [forward][rank=0] INPUT: input_ids [1, 15]
[TM][INFO] [forward][rank=0] INPUT: stop_words_list [1, 2, 2]
[TM][INFO] [forward][rank=0] INPUT: runtime_top_p [1]
[TM][INFO] [forward][rank=0] INPUT: END [1]
[TM][INFO] [forward][rank=0] INPUT: input_lengths [1]
[TM][INFO] [forward][rank=0] INPUT: CORRID [1]
[TM][INFO] [forward][rank=0] INPUT: request_output_len [1, 1]
[TM][INFO] [forward][rank=0] INPUT: session_len [1]
[TM][INFO] [forward][rank=0] OUTPUT: sequence_length [1, 1]
[TM][INFO] [forward][rank=0] OUTPUT: output_ids [1, 1, 2056]
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][INFO] [synchronize] batch_size = 0
[TM][INFO] [LlamaCacheManager][create] 140002462050048
[TM][INFO] [LlamaCacheManager][allocate]
[TM][INFO] [LlamaCacheManager][allocate] free = 0
[TM][INFO] [init] infer_request_count = 1
[TM][INFO] [init] batch_size = 1
[TM][INFO] [init] session_len = 2056
[TM][INFO] [init] max_input_length = 15
[TM][INFO] [init] max_context_len = 15
[TM][INFO] [init] slot  sequence_id  history_len  input_len  context_len  tmp_input_len  token_ids.size  cache_len
[TM][INFO] [init]    0   3708069632            0         15           15             15               0          0
[TM][INFO] [decodeContext] base = 0, count = 1
[TM][INFO] [decodeContext] offset = 0, batch_size = 1, token_num = 14, max_input_len = 14, max_context_len = 14
[TM][INFO] context decoding start

Based on the source code, I believe it might be stuck before the 'forward' method.

TM_LOG_INFO("context decoding start");

Could you give me some advice? Thanks.

Deploying according to readme.md fails with [FT][ERROR] CUDA runtime error: invalid argument /opt/trito

Step 1: download the internlm-chat-7b model.

Step 2: run the docker image: docker run -itd --net=host --name internlm_server --gpus all -v ./workspace/:/workspace -v /data/models/:/models -it openmmlab/lmdeploy:latest bash

Step 3: convert the model to turbomind format (I have two T4 GPUs):
root@:/opt/tritonserver/lmdeploy# python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /models/internlm-chat-7b hf --tp 2

Step 4: change into the directory and run:
root@/opt/tritonserver/lmdeploy# python3 -m lmdeploy.turbomind.chat internlm ./workspace/

The error is as follows:

[WARNING] gemm_config.in is not found; using default GEMM algo
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: invalid argument /opt/tritonserver/lmdeploy/src/turbomind/utils/allocator.h:252

Aborted (core dumped)

[QA] How to convert saved ckpt contents into pytorch_model files

How can I convert the ckpt contents saved during training into pytorch_model files? Thanks.

For example, after pretraining for 100 steps with zero=4/tensor=2 on our own data, the saved ckpt folder contains:
context.pt gpus-8_pp-0_tp-0_zo-3.pt gpus-8_pp-0_tp-0_zo-7.pt optimizer_tp0_pp0_zo1.pt optimizer_tp0_pp0_zo5.pt schedulder.pt
gpus-8_pp-0_tp-0_zo-0.pt gpus-8_pp-0_tp-0_zo-4.pt model_config.pt optimizer_tp0_pp0_zo2.pt optimizer_tp0_pp0_zo6.pt topo_tp0_pp0.json
gpus-8_pp-0_tp-0_zo-1.pt gpus-8_pp-0_tp-0_zo-5.pt model_tp0_pp0.pt optimizer_tp0_pp0_zo3.pt optimizer_tp0_pp0_zo7.pt
gpus-8_pp-0_tp-0_zo-2.pt gpus-8_pp-0_tp-0_zo-6.pt optimizer_tp0_pp0_zo0.pt optimizer_tp0_pp0_zo4.pt sampler.pt

Goal: convert them into model files that can be loaded directly by lmdeploy.serve.turbomind.deploy:
config.json modeling_internlm.py pytorch_model.bin.index.json tokenization_internlm.py
configuration_internlm.py pytorch_model-00001-of-00002.bin README.md tokenizer_config.json
generation_config.json pytorch_model-00002-of-00002.bin special_tokens_map.json

[documentation] Run fails when following the readme docs

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

Users need to install the requirements for InternLM before they can run python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf successfully. Could these tips be added to the documentation?

Comparison with vLLM

vLLM claims up to a 24x throughput boost compared with the vanilla LLaMA implementation; does lmdeploy have any speed tests comparing against it?

Question about benchmark

Hi, I tested LMDeploy with the following steps:

    1. Get the model from https://huggingface.co/internlm/internlm-chat-7b/
    2. Convert it to triton models: python -m lmdeploy.serve.turbomind.deploy interlm-7b interlm-7b hf
    3. Run python3 profile_generation.py --model_path /workspace/ --model_name internlm --concurrency 8 --input_seqlen 1 --output_seqlen 2048 --test_round 8 in the provided docker image openmmlab/lmdeploy:latest on an A100 80G

The result I get is throughput: 70.98455828512093 token/s, while the documentation shows it reaching almost 640 token/s with batch=8.

Are there any configurations I need to modify? Thanks.

an error about llama-65b

65B
python3 lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh

Is this correct? I found this in the docs.

ModuleNotFoundError: No module named '_turbomind'

I installed with pip install -e . and tried to run python3 -m lmdeploy.turbomind.chat llama ... but got:

  File "/mnt//lmdeploy/lmdeploy/turbomind/__init__.py", line 3, in <module>
    from .turbomind import TurboMind
  File "/mnt//work/lmdeploy/lmdeploy/turbomind/turbomind.py", line 17, in <module>
    import _turbomind as _tm  # noqa: E402
ModuleNotFoundError: No module named '_turbomind'
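
For what it's worth, _turbomind is the compiled C++/CUDA extension, so installing only the Python sources typically does not provide it until the native library is built. A small diagnostic sketch; the build output directory used below is purely an assumption and varies by setup:

import importlib.util
import sys

if importlib.util.find_spec("_turbomind") is None:
    # Hypothetical location of a locally built extension; adjust to your build.
    sys.path.append("/mnt/lmdeploy/build/lib")
    found = importlib.util.find_spec("_turbomind") is not None
    print("found after adding build dir:", found)
else:
    print("_turbomind is importable")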
