
qwen-tensorrt-llm's Introduction

Overview

Background

  • This project was an entry in the NVIDIA TensorRT Hackathon 2023 and uses TensorRT-LLM to accelerate inference for Qwen-7B-Chat. The related code lives on the release/0.1.0 branch; see that branch for the complete original workflow.

Since April 24, 2024, the latest main branch of the official TensorRT-LLM repository supports qwen/qwen2, so this repository no longer receives major updates.

Features

  • FP16 / BF16 (experimental)
  • INT8 weight-only, INT8 SmoothQuant, INT4 weight-only, INT4-AWQ, INT4-GPTQ
  • INT8 KV cache
  • Tensor parallelism (multi-GPU)
  • Gradio-based web demo
  • Triton API deployment, combined with inflight_batching for maximum throughput/concurrency
  • FastAPI server compatible with OpenAI-style requests, including function calling
  • CLI chat
  • LangChain integration

Supported models: qwen2 (recommended) / qwen (maintained only up to 0.7.0) / qwen-vl (maintained only up to 0.7.0)

Related tutorials:

Hardware and Software Requirements

  • Linux works best, with docker and nvidia-docker installed (installation guide). Windows should work in theory but has not been tested; feel free to experiment.
  • For Windows, refer to this tutorial: link
  • An NVIDIA GPU (30-series, 40-series, V100/A100, etc.) with sufficient GPU memory, RAM, and disk space. Based on the official Qwen inference requirements, the table below estimates the peak requirements during compilation; for reference only:
Model Size  Quantization        GPU Memory (GB)  CPU Memory (GB)  Disk Usage (GB)
1.8B        fp16                5                15               11
1.8B        int8 smooth quant   5                15               22
1.8B        int8 weight only    4                12               9
1.8B        int4 weight only    4                10               7
1.8B        int4 gptq (raw)     4                10               6
1.8B        int4 gptq (manual)  5                13               14
1.8B        int4 awq            5                13               18
7B          fp16                21               59               42
7B          int8 smooth quant   21               59               84
7B          int8 weight only    14               39               28
7B          int4 weight only    10               29               21
7B          int4 gptq (raw)     10               29               16
7B          int4 gptq (manual)  21               51               42
7B          int4 awq            21               51               56
14B         fp16                38               106              75
14B         int8 smooth quant   38               106              150
14B         int8 weight only    24               66               47
14B         int4 weight only    16               46               33
14B         int4 gptq (raw)     16               46               26
14B         int4 gptq (manual)  38               90               66
14B         int4 awq            38               90               94
72B         fp16                181              506              362
72B         int8 smooth quant   181              506              724
72B         int8 weight only    102              284              203
72B         int4 weight only    61               171              122
72B         int4 gptq (raw)     61               171              98
72B         int4 gptq (manual)  181              434              244
72B         int4 awq            181              434              406

Quick Start

Preparation

  1. Download the image.

    • The official Triton image 24.02, which corresponds to TensorRT-LLM 0.8.0 and does not include the TensorRT-LLM development package.

      docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
    • Windows users who want to try tritonserver deployment, or users without a GPU, can use the AutoDL image, which ships tritonserver 24.02 (corresponding to tensorrt_llm 0.8.0): link. Note: that page also contains a complete compilation tutorial.

  2. Clone this project

    git clone https://github.com/Tlntin/Qwen-TensorRT-LLM.git
    cd Qwen-TensorRT-LLM
  3. In the project directory, create and start the container, map the local examples directory to /app/tensorrt_llm/examples, and expose ports 8000 and 7860 so the API and web UI can be reached for debugging.

    docker run --gpus all \
      --name trt_llm \
      -d \
      --ipc=host \
      --ulimit memlock=-1 \
      --restart=always \
      --ulimit stack=67108864 \
      -p 8000:8000 \
      -p 7860:7860 \
      -v ${PWD}/examples:/app/tensorrt_llm/examples \
      nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 sleep 8640000
  4. Enter the qwen2 directory inside the docker container.
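
    • A minimal way to get a shell inside the running container (assuming the container name trt_llm from the docker run command above):

      docker exec -it trt_llm /bin/bash
      cd /app/tensorrt_llm/examples/qwen2/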

    • Install the officially pre-built tensorrt_llm with pip. Install numpy 1.x first, because tensorrt_llm 0.8.0 is not compatible with numpy 2.x.

      pip install "numpy<2"
      pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
    • Install the Python dependencies provided by this project

      cd /app/tensorrt_llm/examples/qwen2/
      pip install -r requirements.txt
    • Upgrade transformers; qwen2 requires at least version 4.37. Warnings about mismatched dependencies can be ignored.

      pip install "transformers>=4.37"
  5. Download a model from HuggingFace (other platforms are not supported yet), e.g. the Qwen1.5-7B-Chat model, rename the folder to qwen1.5_7b_chat, and place it under examples/qwen2/.
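
    • One way to fetch and place the model (a minimal sketch using huggingface_hub's snapshot_download; assumes huggingface_hub is installed and that you run it from examples/qwen2/):

      # download Qwen1.5-7B-Chat from HuggingFace into the folder name used in step 5
      python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen1.5-7B-Chat', local_dir='qwen1.5_7b_chat')"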

  6. Adjust the build parameters (optional)

    • The default build parameters, including batch_size, max_input_len, max_new_tokens, and seq_length, live in default_config.py
    • The default model paths, including hf_model_dir (model path), tokenizer_dir (tokenizer path), and int4_gptq_model_dir (output path for manual GPTQ quantization), can be changed to your own paths.
    • With 24 GB of GPU memory you can build directly; the defaults are fp16 and max_batch_size=2
    • With less GPU memory, lower max_batch_size to 1, or reduce max_input_len and max_new_tokens further (an illustrative sketch follows below)
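
    • Illustrative sketch of that kind of edit (the exact layout of default_config.py may differ; the values below are only examples for a low-memory GPU):

      # in default_config.py — hypothetical field names taken from the list above
      max_batch_size = 1      # down from the default of 2
      max_input_len = 1024    # shorter prompt budget to save memory
      max_new_tokens = 1024   # shorter generation budget to save memory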

Run Guide (fp16 model)

  1. Build.

    • Build fp16 (note: --remove_input_padding and --enable_context_fmha are optional flags that save some GPU memory).

      python3 build.py --remove_input_padding --enable_context_fmha
    • Build int8 (weight only).

      python3 build.py --use_weight_only --weight_only_precision=int8
    • Build int4 (weight only)

      python3 build.py --use_weight_only --weight_only_precision=int4
    • If the model does not fit on a single GPU and you do not want int4/int8 quantization, try tp = 2, i.e. build with two GPUs (note: tp currently only supports building the engine from the HuggingFace format)

      python3 build.py --world_size 2 --tp_size 2
  2. Run. After building, do a test run; if it prints Output: "您好,我是来自达摩院的大规模语言模型,我叫通义千问。" ("Hello, I am a large language model from DAMO Academy; my name is Tongyi Qianwen."), the build succeeded.

    • With tp = 1 (single GPU, the default), run run.py directly with python

      python3 run.py
    • With tp = 2 (two or more GPUs), run run.py with mpirun

      mpirun -n 2 --allow-run-as-root python run.py
    • With the official 24.02 container, multi-GPU runs may fail with: Failed, NCCL error /home/jenkins/agent/workspace/LLM/release-0.8/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:183 'unknown result code'. Install NCCL 2.20.3-1 (either unpack the tarball and add it to the environment variables, or install it via apt); after that it runs normally.

      export LD_LIBRARY_PATH=nccl_2.20.3-1+cuda12.3_x86_64/lib/:$LD_LIBRARY_PATH
      # or, preferably, the following
      apt update && apt-get install -y --no-install-recommends libnccl2=2.20.3-1+cuda12.3 libnccl-dev=2.20.3-1+cuda12.3
  3. Verify model accuracy. Try running summarize.py and compare the ROUGE scores of huggingface and trt-llm. This step downloads a dataset online; users with limited connectivity can refer to: loading HuggingFace datasets offline with the datasets library

    • Run the huggingface version

      python3 summarize.py --test_hf
    • Run the trt-llm version

      python3 summarize.py --test_trt_llm
    • In general, if the trt-llm ROUGE scores are close to huggingface — slightly lower (within 1) or slightly higher (within 2) — the accuracy is considered aligned.

  4. Measure throughput and generation speed. This requires the file ShareGPT_V3_unfiltered_cleaned_split.json.

    • Download it directly with wget or a browser: download link

    • Or download it from Baidu Netdisk: https://pan.baidu.com/s/12rot0Lc0hc9oCb7GxBS6Ng?pwd=jps5 (extraction code: jps5)

    • Place it under examples/qwen2/ as well

    • Before measuring, if you need to change max_input_len/max_new_tokens, edit default_config.py. Changing them is generally not recommended; if you do, rebuild the trt-llm engine so that both backends use the same input lengths.

    • Measure the huggingface model

      python3 benchmark.py --backend=hf --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --hf_max_batch_size=1
    • Measure the trt-llm model (note: --trt_max_batch_size must not exceed the maximum batch_size defined at build time, otherwise you will hit memory errors.)

      python3 benchmark.py --backend=trt_llm --dataset=ShareGPT_V3_unfiltered_cleaned_split.json --trt_max_batch_size=1

Run Guide (SmoothQuant) (highly recommended)

  1. Note: SmoothQuant needs to load the full huggingface model onto the GPU to build the int8 calibration dataset, so make sure beforehand that you have enough GPU memory to load the entire model.

  2. Convert the HuggingFace weights to the FT (FasterTransformer) format. This step downloads a dataset online; users with limited connectivity can refer to: loading HuggingFace datasets offline with the datasets library

    • Single GPU

      python3 hf_qwen_convert.py --smoothquant=0.5
    • Multi-GPU (2 GPUs as an example)

      python3 hf_qwen_convert.py --smoothquant=0.5 --tensor-parallelism=2
  3. Build the trt_engine

    • Single GPU
      python3 build.py --use_smooth_quant --per_token --per_channel
    • Multi-GPU (2 GPUs as an example)
      python3 build.py --use_smooth_quant --per_token --per_channel --world_size 2 --tp_size 2
  4. Once built, run/summarize/benchmark and the rest work exactly as described above.

Run Guide (int8 KV cache)

  1. Note: int8 KV cache needs to load the full huggingface model onto the GPU to build the int8 calibration dataset, so make sure beforehand that you have enough GPU memory to load the entire model.

  2. Convert the HuggingFace weights to the FT (FasterTransformer) format.

    • Single GPU
      python3 hf_qwen_convert.py --calibrate-kv-cache
    • Multi-GPU (2 GPUs as an example)
      python3 hf_qwen_convert.py --calibrate-kv-cache --tensor-parallelism=2
  3. Build int8 weight only + int8 KV cache

    • Single GPU
      python3 build.py --use_weight_only --weight_only_precision=int8 --int8_kv_cache
    • Multi-GPU (2 GPUs as an example)
      python3 build.py --use_weight_only --weight_only_precision=int8 --int8_kv_cache --world_size 2 --tp_size 2

Run Guide (int4-gptq)

  1. Install the auto-gptq module and upgrade transformers to the latest version (it is recommended to use the latest optimum and transformers as well, otherwise the output may be garbled), see issue/68. (Note: after installing these modules you may see a warning that tensorrt_llm's version requirements conflict with other modules; this warning can be ignored.)

    pip install auto-gptq optimum
    pip install transformers -U
  2. Manually produce the calibrated weights (optional)

    • Convert the weights to obtain the scale information; calibration runs on the GPU by default and needs to load the complete model. (Note: for Qwen-7B-Chat V1.0 you can add --device=cpu to try calibrating on the CPU, but it takes a very long time)
      python3 gptq_convert.py
    • Build the TensorRT-LLM engine
      python build.py --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group
    • To save GPU memory (note: usable only for single batch), try adding these two extra flags when building the engine
      python build.py --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --remove_input_padding \
                --enable_context_fmha
  3. Use the official int4 weights, e.g. a Qwen-xx-Chat-Int4 model (recommended)

    • Build the model. Note that both the HF model path and --quant_ckpt_path (the path to the quantized weights) must be set to the same directory. Below is an example for the 32B GPTQ-Int4 model (other GPTQ-Int4 models work the same way)
      python build.py --use_weight_only \
                --weight_only_precision int4_gptq \
                --per_group \
                --hf_model_dir Qwen1.5-32B-Chat-GPTQ-Int4 \
                --quant_ckpt_path Qwen1.5-32B-Chat-GPTQ-Int4
    • Run the model; the tokenizer path must be specified here
      python3 run.py --tokenizer_dir=Qwen1.5-32B-Chat-GPTQ-Int4

Run Guide (int4-awq)

  1. Download and install the nvidia-ammo module (Linux only; Windows is not supported)
    pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo~=0.7.0
  2. Run the int4-awq quantization script to export the calibrated weights.
    python3 quantize.py --export_path ./qwen2_7b_4bit_gs128_awq.pt
  3. Run build.py to build the TensorRT-LLM engine.
    python build.py --use_weight_only \
                    --weight_only_precision int4_awq \
                    --per_group \
                    --quant_ckpt_path ./qwen2_7b_4bit_gs128_awq.pt
  4. To save GPU memory (note: usable only for single batch), try adding these two extra flags when building the engine
    python build.py --use_weight_only \
                    --weight_only_precision int4_awq \
                    --per_group \
                    --remove_input_padding \
                    --enable_context_fmha \
                    --quant_ckpt_path ./qwen2_7b_4bit_gs128_awq.pt

Other Applications

  1. Try chatting in the terminal. Run the command below, type your question, and press Enter.

    python3 cli_chat.py
    
  2. Deploy the API and chat through it.

    • Deploy the API
    python3 api.py

    • Open another terminal and go to the qwen2/client directory; it contains 4 files, each demonstrating a different way to call the API (a minimal standalone sketch also follows after this list).
    • async_client.py calls the API asynchronously and streams output over SSE.
    • normal_client.py calls the API synchronously with a plain HTTP POST request; no streaming — a request returns only after the model has finished generating.
    • openai_normal_client.py calls the deployed API through the openai module; this example is non-streaming, so a request returns only after the model has finished generating.
    • openai_stream_client.py calls the deployed API through the openai module with streaming.
    • Note: pydantic must be version >= 2.3.2, otherwise you will get the error 'ChatCompletionResponse' object has no attribute 'model_dump_json'; see the issue
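
    • A minimal standalone sketch of calling the deployed API through the openai Python package (openai >= 1.0). The model name "qwen" and the dummy api_key are assumptions; the scripts in qwen2/client remain the authoritative examples:

      from openai import OpenAI

      # point the client at the locally deployed, OpenAI-compatible api.py (port 8000)
      client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

      # stream a chat completion and print tokens as they arrive
      stream = client.chat.completions.create(
          model="qwen",  # placeholder model name; check the server for the real one
          messages=[{"role": "user", "content": "你好,请介绍一下你自己。"}],
          stream=True,
      )
      for chunk in stream:
          delta = chunk.choices[0].delta.content
          if delta:
              print(delta, end="", flush=True)
      print()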
  3. Try the web chat UI (optional; the API must be deployed first). Run the command below, then open http://127.0.0.1:7860 in a local browser.

    python3 web_demo.py

    • The default configuration in web_demo.py is:
    demo.queue().launch(share=False, inbrowser=True)

    • If running on a server, change it to:
    demo.queue().launch(server_name="0.0.0.0", share=False, inbrowser=False)

    • web_demo parameter notes
      • share=True: tunnels the site to the public internet through a random temporary public domain valid for 3 days. This option is not very secure and may expose the server to attacks; not recommended.
      • inbrowser=True: automatically opens a browser after the service starts. Fine on a local machine; not recommended on a server, which has no browser to open.
      • server_name="0.0.0.0": allows access from any IP, suited for servers; you can then open http://[your ip]:7860 to see the page. Without this option, only the machine running the demo can access it.
      • share=False: access only via LAN or the public IP; no public gradio domain is generated.
      • inbrowser=False: do not open a browser after deployment; suited for servers.
  4. web_demo in action (test platform: RTX 4080, qwen2-7b-chat, int4 weight only)

(video: TRT-LLM.for.Qwen-7B.mp4)

Advanced

  1. Deploy tritonserver following this tutorial: deploying TensorRT-LLM with Triton 24.02 for HTTP queries
  2. Use this project to wrap tritonserver with an OpenAI-compatible API: https://github.com/zhaohb/fastapi_tritonserver
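  3. LangChain integration: since the API served by api.py is OpenAI-compatible, LangChain can talk to it through its standard OpenAI chat wrapper. A minimal sketch (assumes the langchain-openai package; the model name and dummy key are placeholders — see docs/trt_llm_deploy_langchain.md for the maintained walkthrough):

     from langchain_openai import ChatOpenAI

     # point LangChain's OpenAI-compatible chat model at the local api.py endpoint
     llm = ChatOpenAI(
         base_url="http://127.0.0.1:8000/v1",  # TensorRT-LLM api.py address
         api_key="none",                       # local server; key is a placeholder
         model="qwen",                         # placeholder model name
     )
     print(llm.invoke("用一句话介绍一下TensorRT-LLM。").content)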


qwen-tensorrt-llm's Issues

AttributeError: '_Runtime' object has no attribute 'address'

Loading engine from /app/trt_engines/fp16/1-gpu/Qwen-7B-Chat_float16_tp1_rank0.engine
[12/17/2023-10:42:55] [TRT] [E] 6: The engine plan file is generated on an incompatible device, expecting compute 8.6 got compute 8.0, please rebuild.
[12/17/2023-10:42:55] [TRT] [E] 2: [engine.cpp::deserializeEngine::982] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
Traceback (most recent call last):
File "/app/llmqa2.py", line 98, in
qa()
File "/app/llmqa2.py", line 43, in qa
decoder = QWenForCausalLMGenerationSession(
File "/app/run.py", line 40, in init
super().init(
File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 306, in init
self.runtime = _Runtime(engine_buffer, mapping)
File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 143, in init
self.__prepare(mapping, engine_buffer)
File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 161, in __prepare
assert self.engine is not None
AssertionError
Exception ignored in: <function _Runtime.del at 0x7f9859bab7f0>
Traceback (most recent call last):
File "/app/tensorrt_llm/tensorrt_llm/runtime/generation.py", line 232, in del
cudart.cudaFree(self.address)
AttributeError: '_Runtime' object has no attribute 'address'

AttributeError: 'QWenConfig' object has no attribute 'intermediate_size'

$python3 build.py --remove_input_padding --enable_context_fmha
Traceback (most recent call last):
File "/usr/local/Qwen-7B-Chat-TensorRT-LLM/qwen/build.py", line 713, in
args = parse_arguments()
File "/usr/local/Qwen-7B-Chat-TensorRT-LLM/qwen/build.py", line 404, in parse_arguments
args.inter_size = hf_config.intermediate_size # override the inter_size for QWen
File "/opt/conda/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in getattribute
return super().getattribute(key)
AttributeError: 'QWenConfig' object has no attribute 'intermediate_size'

Why does this happen?

Problem when running build

from tensorrt_llm.models import (
ImportError: cannot import name 'weight_only_groupwise_quantize' from 'tensorrt_llm.models' (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/__init__.py)

My tensorrt-llm version is 0.6.0.

Use official int4 weights, e.g. Qwen-1_8B-Chat-Int4 model (recommended) - Build TRT-LLM engine

When I run the example command
python build.py --hf_model_dir Qwen-1_8B-Chat-Int4 \
--quant_ckpt_path Qwen-1_8B-Chat-Int4 \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--use_weight_only \
--weight_only_precision int4_gptq \
--per_group \
--world_size 1 \
--tp_size 1 \
--output_dir ./tmp/Qwen/1.8B/trt_engines/int4-gptq/1-gpu
it just hangs at
(screenshot: 企业微信截图_1703647252769) — is this normal?

Is In-flight Batching supported?

Hi, thank you very much for adapting Qwen. Does TensorRT-LLM's in-flight batching feature require Triton deployment, or can it be used from Python? For example, does the api.py in your repo enable it by default? Throughput measurements only really feel meaningful with this feature enabled.

On integrating TensorRT-LLM with LangChain

Followed the doc https://github.com/Tlntin/Qwen-7B-Chat-TensorRT-LLM/blob/release/0.5.0/docs/trt_llm_deploy_langchain.md

langchain-chatchat version 0.2.6

With the API confirmed to be running and callable (tested from the same docker container, and with Postman on Windows),

importing it into langchain-chatchat produced an error
(screenshot)

I traced it to what may be this part of the doc:

Edit the model config file configs/model_config.py and change the OpenAI url to the address of your TensorRT-LLM API
Before:
"OpenAI": {
"model_name": "your openai model name(such as gpt-4)",
"api_base_url": "https://api.openai.com/v1",
"api_key": "your OPENAI_API_KEY",
"openai_proxy": "",
},
After:
"OpenAI": {
"model_name": "gpt-3.5-turbo",
"api_base_url": "http://127.0.0.1:8000/v1",
"api_key": "",
"openai_proxy": "",
},

Does qwen currently not support tensor parallelism?

I tested it; tensor parallelism throws an error. Details below:
[10/25/2023-09:08:16] [TRT] [E] 3: [executionContext.cpp::setInputShape::2257] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2257, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
Traceback (most recent call last):
File "/api/example/tensorrt_llm/qwen/summarize.py", line 380, in
main(args)
File "/api/example/tensorrt_llm/qwen/summarize.py", line 257, in main
summary, _ = summarize_tensorrt_llm(datapoint)
File "/api/example/tensorrt_llm/qwen/summarize.py", line 230, in summarize_tensorrt_llm
output_ids = tensorrt_llm_qwen.decode(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 514, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1887, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1659, in decode_regular
should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1430, in handle_per_step
self.runtime._set_shape(context, ctx_shape)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 202, in _set_shape
raise ValueError(
ValueError: Couldn't assign past_key_value_0 with shape torch.Size([1, 2, 8, 374, 128]), engine supports [min, opt, max] = [(1, 2, 16, 0, 128), (1, 2, 16, 768, 128), (2, 2, 16, 1536, 128)]

Abnormal output from Qwen-14B-Chat after int4 conversion

After building Qwen-14B-Chat with python3 build.py --use_weight_only --weight_only_precision=int4 and then running python run.py, the inference output is abnormal — it prints a pile of code. Perhaps the 14B model code differs from the earlier 7B code, but I hope this can be fixed.
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "added by the user
@throws Exception
public function loadTemplate($file)
{
if (!file_exists($file)) {
throw new Exception("The template file $file does not exist.");
}

    $this->template = file_get_contents($file);
},-->>>

public function execute()
{
$this->table = $this->model->getTableName();
$this->columns = $this->model->getColumns();
$this->data = $this->model->getData();
$this->result = $this->model->getResult();

The code above is the model's output, not an error message; the output itself is what is abnormal.

Error when building the image

The error message is as follows:

121.8 [ 98%] Built target runtime_src
1434.0 [ 98%] Built target kernels_src
1434.0 [ 98%] Linking CXX static library libtensorrt_llm_static.a
1444.0 [ 98%] Built target tensorrt_llm_static
1444.0 [100%] Linking CXX shared library libtensorrt_llm.so
1444.1 /usr/bin/ld:/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a: file format not recognized; treating as linker script
1444.1 /usr/bin/ld:/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a:1: syntax error
1444.1 collect2: error: ld returned 1 exit status
1444.1 gmake[3]: *** [tensorrt_llm/CMakeFiles/tensorrt_llm.dir/build.make:714: tensorrt_llm/libtensorrt_llm.so] Error 1
1444.1 gmake[2]: *** [CMakeFiles/Makefile2:677: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/all] Error 2
1444.1 gmake[1]: *** [CMakeFiles/Makefile2:684: tensorrt_llm/CMakeFiles/tensorrt_llm.dir/rule] Error 2
1444.1 gmake: *** [Makefile:179: tensorrt_llm] Error 2
1444.1 Traceback (most recent call last):
1444.1   File "/src/tensorrt_llm/scripts/build_wheel.py", line 248, in <module>
1444.1     main(**vars(args))
1444.1   File "/src/tensorrt_llm/scripts/build_wheel.py", line 152, in main
1444.1     build_run(
1444.1   File "/usr/lib/python3.10/subprocess.py", line 526, in run
1444.1     raise CalledProcessError(retcode, process.args,
1444.1 subprocess.CalledProcessError: Command 'cmake --build . --config Release --parallel 112 --target tensorrt_llm tensorrt_llm_static nvinfer_plugin_tensorrt_llm th_common ' returned non-zero exit status 2.
------
Dockerfile.multi:49
--------------------
  47 |
  48 |     ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt"
  49 | >>> RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}
  50 |
  51 |     FROM devel as release
--------------------
ERROR: failed to solve: process "/bin/bash -c python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}" did not complete successfully: exit code: 1

Any guidance would be appreciated.

Testing question

In the new v0.5.0 branch, I see you are using the upstream TensorRT-LLM — does that mean none of your own optimizations (such as the CUDA code changes) are included?

I built with the upstream TensorRT-LLM.

With W4A16 quantization I only get 87 tokens/s, not 239.98 tokens/s. I only tested a single request, input length 77, output length 31.

Qwen-14B-Chat multi-batch error

Hi, running Qwen-14B-Chat with multiple batches fails with an error, while Qwen-7B-Chat does not have this problem. Could you take a look?
1. build command
python build.py --hf_model_dir /mnt/workspace/model_hub/third_party/Qwen-14B-Chat \
--dtype float16 \
--max_batch_size 64 \
--max_input_len 2048 \
--max_new_tokens 2048 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir ./Qwen-14B-Chat/trt_engines/fp16/1-gpu/

2. Error message
Namespace(backend='trt_llm', dataset='/mnt/workspace/zsf/code/lmdeploy/lmdeploy/benchmark/dataset_100.txt', hf_model_dir='/mnt/nas_public_data/model_hub/third_party/Qwen-14B-Chat', tokenizer_dir='/mnt/nas_public_data/model_hub/third_party/Qwen-14B-Chat', engine_dir='./Qwen-14B-Chat/trt_engines/fp16/1-gpu', n=1, num_prompts=16, seed=0, hf_max_batch_size=1, trt_max_batch_size=8, chat_format='chatml')
Loading engine from ./Qwen-14B-Chat/trt_engines/fp16/1-gpu/qwen_float16_tp1_rank0.engine
0%| | 0/16 [00:00<?, ?it/s][12/27/2023-11:10:19] [TRT] [E] 3: [executionContext.cpp::setInputShape::2278] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setInputShape::2278, condition: engineDims.d[i] == dims.d[i] Static dimension mismatch while setting input shape.)
44%|███████████████████████████████████████████████████████████▉ | 7/16 [00:00<00:00, 104.02it/s]
Traceback (most recent call last):
File "/mnt/workspace/zsf/code/TensorRT-LLM/Qwen-TensorRT-LLM/examples/qwen/benchmark.py", line 475, in
main(args)
File "/mnt/workspace/zsf/code/TensorRT-LLM/Qwen-TensorRT-LLM/examples/qwen/benchmark.py", line 358, in main
elapsed_time, total_num_tokens, sum_total_generate_tokens= run_trt_llm(
File "/mnt/workspace/zsf/code/TensorRT-LLM/Qwen-TensorRT-LLM/examples/qwen/benchmark.py", line 173, in run_trt_llm
output_ids = decoder.generate(
File "/mnt/workspace/zsf/code/TensorRT-LLM/Qwen-TensorRT-LLM/examples/qwen/run.py", line 135, in generate
output_ids = self.decode(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 558, in wrapper
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 2109, in decode
return self.decode_regular(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1859, in decode_regular
should_stop, next_step_buffer, tasks, context_lengths, host_context_lengths, attention_mask, logits, encoder_input_lengths = self.handle_per_step(
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 1617, in handle_per_step
self.runtime._set_shape(context, ctx_shape)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py", line 210, in _set_shape
raise ValueError(
ValueError: Couldn't assign input_ids with shape torch.Size([8, 789]), engine supports [min, opt, max] = [(1, 1), (1, 64), (1, 131072)]

A question about paged_kv_cache

Is paged_kv_cache an implementation of the paged attention concept proposed in vLLM?

After Smooth quantization, enabling paged_kv_cache is actually somewhat slower:

Disabled: about 50 tokens/s, roughly 10 GB of GPU memory used
Enabled: about 46 tokens/s, GPU memory nearly full at roughly 21 GB

vLLM has no SmoothQuant implementation; running pure FP16, vLLM reaches about 29 tokens/s.

If paged_kv_cache is the vLLM-style implementation that optimizes the KV cache, then after smoothquant quantization it should not be slower than with it disabled.

Test environment: A10 * 4


RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Unsupported Arch (/opt/tritonserver/TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:147)

Error on a P40:

python3 build.py --use_weight_only --weight_only_precision=int8
[10/28/2023-03:06:29] [TRT-LLM] [I] Serially build TensorRT engines.
[10/28/2023-03:06:29] [TRT] [I] [MemUsageChange] Init CUDA: CPU +9, GPU +0, now: CPU 111, GPU 148 (MiB)
[10/28/2023-03:06:30] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +245, GPU +40, now: CPU 491, GPU 188 (MiB)
[10/28/2023-03:06:30] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[10/28/2023-03:06:49] [TRT-LLM] [I] Loading HF QWen ... from /data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/qwen_7b_chat
[10/28/2023-03:06:49] Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 15.47it/s]
[10/28/2023-03:06:50] [TRT-LLM] [I] HF QWen loaded. Total time: 00:00:01
[10/28/2023-03:06:50] [TRT-LLM] [I] Loading weights from HF QWen...
Converting...:   1%|▏                           | 2/259 [00:03<06:40,  1.56s/it]
Traceback (most recent call last):
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/build.py", line 645, in <module>
    build(0, args)
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/build.py", line 615, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/build.py", line 480, in build_rank_engine
    load_from_hf_qwen(
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/weight.py", line 529, in load_from_hf_qwen
    processed_torch_weights, torch_weight_scales = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 692, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: Unsupported Arch (/opt/tritonserver/TensorRT-LLM/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp:147)
1       0x7fc0c270bb7e tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fc0c279683e tensorrt_llm::kernels::cutlass_kernels::getLayoutDetailsForTransform(tensorrt_llm::kernels::cutlass_kernels::QuantType) + 430
3       0x7fc0c2796959 tensorrt_llm::kernels::cutlass_kernels::preprocess_weights_for_mixed_gemm(signed char*, signed char const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 57
4       0x7fc0c279dd3e void tensorrt_llm::kernels::cutlass_kernels::symmetric_quantize<__half, __half>(signed char*, signed char*, __half*, __half const*, std::vector<unsigned long, std::allocator<unsigned long> > const&, tensorrt_llm::kernels::cutlass_kernels::QuantType) + 1502
5       0x7fc0c2745c5d torch_ext::symmetric_quantize_helper(at::Tensor, c10::ScalarType, bool) + 2141
6       0x7fc0c2745e76 torch_ext::symmetric_quantize_last_axis_of_batched_matrix(at::Tensor, c10::ScalarType) + 70
7       0x7fc0c274b28d c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::vector<at::Tensor, std::allocator<at::Tensor> > (*)(at::Tensor, c10::ScalarType), std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::guts::typelist::typelist<at::Tensor, c10::ScalarType> >, true>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 141
8       0x7fc1c85b5362 c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const + 562
9       0x7fc1c834c8c3 torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::kwargs const&, c10::optional<c10::DispatchKey>) + 1155
10      0x7fc1c834d1b8 torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol, pybind11::args, pybind11::kwargs const&, bool, c10::optional<c10::DispatchKey>) + 1448
11      0x7fc1c82311a0 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x8021a0) [0x7fc1c82311a0]
12      0x7fc1c7e1dac4 /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so(+0x3eeac4) [0x7fc1c7e1dac4]
13      0x55d806e73e0e python3(+0x15fe0e) [0x55d806e73e0e]
14      0x55d806e8312b PyObject_Call + 187
15      0x55d806e5f2c1 _PyEval_EvalFrameDefault + 11121
16      0x55d806e69784 _PyObject_FastCallDictTstate + 196
17      0x55d806e7f54c _PyObject_Call_Prepend + 92
18      0x55d806f981e0 python3(+0x2841e0) [0x55d806f981e0]
19      0x55d806e6a5eb _PyObject_MakeTpCall + 603
20      0x55d806e631f1 _PyEval_EvalFrameDefault + 27297
21      0x55d806e7470c _PyFunction_Vectorcall + 124
22      0x55d806e5e0d1 _PyEval_EvalFrameDefault + 6529
23      0x55d806e7470c _PyFunction_Vectorcall + 124
24      0x55d806e5ce0d _PyEval_EvalFrameDefault + 1725
25      0x55d806e7470c _PyFunction_Vectorcall + 124
26      0x55d806e5ce0d _PyEval_EvalFrameDefault + 1725
27      0x55d806f4de56 python3(+0x239e56) [0x55d806f4de56]
28      0x55d806f4dcf6 PyEval_EvalCode + 134
29      0x55d806f787d8 python3(+0x2647d8) [0x55d806f787d8]
30      0x55d806f720bb python3(+0x25e0bb) [0x55d806f720bb]
31      0x55d806f78525 python3(+0x264525) [0x55d806f78525]
32      0x55d806f77a08 _PyRun_SimpleFileObject + 424
33      0x55d806f77653 _PyRun_AnyFileObject + 67
34      0x55d806f6a41e Py_RunMain + 702
35      0x55d806f40cad Py_BytesMain + 45
36      0x7fc21de75d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fc21de75d90]
37      0x7fc21de75e40 __libc_start_main + 128
38      0x55d806f40ba5 _start + 37
root@941c037c9f0f:/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen# python3 run.py
Traceback (most recent call last):
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/run.py", line 516, in <module>
    generate(**vars(args))
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/run.py", line 397, in generate
    ) = get_model(tokenizer_dir, engine_dir, log_level)
  File "/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/run.py", line 305, in get_model
    with open(config_path, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/data/llm/Qwen-7B-Chat-TensorRT-LLM/qwen/trt_engines/fp16/1-gpu/config.json'

CUDA 12.2, driver: 535

What is the cause, and how can it be fixed?

TRT-LLM web_demo demonstration

Test platform:
Hardware: RTX 4080
Software: built from this repo's release/0.5.0 branch (a direct port of the upstream TRT-LLM release/0.5.0 branch), using int4 weight only.

Demo video:

TRT-LLM.for.Qwen-7B.mp4

GPU memory usage

A question, please.
The Qwen 7B model parameters alone take 14 GB;
looking at the code, using int4 or int8 requires first loading the fp16 weights and then producing the quantized weights through an operator, so GPU memory is bound to blow up at that point. Have you measured the actual memory usage?

Garbled inference output after converting qwen-14b-chat-int4

Loading engine from /app/trt_engines/fp16/1-gpu/Qwen-14B-Chat-Int4_float16_tp1_rank0.engine
Input: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你好,请问你叫什么?<|im_end|>
<|im_start|>assistant
"
Output: "odos或多扫玩操的最后一操.hd荟ulas Schwe螺丝ass服务平台的日子onde的各项插督 unh操=msgitousRaw鄢操才行irth妩诗鄢绿中心城市挂=color诊所荟或多或多立马署缫玩尚 slip的效果ations鄢映绿色令牌чество挂的优点 Uma白天耍uner_detectoronde绿挂蓦五六大事 Loapo天使操亲切赖以Q upside合适 Bren挂挂 }];

操或多挂де-fw首位或多���挂挂挂挂挂逐走势图挂 peoplecontin舔放射 +/-操蓦挂挂中途挂操大户acs挂 ±挂挂 Lunar的心-fw挂olly挂心得操操挂挂さ Serv操挂操苕已然挂操挂操操enido令牌鄢зон双赢的心操挂 Daughter亲切打造成挂蓦挂挂操翎onde挂挂绿 multic天使操ilater走势图чество挂 hanging天使挂操苌挂首选挂开业操旅游局挂_yaml当前位置挂和平 slam挂 Lon操挂钟挂ations挂把自己的挂的效果的缘钟挂per挂挂挂牌亲切"

cnn_dailymail

Hi, why does the downloaded model need python3 hf_qwen_convert.py to be run, and why does that script then go and load this dataset? After downloading it, it still never manages to load. Please explain — thanks!
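
A hedged workaround for offline environments (assuming the script loads the standard cnn_dailymail dataset through the datasets library): download the dataset once on a machine with internet access, copy the HuggingFace cache to the offline machine, and force offline mode.

  # on a machine with internet access, populate ~/.cache/huggingface/datasets
  python3 -c "from datasets import load_dataset; load_dataset('cnn_dailymail', '3.0.0')"
  # copy ~/.cache/huggingface/ to the offline machine, then run the convert script with:
  export HF_DATASETS_OFFLINE=1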

Got slower speed using smooth quant

Hi, I converted and built the qwen 14b model with the following commands

python3 hf_qwen_convert.py --smoothquant=0.5
python3 build.py --use_smooth_quant --per_token --per_channel

In testing, however, for the same output length, smooth quant is about 10% slower than int8 weight only (and int8 weight only's output is closer to the fp16 model's output). How should I go about debugging this? Thanks.

Here are some warnings from the conversion process:

[TensorRT-LLM][WARNING] Cannot profile configuration 18 (for m=1, n=15360, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 0 (for m=2, n=15360, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 23 (for m=2, n=15360, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 0 (for m=4, n=15360, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 23 (for m=4, n=15360, k=5120). Skipped
...
[TensorRT-LLM][WARNING] Cannot profile configuration 0 (for m=2048, n=5120, k=13696). Skipped
...
[TensorRT-LLM][WARNING] Cannot profile configuration 22 (for m=4096, n=5120, k=13696). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 23 (for m=4096, n=5120, k=13696). Skipped

Contents of the trt_engine config.json

{
  "builder_config": {
    "fp8": false,
    "hidden_act": "silu",
    "hidden_size": 5120,
    "int8": true,
    "max_batch_size": 2,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 2048,
    "max_position_embeddings": 8192,
    "name": "qwen",
    "num_heads": 40,
    "num_layers": 40,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 30,
    "tensor_parallel": 1,
    "use_refit": false,
    "vocab_size": 152064
  },
  "plugin_config": {
    "attention_qk_half_accumulation": false,
    "bert_attention_plugin": false,
    "context_fmha_type": 0,
    "gemm_plugin": "float16",
    "gpt_attention_plugin": "float16",
    "identity_plugin": false,
    "layernorm_plugin": false,
    "layernorm_quantization_plugin": false,
    "lookup_plugin": false,
    "nccl_plugin": false,
    "paged_kv_cache": false,
    "quantize_per_token_plugin": true,
    "quantize_tensor_plugin": true,
    "remove_input_padding": false,
    "rmsnorm_plugin": false,
    "rmsnorm_quantization_plugin": false,
    "smooth_quant_gemm_plugin": "float16",
    "tokens_per_block": 0,
    "use_custom_all_reduce": false,
    "weight_only_groupwise_quant_matmul_plugin": false,
    "weight_only_quant_matmul_plugin": false
  }
}

Triton uses twice as much GPU memory as TensorRT-LLM

I am testing qwen-72b with --weight_only_precision int4, loaded across 4 GPUs at about 12 GB each. Yet when serving with Triton, each GPU uses around 28 GB. Why is the gap so large?

int4 gptq conversion fails on a 2080 Ti (22 GB)

python build.py --use_weight_only --weight_only_precision int4_gptq --per_group

Error:
[TensorRT-LLM][WARNING] Cannot profile configuration 0 (for m=1, n=1920, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 1 (for m=1, n=1920, k=5120). Skipped
[TensorRT-LLM][WARNING] Cannot profile configuration 2 (for m=1, n=1920, k=5120). Skipped
[TensorRT-LLM][WARNING] Have not found any valid GEMM config for shape (m=1, n=1920, k=5120). Will try to use default or fail at runtime

Error running cli_chat.py after building

I pulled the official image; our hardware is a V100 16 GB. I built qwen1.8b, 14b, and 72b, and inference after building fails with the errors below in every case. I am not sure where the problem is — could you help? The company does not allow screenshots to be copied out, so please look at the photos, thanks.
(4 photos attached)

Empty output for long sequences (>2048)

The symptom looks very similar to the description below, but the RoPE in the gpt attention plugin is already used by default — what could the problem be?

//////////////////////////
Quoted from the README: Full support for the original logn and NTK (these two parameters improve generation quality for long inputs, meaning inputs longer than 2048 and shorter than 8192). However, due to certain trt-llm bugs, when the input length is > 2048 the actual output becomes very short or even empty, see https://github.com/NVIDIA/trt-samples-for-hackathon-cn/issues/90. Also, computing RoPE inside the gpt attention plugin is faster, so logn was commented out.

Bug when deploying the API and trying the web chat

Of the four API client examples provided, async_client.py, normal_client.py, and openai_normal_client.py all run normally when tried in turn.

openai_stream_client.py reaches its prompt successfully, but the following error appears during interaction.
Error in the python openai_stream_client.py window:
(screenshot)

Error in the python api.py window:
(screenshot)

Likewise, when running python web_demo.py, the web UI starts normally but a similar error appears during interaction.
In the python web_demo.py window:
(screenshot)
In the python api.py window:
(screenshot)
Both problems seem to be caused by the same thing: model_dump_json is missing.
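
As noted in the README above, this error comes from an older pydantic; upgrading it usually resolves the missing model_dump_json (a one-line fix, assuming pip manages this environment):

  pip install -U "pydantic>=2.3.2"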

Error running qwen/run.py: libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory

Traceback (most recent call last):
File "/workspace/Qwen-7B-Chat-TensorRT-LLM-main/Qwen-7B-Chat-TensorRT-LLM-main/tensorrt_llm_july-release-v1/examples/qwen/run.py", line 17, in
import tensorrt_llm
File "/workspace/Qwen-7B-Chat-TensorRT-LLM-main/Qwen-7B-Chat-TensorRT-LLM-main/tensorrt_llm_july-release-v1/tensorrt_llm/init.py", line 43, in
_common._init(log_level="error")
File "/workspace/Qwen-7B-Chat-TensorRT-LLM-main/Qwen-7B-Chat-TensorRT-LLM-main/tensorrt_llm_july-release-v1/tensorrt_llm/_common.py", line 25, in _init
_load_plugin_lib()
File "/workspace/Qwen-7B-Chat-TensorRT-LLM-main/Qwen-7B-Chat-TensorRT-LLM-main/tensorrt_llm_july-release-v1/tensorrt_llm/plugin/plugin.py", line 15, in _load_plugin_lib
handle = ctypes.CDLL(plugin_lib, mode=ctypes.RTLD_GLOBAL)
File "/opt/conda/lib/python3.10/ctypes/init.py", line 374, in init
self._handle = _dlopen(self._name, mode)
OSError: /workspace/Qwen-7B-Chat-TensorRT-LLM-main/Qwen-7B-Chat-TensorRT-LLM-main/tensorrt_llm_july-release-v1/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so: cannot open shared object file: No such file or directory

Is something in the environment not installed properly?

Strange web_demo.py bug with 14B-4bit: repeated output

It repeats its output over and over, as shown below:
(screenshot)

Conversation:

你好!有什么我可以帮助你的吗?

来一首宋词

好的,这是一首宋词:
《青玉案·元夕》
辛弃疾
东风夜放花千树,更吹落,星如雨。
宝马雕车香满路,凤箫声动,玉壶光转,一夜鱼龙舞。
蛾儿雪柳黄金缕,笑语盈盈暗香去。
众里寻他千百度,蓦然回首,那人却在,灯火阑珊处。

来一首唐诗

好的,这是一首唐诗:
《静夜思》
李白
床前明月光,疑是地上霜。
举头望明月,低头思故乡。

一夜鱼龙吴是谁的词

《青玉案·元夕》是辛弃疾的词。

完整的是什么

《青玉案·元夕》全文如下:
东风夜放花千树,更吹落,星如雨。
宝马雕车香满路,凤箫声动,玉壶光转,一夜鱼龙舞。
蛾儿雪柳黄金缕,笑语盈盈暗香去。
众里寻他千百度,蓦然回首,那人却在,灯火阑珊处。

辛弃疾是谁?

辛弃疾(1140年5月28日-1207年10月3日),字幼安,号稼轩,山东东路济南府历城县(今山东省济南市历城区遥墙镇四凤闸村)人,南宋官员、文学家、词人,婉约派词人代表,有“词中之龙”之称。

当过什么官?

辛弃疾在南宋时期曾任过许多官职,包括:通判建康府、知抚州、知荆南府、知潭州、知江宁府、知建康府、知临安府、知洪州、知福州、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府、知建康府、知江宁府

There is no error message on the console. Model: 14B-4bit; this conversation behaves normally with the original (unquantized) model.
Here is the same repetition with int8:
(screenshot)

Problem when using Langchain-Chatchat: Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it

I have already followed https://github.com/Tlntin/Qwen-7B-Chat-TensorRT-LLM/blob/release/0.5.0/docs/trt_llm_deploy_langchain.md.

Including:

3) Edit the model config file configs/model_config.py and change LLM_MODEL to OpenAI

"api_base_url": "https://api.openai.com/v1",
"api_key": "your OPENAI_API_KEY",

Changed it to localhost:8000/v1 and confirmed it is reachable.

In the web UI:
(screenshot)

I noticed the entry appears twice. Picking either openai entry, the chat fails:
(screenshot)
The console shows:

 Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass  `openai_api_key` a>
2023-11-18 10:11:31,927 - utils.py[line:188] - ERROR: RemoteProtocolError: API通信遇到错误:peer closed connection without sendi>
{'base_url': 'http://127.0.0.1:7861', 'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://12>
{'timeout': 300.0, 'proxies': {'all://127.0.0.1': None, 'all://localhost': None, 'http://127.0.0.1': None, 'http://': None, 'htt>


Because my port 20000 was already in use, I changed 20000 to 23000 in server_config.py.

Testing separately at localhost:7861/docs#/Chat/openai_chat_chat_fastchat_post
gives this result:

(screenshot)

There is no response, but the console shows:

INFO:     127.0.0.1:58432 - "POST /chat/chat HTTP/1.1" 200 OK
2023-11-18 10:23:01,528 - _client.py[line:1013] - INFO: HTTP Request: POST http://127.0.0.1:7861/chat/chat "HTTP/1.1 200 OK"
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_>
    result = await app(  # type: ignore[func-returns-value]
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/incar/miniconda3/envs/llm/lib/python3.9/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)

pydantic.error_wrappers.ValidationError: 1 validation error for ChatOpenAI
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass  `openai_api_key` >
2023-11-18 10:23:01,529 - utils.py[line:188] - ERROR: RemoteProtocolError: API通信遇到错误:peer closed connection without sendi>
INFO:     10.147.20.80:50467 - "GET /swagger-ui-bundle.js.map HTTP/1.1" 404 Not Found
INFO:     10.147.20.80:50468 - "GET /swagger-ui.css.map HTTP/1.1" 404 Not Found
openai.api_key='EMPTY'
openai.api_base='http://127.0.0.1:23000/v1'
model='OpenAI' messages=[OpenAiMessage(role='user', content='hello')] temperature=0.7 n=1 max_tokens=0 stop=[] stream=False pres>


It looks like it sends a request to http://127.0.0.1:23000/v1 and then fails. I thought it should be changed to port 8000, but after changing it the program refuses to start, saying the port is already in use. So the problem seems to be in the forwarding from http://127.0.0.1:23000/v1 to http://127.0.0.1:8000/v1. Is that the issue, and how can it be resolved?
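
A hedged workaround: LangChain's ChatOpenAI refuses to initialize without some OPENAI_API_KEY value, even if the local api.py never checks it, so exporting a dummy key before starting Langchain-Chatchat (or putting any non-empty string in the api_key field of model_config.py) usually clears this particular validation error:

  export OPENAI_API_KEY=EMPTY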

qwen-14b-4bit conversion fails on a 3090

I modified the config file.

Command:
python3 build.py --use_weight_only --weight_only_precision=int4

[11/06/2023-11:38:58] [TRT-LLM] [I] Serially build TensorRT engines.
[11/06/2023-11:38:58] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 115, GPU 271 (MiB)
[11/06/2023-11:39:02] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2050, GPU 583 (MiB)
[11/06/2023-11:39:02] [TRT-LLM] [W] Invalid timing cache, using freshly created one
 /app/tensorrt_llm/examples/qwen/c-model/Qwen-14B-Chat-Int4/1-gpu not exists, will get weight from qwen local 
[11/06/2023-11:39:24] [TRT-LLM] [I] Loading HF QWen ... from /app/tensorrt_llm/examples/qwen/Qwen-14B-Chat-Int4
[11/06/2023-11:39:24] Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日前已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
[11/06/2023-11:39:24] Try importing flash-attention for faster inference...
...
[11/06/2023-12:16:08] [TRT] [W] Tactic Device request: 40125MB Available: 24259MB. Device memory is insufficient to use tactic.
[11/06/2023-12:16:08] [TRT] [W] UNSUPPORTED_STATESkipping tactic 3 due to insufficient memory on requested size of 40125 detected for tactic 0x000000000000001b.
[11/06/2023-12:16:08] [TRT] [W] Tactic Device request: 40125MB Available: 24259MB. Device memory is insufficient to use tactic.
[11/06/2023-12:16:08] [TRT] [W] UNSUPPORTED_STATESkipping tactic 4 due to insufficient memory on requested size of 40125 detected for tactic 0x000000000000001f.
[11/06/2023-12:16:13] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[11/06/2023-12:16:13] [TRT] [I] Detected 48 inputs and 41 output network tensors.
[11/06/2023-12:16:13] [TRT] [E] 4: Internal error: plugin node QWenForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_1 requires 31664978560 bytes of scratch space, but only 25438126080 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().

[11/06/2023-12:16:13] [TRT] [E] 4: [pluginV2Builder.cpp::makeRunner::519] Error Code 4: Internal Error (Internal error: plugin node QWenForCausalLM/layers/0/attention/PLUGIN_V2_GPTAttention_1 requires 31664978560 bytes of scratch space, but only 25438126080 is available. Try increasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
)
[11/06/2023-12:16:13] [TRT-LLM] [E] Engine building failed, please check the error log.
[11/06/2023-12:16:13] [TRT-LLM] [I] Config saved to /app/tensorrt_llm/examples/qwen/trt_engines/int4/1-gpu/config.json.
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/qwen/build.py", line 725, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/qwen/build.py", line 697, in build
    assert engine is not None, f'Failed to build engine for rank {cur_rank}'
AssertionError: Failed to build engine for rank 0

Clearly out of GPU memory. In principle it should work, since the 3090 has 24 GB and 14B int4 should need less than 10 GB, so the problem must be in the conversion process. Can it be fixed?

Can you export the image?

Building the docker image is a bit of a pain — connections keep timing out. Could you export the image to a netdisk? ^~^

Problem compiling tensorrt-llm on AutoDL

Hi, I have watched your Bilibili videos (liked, coined, and favorited) and read your posts, and I am also compiling tensorrt-llm on AutoDL. But every time the build reaches 100% (without actually finishing — no build folder and no tensorrt-llm .whl file are produced), the server disconnects. I also noticed that the system disk usage spikes during compilation, reaching about 80% every time (even though I already ran export TMPDIR=/root/autodl-tmp). Why does the system disk balloon, and is that why the server disconnects?
