Comments (5)
what framework are you using?
from qwen.
what framework are you using?
I'm using the official script from the qwen GitHub repo.
from qwen.
which one?
from qwen.
which one?
from qwen.
The web_demo.py script is a demonstration tool and is explicitly not intended for production deployment, as it lacks production-grade capabilities. It can accept multiple concurrent requests, but it does not process them in parallel; instead it serializes them through the queue mechanism implemented by gradio. As for GPU utilization, when serving each request the transformers library uses a naive model-parallelism scheme for multi-GPU inference, meaning only a single GPU is actively computing at any given moment.

For production deployment, this repository does not cover those requirements; such tasks should be handled by your IT professionals. As an alternative, consider projects like FastChat combined with vLLM. That setup can execute multiple requests in parallel if your GPUs have adequate memory, and it uses tensor parallelism, engaging all GPUs concurrently and thus maximizing resource utilization.
from qwen.
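The FastChat + vLLM deployment recommended above can be sketched as the following launch sequence. This is a minimal sketch, not an official recipe: the model path `Qwen/Qwen-14B-Chat`, the port numbers, and the 4-GPU count are assumptions to adapt to your environment.

```shell
# Sketch of a FastChat + vLLM deployment. Assumes FastChat and vLLM are
# installed and the host has 4 GPUs; adjust --model-path and --num-gpus.

# 1. Controller: coordinates model workers.
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 &

# 2. vLLM worker: serves the model, sharding it across 4 GPUs with
#    tensor parallelism so all GPUs compute concurrently.
python3 -m fastchat.serve.vllm_worker \
    --model-path Qwen/Qwen-14B-Chat \
    --num-gpus 4 \
    --controller-address http://localhost:21001 &

# 3. OpenAI-compatible API server: exposes /v1/chat/completions on port 8000.
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000
```

Once the API server is up, any OpenAI-compatible client can send chat requests to `http://localhost:8000/v1`, and concurrent requests are batched by vLLM rather than queued one at a time.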
Related Issues (20)
- [BUG] The Function Calling example is broken; the latest openai SDK reports the API as deprecated at runtime HOT 2
- Where can I find the jinja template Qwen uses for vLLM? HOT 1
- [BUG] Running eval_plugin under eval for evaluation: one agent fails to pull a package from huggingface_hub HOT 1
- Can Qwen be deployed and run for inference on Qualcomm NPUs? HOT 1
- After finetuning, the vLLM deployment via llama_factory and Qwen's official vLLM deployment return different results HOT 2
- 💡 [REQUEST] - How do I set the output text length when calling qwen:14B via ollama? HOT 1
- [BUG] Calling a Qwen model via fastchat + vLLM + OpenAI API: doesn't the data need preprocessing first? HOT 1
- Local deployment runs very slowly HOT 4
- When will 2.5 be open-sourced? HOT 1
- [BUG] finetune.py crashes in get_peft_model (peft/mapping.py line 123): AttributeError: 'NoneType' object has no attribute '__dict__' HOT 2
- Qwen-14B without finetuning gives inconsistent outputs for the same question even with temperature set to 0. Why? HOT 2
- [BUG] torch.cuda.OutOfMemoryError: CUDA out of memory. HOT 1
- Qwen pre_trained: it prints some content and then nothing; unsure whether training finished HOT 2
- [BUG] Error converting Qwen1.5-14B HOT 1
- How to organize the training data format for multi-turn dialogue HOT 1
- [BUG] Questionable embedding feature shape extracted from Qwen-7B-Chat HOT 2
- [BUG] Command-line argument parsing error
- During tool calling, the model hallucinates parameters the user never provided HOT 1
- [BUG] How should the prompt for multi-turn dialogue be constructed?
- [BUG] Problems finetuning with finetune.sh