This is a BentoML example project that shows you how to serve and deploy open-source large language models using MLC-LLM, a high-performance universal deployment solution that enables native deployment of large language models with native APIs and compiler acceleration.
See here for a full list of BentoML example projects.
💡 This example serves as a basis for advanced code customization, such as a custom model, inference logic, or MLC-LLM options. For simple LLM hosting with OpenAI-compatible endpoints without writing any code, see OpenLLM.
- You have installed Python 3.8+ and pip. See the Python downloads page to learn more.
- You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
- If you want to test the Service locally, you need an Nvidia GPU with at least 16G VRAM.
- (Optional) We recommend you create a virtual environment for dependency isolation for this project. See the Conda documentation or the Python documentation for details.
git clone https://github.com/rickzx/BentoMLC.git
cd BentoMLC
pip install -r requirements.txt && pip install -U "pydantic>=2.0"
We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.
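For reference, service.py roughly follows the pattern below. This is only a minimal sketch, not the actual file: the class name MLC matches the logs further down, but the engine options, defaults, and exact MLC-LLM Python API calls are assumptions and may differ from the real implementation.

```python
from typing import AsyncGenerator, Optional

import bentoml
from mlc_llm import AsyncMLCEngine  # MLC-LLM's OpenAI-style async engine

MODEL_ID = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class MLC:
    def __init__(self) -> None:
        # Corresponds to the "Loading MLC Engine" log line below.
        self.engine = AsyncMLCEngine(MODEL_ID, mode="local")

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        tokens: Optional[int] = None,
    ) -> AsyncGenerator[str, None]:
        # Stream completion chunks back to the caller as they are produced.
        stream = await self.engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=tokens,
            stream=True,
        )
        async for chunk in stream:
            yield chunk.choices[0].delta.content or ""
```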
$ bentoml serve .
2024-05-09T04:01:54-0400 [INFO] [cli] Starting production HTTP BentoServer from "service:MLC" listening on http://localhost:3000 (Press CTRL+C to quit)
2024-05-09T04:01:56-0400 [INFO] [entry_service:MLC:1] Loading MLC Engine
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 125706, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:678: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
2024-05-09T04:02:04-0400 [INFO] [entry_service:MLC:1] MLC Engine loaded successfully.
The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways, for example:
CURL
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Explain superconductors like I'\''m five years old",
"tokens": null
}'
Python client
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Explain superconductors like I'm five years old",
        tokens=None,
    )
    for response in response_generator:
        print(response)
OpenAI-compatible endpoints
See openai_compatibility.ipynb
This Service uses the FastAPI ASGI Integration to hook up MLC-LLM OpenAI-compatible endpoints to the BentoML service.
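In outline, the integration looks roughly like the sketch below. This is a simplified illustration rather than the exact service.py: the OpenAI router import is omitted because it depends on the installed MLC-LLM version, and the mount path and service options are assumptions.

```python
import bentoml
from fastapi import FastAPI

# FastAPI app that carries the OpenAI-compatible routes
# (/v1/chat/completions, /v1/models, ...). The actual routes come from
# MLC-LLM's serve entrypoints; the import is omitted here because its
# exact path depends on the installed mlc_llm version.
openai_api_app = FastAPI()
# openai_api_app.include_router(...)  # add MLC-LLM's OpenAI router here

@bentoml.mount_asgi_app(openai_api_app)  # mounted at the root, so the /v1/... routes are served by this app
@bentoml.service(resources={"gpu": 1})
class MLC:
    ...
```

With the endpoints mounted, any OpenAI client can talk to the Service: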
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following function to get the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")
Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY
to your BentoCloud API key first.
export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}
You can then use the following line to replace the client in the above code snippet. Refer to Obtain the endpoint URL to retrieve the endpoint URL.
client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')
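For example, with a hypothetical Deployment URL, the client reads the token from the OPENAI_API_KEY environment variable automatically, or you can pass it explicitly:

```python
import os

from openai import OpenAI

# "https://my-mlc-service.bentoml.ai" is a hypothetical endpoint URL;
# replace it with the URL of your own BentoCloud Deployment.
client = OpenAI(
    base_url="https://my-mlc-service.bentoml.ai/v1",
    api_key=os.environ["OPENAI_API_KEY"],  # your BentoCloud API token
)
```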