This is a BentoML example project that shows you how to serve and deploy open-source large language models using MLC-LLM, a high-performance universal deployment solution that enables native deployment of large language models with native APIs and compiler acceleration.
See here for a full list of BentoML example projects.
💡 This example serves as a basis for advanced code customization, such as a custom model, inference logic, or MLC-LLM options. For simple LLM hosting with OpenAI-compatible endpoints without writing any code, see OpenLLM.
- You have installed Python 3.8+ and pip. See the Python downloads page to learn more.
- You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
- If you want to test the Service locally, you need an Nvidia GPU with at least 16G VRAM.
- (Optional) We recommend you create a virtual environment for dependency isolation for this project. See the Conda documentation or the Python documentation for details.
git clone https://github.com/rickzx/BentoMLC.git
cd BentoMLC
pip install -r requirements.txt && pip install -U "pydantic>=2.0"
We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.
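For reference, service.py roughly follows the pattern below. This is only a minimal sketch, not the actual file: the class name MLC matches the logs further down, but the engine options, defaults, and exact MLC-LLM Python API calls are assumptions and may differ from the real implementation.

```python
from typing import AsyncGenerator, Optional

import bentoml
from mlc_llm import AsyncMLCEngine  # MLC-LLM's OpenAI-style async engine

MODEL_ID = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"


@bentoml.service(resources={"gpu": 1}, traffic={"timeout": 300})
class MLC:
    def __init__(self) -> None:
        # Corresponds to the "Loading MLC Engine" log line below.
        self.engine = AsyncMLCEngine(MODEL_ID, mode="local")

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        tokens: Optional[int] = None,
    ) -> AsyncGenerator[str, None]:
        # Stream completion chunks back to the caller as they are produced.
        stream = await self.engine.chat.completions.create(
            messages=[{"role": "user", "content": prompt}],
            max_tokens=tokens,
            stream=True,
        )
        async for chunk in stream:
            yield chunk.choices[0].delta.content or ""
```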
$ bentoml serve .
2024-05-09T04:01:54-0400 [INFO] [cli] Starting production HTTP BentoServer from "service:MLC" listening on http://localhost:3000 (Press CTRL+C to quit)
2024-05-09T04:01:56-0400 [INFO] [entry_service:MLC:1] Loading MLC Engine
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 125706, prefill chunk size will be set to 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:678: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
2024-05-09T04:02:04-0400 [INFO] [entry_service:MLC:1] MLC Engine loaded successfully.
The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways, for example:
CURL
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Explain superconductors like I'\''m five years old",
"tokens": null
}'
Python client
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Explain superconductors like I'm five years old",
        tokens=None,
    )
    for response in response_generator:
        print(response)
OpenAI-compatible endpoints
See openai_compatibility.ipynb
This Service uses the FastAPI ASGI Integration to hook up MLC-LLM OpenAI-compatible endpoints to the BentoML service.
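In outline, the integration looks roughly like the sketch below. This is a simplified illustration rather than the exact service.py: the OpenAI router import is omitted because it depends on the installed MLC-LLM version, and the mount path and service options are assumptions.

```python
import bentoml
from fastapi import FastAPI

# FastAPI app that carries the OpenAI-compatible routes
# (/v1/chat/completions, /v1/models, ...). The actual routes come from
# MLC-LLM's serve entrypoints; the import is omitted here because its
# exact path depends on the installed mlc_llm version.
openai_api_app = FastAPI()
# openai_api_app.include_router(...)  # add MLC-LLM's OpenAI router here

@bentoml.mount_asgi_app(openai_api_app)  # mounted at the root, so the /v1/... routes are served by this app
@bentoml.service(resources={"gpu": 1})
class MLC:
    ...
```

With the endpoints mounted, any OpenAI client can talk to the Service: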
from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# Use the following function to get the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")
Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY
to your BentoCloud API key first.
export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}
You can then use the following line to replace the client in the above code snippet. Refer to Obtain the endpoint URL to retrieve the endpoint URL.
client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')
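For example, with a hypothetical Deployment URL, the client reads the token from the OPENAI_API_KEY environment variable automatically, or you can pass it explicitly:

```python
import os

from openai import OpenAI

# "https://my-mlc-service.bentoml.ai" is a hypothetical endpoint URL;
# replace it with the URL of your own BentoCloud Deployment.
client = OpenAI(
    base_url="https://my-mlc-service.bentoml.ai/v1",
    api_key=os.environ["OPENAI_API_KEY"],  # your BentoCloud API token
)
```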