Self-host LLMs with MLC-LLM and BentoML

This BentoML example project shows how to serve and deploy open-source large language models (LLMs) using MLC-LLM, a high-performance universal deployment solution that enables native deployment of any large language model, with native APIs and compiler acceleration.

See here for a full list of BentoML example projects.

💡 This example serves as a basis for advanced code customization, such as a custom model, custom inference logic, or MLC-LLM options. For simple LLM hosting with an OpenAI-compatible endpoint and no code to write, see OpenLLM.

Prerequisites

  • You have installed Python 3.8+ and pip. See the Python downloads page to learn more.
  • You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
  • If you want to test the Service locally, you need an NVIDIA GPU with at least 16 GB of VRAM.
  • (Optional) We recommend you create a virtual environment for dependency isolation for this project. See the Conda documentation or the Python documentation for details.
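
The Python and GPU prerequisites above can be checked with a short script. This is a minimal sketch and not part of the project: the helper names `python_ok`, `parse_vram_mib`, and `query_vram_mib` are hypothetical, and it assumes `nvidia-smi` is on the PATH when an NVIDIA GPU is present.

```python
import subprocess
import sys
from typing import List, Tuple


def python_ok(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the interpreter meets the Python 3.8+ requirement."""
    return tuple(version) >= (3, 8)


def parse_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one total-memory value (in MiB) per line of nvidia-smi output."""
    return [
        int(line)
        for line in nvidia_smi_output.splitlines()
        if line.strip().isdigit()  # skip blank or non-numeric lines
    ]


def query_vram_mib() -> List[int]:
    """Ask nvidia-smi for each GPU's total memory; return [] if unavailable."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_vram_mib(out)


if __name__ == "__main__":
    print("Python 3.8+:", python_ok())
    print("GPU total memory (MiB):", query_vram_mib() or "no NVIDIA GPU detected")
```

The script only reports the values. Compare them against the 16 GB recommendation yourself, since the total a driver reports can be slightly below the card's nominal size.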

Install dependencies

git clone https://github.com/rickzx/BentoMLC.git
cd BentoMLC
pip install -r requirements.txt && pip install -U "pydantic>=2.0"

Run the BentoML Service

We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.

$ bentoml serve .

2024-05-09T04:01:54-0400 [INFO] [cli] Starting production HTTP BentoServer from "service:MLC" listening on http://localhost:3000 (Press CTRL+C to quit)
2024-05-09T04:01:56-0400 [INFO] [entry_service:MLC:1] Loading MLC Engine
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024. 
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 1024. 
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:601: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 125706, prefill chunk size will be set to 1024. 
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:678: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 1024.
[04:02:01] /workspace/mlc-llm/cpp/serve/config.cc:683: Estimated total single GPU memory usage: 5736.325 MB (Parameters: 4308.133 MB. KVCache: 1092.268 MB. Temporary buffer: 335.925 MB). The actual usage might be slightly larger than the estimated number.
2024-05-09T04:02:04-0400 [INFO] [entry_service:MLC:1] MLC Engine loaded successfully.
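
The memory estimate in the engine log above is the sum of its three components (parameters, KV cache, and temporary buffer), which helps when judging whether a given GPU is large enough. A small check using the figures copied from the log; the variable names are illustrative only:

```python
# Figures (in MB) copied from the engine log above.
parameters_mb = 4308.133
kv_cache_mb = 1092.268
temp_buffer_mb = 335.925

total_mb = parameters_mb + kv_cache_mb + temp_buffer_mb

# The log reports 5736.325 MB; the rounded components agree with it
# up to rounding in the last printed digit.
print(f"Estimated total: {total_mb:.3f} MB")
```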

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways.

CURL

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Explain superconductors like I'\''m five years old",
  "tokens": null
}'

Python client

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Explain superconductors like I'm five years old",
        tokens=None
    )
    for response in response_generator:
        print(response)
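
If you prefer not to depend on the BentoML client, the same request as the CURL example can be sent with only the Python standard library. The following is a sketch under stated assumptions: `build_generate_request` and `stream_generate` are hypothetical helper names, and it assumes the endpoint streams its response as text, as the `accept: text/event-stream` header in the CURL example suggests.

```python
import json
import urllib.request
from typing import Iterator, Optional


def build_generate_request(
    prompt: str,
    tokens: Optional[int] = None,
    base_url: str = "http://localhost:3000",
) -> urllib.request.Request:
    """Build the same POST /generate request as the CURL example."""
    body = json.dumps({"prompt": prompt, "tokens": tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Accept": "text/event-stream", "Content-Type": "application/json"},
        method="POST",
    )


def stream_generate(prompt: str, tokens: Optional[int] = None) -> Iterator[str]:
    """Send the request and yield the response line by line as it arrives."""
    with urllib.request.urlopen(build_generate_request(prompt, tokens)) as resp:
        for raw_line in resp:  # the response object iterates over lines
            yield raw_line.decode("utf-8")
```

With the Service running, `for text in stream_generate("Explain superconductors like I'm five years old"): print(text, end="")` prints the output as it is received.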

OpenAI-compatible endpoints

See openai_compatibility.ipynb

This Service uses the FastAPI ASGI integration to expose MLC-LLM's OpenAI-compatible endpoints through the BentoML Service.

from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)

for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")
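
The `or ""` in the loop above matters because some streamed chunks, typically the last one, can carry `None` as their `delta.content`. The sketch below applies the same pattern to collect the pieces into one string. It uses `SimpleNamespace` stand-ins that only mimic the attributes accessed above, so it runs without a server; `collect_stream` and `fake_chunk` are hypothetical helper names.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensively skip chunks without any choices
            continue
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Stand-in with the same attribute shape as a streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [fake_chunk("Super"), fake_chunk("conductors"), fake_chunk(None)]
print(collect_stream(stream))
```

Against a running Service, `collect_stream(chat_completion)` would return the full reply instead of printing it piece by piece.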

Note: If your Service is deployed with protected endpoints on BentoCloud, you first need to set the environment variable OPENAI_API_KEY to your BentoCloud API token.

export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}

You can then use the following line to replace the client in the above code snippet. Refer to Obtain the endpoint URL to retrieve the endpoint URL.

client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')
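
The OpenAI Python client reads `OPENAI_API_KEY` from the environment when no `api_key` argument is passed, which is why the line above omits it. If you would rather fail early when the variable is missing, the connection settings can be assembled explicitly first. A minimal sketch: `openai_client_kwargs` is a hypothetical helper, and the endpoint URL remains a placeholder you must supply.

```python
import os
from typing import Dict


def openai_client_kwargs(endpoint_url: str) -> Dict[str, str]:
    """Build keyword arguments for OpenAI(...) from a deployment endpoint URL."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY to your BentoCloud API token first.")
    # The OpenAI-compatible routes live under /v1, as in the local example.
    return {"base_url": endpoint_url.rstrip("/") + "/v1", "api_key": api_key}
```

The result can be passed straight to the client, for example `client = OpenAI(**openai_client_kwargs("your_bentocloud_deployment_endpoint_url"))`.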
