christianalexander / bumblebee-model-harness

This project forked from fly-apps/bumblebee-model-harness

Minimal Elixir application that can host AI models on Fly.io GPUs and make them available via WireGuard to remote clustered Elixir applications for improved development experiences.

Home Page: https://fly.io/phoenix-files/easy-at-home-ai-with-bumblebee-and-fly-gpus/

License: MIT License

Shell 0.46% JavaScript 3.99% Elixir 74.72% CSS 0.14% HTML 16.82% Batchfile 0.05% Dockerfile 3.82%

Harness

See the accompanying blog post Easy at-home AI with Bumblebee and Fly GPUs

This is a minimal Elixir Phoenix web application that only exists to host a pre-trained Machine Learning model on a Fly.io machine with an attached GPU. The purpose is to make the model and GPU accessible to a separate Elixir application that is clustered with this app. In this way, this app is just a harness for the following:

  • Fetching the ML model (from HuggingFace)
  • Caching the downloaded model on a volume attached to the machine
  • Hosting the model through Bumblebee in an Nx.Serving, making it easy for Elixir code to communicate with it.

What's the advantage?

For a more detailed look at the advantages of doing this, please refer to this article.

In short, it's for the following reasons:

  • keep the benefits of rapid, local development
  • get access to large models, machines and GPUs
  • shut down machines when not in use
  • develop customized ML/AI code without compromising on dev tooling or speed of development
  • bragging rights

Deploy this for yourself

Deploying builds the Dockerfile image, deploys it, and starts the selected serving. For me, downloading and starting the serving for a new Llama 2 model took about 4 minutes.

Track the logs if you like:

fly logs

Optional updates:

The fly.toml file has the auto_stop_machines = false setting. This is helpful when getting started so the machine doesn't get shut down while the model is being downloaded. Once the machine is set up, feel free to change this value if that works best for your needs.

The VM size setting is size = "a100-40gb". This ensures the machine we get has an NVIDIA A100 GPU.

Selecting a ready-to-go model

Three LLMs are built-in and ready to go. Select the model to serve and enable it. Depending on the available hardware and the size of the model, hosting multiple models on the same GPU may not be practical or possible.

Select a single model to enable, deploy the harness application, and develop against it.

Models:

To select a model, uncomment it in lib/harness/application.ex and comment out the unused ones. This selects which serving to create and start. The following is an example of serving the Llama 2 model.

{Harness.DelayedServing,
  serving_name: Llama2ChatModel,
  serving_fn: fn -> Harness.Llama2Chat.serving() end},

In this example, the serving_name of Llama2ChatModel is the name of the serving to address in the client application. Name it whatever you like! It is the name the client uses when calling the serving. In the client, it looks like this:

Nx.Serving.batched_run(Llama2ChatModel, "Say hello.")

The harness application uses a DelayedServing helper to start the model. Downloading and loading a large model takes time. If that happens synchronously during application startup, the application is unresponsive to health checks. Fly will kill the app thinking it's unresponsive... which it is.

The DelayedServing makes the loading asynchronous so the application starts quickly and is responsive while the larger loading task continues.
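
The general idea can be sketched with plain OTP. This is not the actual Harness.DelayedServing code; the module name DelayedLoaderSketch, the :load_fn option, and the ready?/1 function are all hypothetical, chosen only to illustrate how a slow load can be moved out of the startup path.

```elixir
defmodule DelayedLoaderSketch do
  # Hypothetical illustration only; not the harness's DelayedServing module.
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.get(opts, :name, __MODULE__))
  end

  # Lets callers ask whether the slow load has finished.
  def ready?(server \\ __MODULE__), do: GenServer.call(server, :ready?)

  @impl true
  def init(opts) do
    load_fn = Keyword.fetch!(opts, :load_fn)
    # Return right away so the supervisor can keep booting the rest of the app.
    {:ok, %{load_fn: load_fn, result: nil, ready: false}, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    parent = self()
    # Run the slow work in a linked process so this server keeps answering calls.
    {:ok, _pid} = Task.start_link(fn -> send(parent, {:loaded, state.load_fn.()}) end)
    {:noreply, state}
  end

  @impl true
  def handle_info({:loaded, result}, state) do
    {:noreply, %{state | result: result, ready: true}}
  end

  @impl true
  def handle_call(:ready?, _from, state), do: {:reply, state.ready, state}
end
```

Placed in a supervision tree as {DelayedLoaderSketch, load_fn: fn -> ... end}, the process starts immediately and only reports ready once the load function has returned.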

Troubleshooting and diagnosis tips

To test and verify that you've successfully deployed the application to a machine with GPU access and that your application has all the necessary support for taking advantage of the GPU, do the following:

$ fly ssh console
# bin/harness remote
iex> Harness.DelayedServing.has_gpu_access?()
true

If you get a true response, then both the machine and your Elixir application have access to the GPU.

Additionally, the harness app's logs report whether the Elixir application has GPU access.

Logged messages:

  • info - "Elixir has CUDA GPU access! Starting serving #{serving_name}."
  • warning - "Elixir does not have GPU access. Serving will NOT be started."
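
The check-then-start behavior behind these two messages can be sketched as follows. This is an assumption about the shape of the logic, not the harness's actual code: GpuGateSketch and maybe_start/3 are hypothetical names, and the GPU check is passed in as a boolean because how the harness detects the GPU is not shown here.

```elixir
defmodule GpuGateSketch do
  # Hypothetical illustration only; the log messages are the ones listed above.
  require Logger

  # gpu_available? stands in for the result of a real GPU check.
  # start_fn is a zero-arity function that would start the serving.
  def maybe_start(serving_name, gpu_available?, start_fn) when is_function(start_fn, 0) do
    if gpu_available? do
      Logger.info("Elixir has CUDA GPU access! Starting serving #{serving_name}.")
      {:ok, start_fn.()}
    else
      Logger.warning("Elixir does not have GPU access. Serving will NOT be started.")
      :skipped
    end
  end
end
```

The point of the gate is that a missing GPU produces a clear warning in the logs instead of a serving that silently falls back to the CPU.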

Waiting for first-time model downloads

Depending on the model being used, it may be many GB in size and take several minutes to download.

If the Fly.io volume is set up correctly and available to the machine, the files are downloaded to /data/cache/bumblebee/huggingface/.

Before the Nx serving can be activated, the model must be fully downloaded, loaded into RAM, then moved to the GPU. Once complete, the serving is available for making calls against.

The attached volume caches the download so the local files are used the next time the harness application is started, skipping the lengthy download step.
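
To confirm that files have actually landed in the cache, a small helper like the one below could be pasted into the remote IEx session. It is a sketch using only the standard File module; CacheCheckSketch and summary/1 are hypothetical names, and it only counts regular files directly inside the given directory.

```elixir
defmodule CacheCheckSketch do
  # Hypothetical helper; not part of the harness application.
  # Returns {:ok, file_count, total_bytes} for regular files directly in dir,
  # or {:error, reason} if the directory cannot be listed.
  def summary(dir) do
    with {:ok, names} <- File.ls(dir) do
      sizes =
        for name <- names,
            {:ok, %File.Stat{type: :regular, size: size}} <- [File.stat(Path.join(dir, name))],
            do: size

      {:ok, length(sizes), Enum.sum(sizes)}
    end
  end
end
```

For example, CacheCheckSketch.summary("/data/cache/bumblebee/huggingface/") returns {:error, :enoent} if the directory does not exist, which would suggest the volume is not mounted where expected.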

Clustering your local app to the harness app on Fly.io

This documentation walks through the process: Easy Clustering from Home to Fly.io

Contributors: brainlid, christianalexander
