
serverless-llm-app-factory

Not yet another LLM Python package.

This repository serves as a starting point to churn out serverless LLM web applications.

You're just getting started with LLMs or APIs? Take a look at the "stupidly minimal guide" accompanying llm-api-starterkit and return here afterwards.

Premise

Building and deploying new AI products to end-users is possible in minutes, if we leverage LLMs and managed serverless services.

Building an AI product used to involve three major technical obstacles:

(1) Data preparation & governance

(2) Model development & management

(3) Deployment (& maintenance)

With the ability of large language models (LLMs) to do zero-shot and few-shot learning from examples without context-specific training, it's possible to build new AI products in a couple of minutes, skipping model development (2), if we assume that:

(1) data is ready and/or static, or not needed for our product, and

(3) we use a managed service to handle deployment.

Key value proposition

In this repository, I focus on (3) deployment.

We leverage a simple design pattern from llm-api-starterkit using LangChain & FastAPI for model development.

Deployment requires both back-end and front-end resources.

We use back-end resources from Replicate, a serverless model endpoint service, for the following key reasons:

  • Use any open-source LLM: You can implement or adapt any existing open-source model and deploy it on Replicate with the fantastic cog template. Better yet, someone has likely already beaten you to it, and you can leverage their implementation of the latest LLM flavour.
  • Free of charge: Just log in with GitHub and you can use a limited amount of compute for free, ideal for getting started (exactly how much is unclear; I never hit a limit during simple development).
  • Extremely beginner-friendly: Using the LangChain integration, all you need to do is explore the model hub and copy the endpoint link into your application (see the sketch below).
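
For illustration, pulling a Replicate-hosted model into LangChain takes only a few lines of Python. A minimal sketch, assuming the langchain and replicate packages are installed; the model identifier and version hash below are placeholders, so copy the real ones from the model hub:

import os
from langchain.llms import Replicate

os.environ["REPLICATE_API_TOKEN"] = "r8_***"  # your token from replicate.com

# Paste the "owner/model:version" string from the model hub page here.
llm = Replicate(
    model="replicate/vicuna-13b:<version-hash>",  # illustrative, not a pinned version
    model_kwargs={"temperature": 0.75, "max_length": 500},
)

print(llm("Describe a calm lo-fi beat in rich musical terms."))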

Other considerations:

  • Easy access to other SOTA non-LLM models: An active community of researchers & practitioners implements state-of-the-art models faster than almost any other open-source platform.
  • Cheap to start, expensive to scale: According to inferless, Replicate is one of the more expensive serverless GPU options, but it provides serverless compute for your preferred Large Language (and other ML) models without opting you into an opinionated ecosystem.

For front-end deployment, we leverage fly.io to put our application on the web. No idea if this is the best option, but it:

  • is very easy to deploy if you have a Docker container for your front-end
  • has a free tier
  • seems popular

Design draft

So, what does the core idea look like?

The middleware deployed on fly.io contains the business logic and leverages the simple pattern from llm-api-starterkit:
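
As a rough sketch of that pattern (the endpoint, prompt, and model identifier below are illustrative assumptions, not the repository's actual code; REPLICATE_API_TOKEN is assumed to be set in the environment):

from fastapi import FastAPI
from langchain.chains import LLMChain
from langchain.llms import Replicate
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

app = FastAPI()

# Illustrative model identifier; pin a real version from the Replicate model hub.
llm = Replicate(model="replicate/vicuna-13b:<version-hash>")

prompt = PromptTemplate(
    input_variables=["text"],
    template="Extract the to-do items from the following text:\n{text}",
)
chain = LLMChain(llm=llm, prompt=prompt)

class TodoRequest(BaseModel):
    text: str

@app.post("/extract-todos")
def extract_todos(request: TodoRequest):
    # The business logic lives in this thin layer; inference runs on Replicate.
    return {"todos": chain.run(request.text)}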

An incomplete approach

  • This is by no means a fully fledged guide to operationalize the development and maintenance of your deployed LLM-powered serverless application.

  • Cost optimization and inference speed are not priorities in these examples, but they will be among the key considerations if you are building a user-facing product.

  • On top of that, we use no CI/CD, pre-commit hooks, or tests: nothing that slows us down from deploying a first prototype, but also nothing that keeps our app maintainable. Not recommended for production.

For a comprehensive guide to LLMOps, best practices & enterprise deployment... you'll have to wait until https://github.com/tleers/servelm is completed, or until someone else on the internet decides to invest their time into this :)

Reference implementations

Short descriptions and key considerations of the examples implemented in this repository.

To try the example applications out, skip to the local webapp quickstart, or immediately deploy a reference application on the web.

To-do extractor

TODO!


LLM-enhanced text 2 music with Meta's MusicGen

Text to music sample with Replicate, Vicuna & MusicGen

  1. The user inputs a music description
  2. The Vicuna LLM converts the input into a rich musical description
  3. MusicGen converts the rich musical description into music
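
A minimal sketch of those two model calls with the replicate Python client (assuming REPLICATE_API_TOKEN is set; the model identifiers and version hashes are illustrative placeholders, so copy the real ones from the model hub):

import replicate

user_input = "a happy summer beach tune"

# Step 2: Vicuna rewrites the user's input as a rich musical description.
# replicate.run streams LLM output as string chunks, so we join them.
rich_description = "".join(replicate.run(
    "replicate/vicuna-13b:<version-hash>",
    input={"prompt": f"Rewrite this as a rich musical description: {user_input}"},
))

# Step 3: MusicGen turns the rich description into an audio sample.
audio = replicate.run(
    "<owner>/musicgen:<version-hash>",
    input={"prompt": rich_description},
)
print(audio)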

Pricing? Replicate gives you a couple of free tries before they ask for a credit card (I presume; they never asked me).

  • The cost is less than $0.10 per sample when combining Vicuna & Audiocraft (prompt-to-sample endpoint).
  • Estimated unit costs for the Audiocraft endpoint were $0.00055 / second. Four samples cost me about $0.17.
  • Estimated unit costs for the Vicuna-13b endpoint were $0.0023 / second. Four samples cost about $0.08.
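
These estimates are consistent with the latencies below: $0.17 / 4 samples / $0.00055 per second ≈ 77 seconds of Audiocraft compute per sample, and $0.08 / 4 / $0.0023 ≈ 9 seconds of Vicuna compute per sample.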

Latency? 5 to 15 seconds for the LLM, about 60-90 seconds for music sample generation.

  • The LLM is deployed on Nvidia A100s.
  • The Audiocraft endpoint is deployed on Nvidia T4s. You could significantly improve latency by switching to A100s, and potentially optimize cost further.

LLM-powered creative agent that can create lyrics, music and music videos

TODO

Local webapp quickstart

You don't want to develop new apps yourself, but you're just interested in locally running a webapp in your browser? Start here.

If you have Docker installed

I recommend this route - it prevents dependency issues.

TODO README

If you do not have Docker installed

TODO, finalize examples.

We assume that Poetry, the go-to for dependency management and packaging in Python, is installed. If not, please install Poetry first.
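
The official installer is a one-liner:

curl -sSL https://install.python-poetry.org | python3 -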

To launch the API for:

  • to-do extraction: sh todo_api.sh
  • text 2 music sample: sh custom_music_sample.sh
  • custom music agent: sh muzikagent_api.sh

Web API Launch Quickstart

You want to launch one of the existing apps on the web?

Here we launch our API on the web. You need to connect fly.io to your GitHub account or email; you'll be prompted when executing the commands below.

curl -L https://fly.io/install.sh | sh
fly auth signup

Once you're logged in, launch the Web API of your choice, then add your Replicate API token as an app secret:

fly launch --dockerfile apps/todo_extractor/api.Dockerfile
flyctl secrets set REPLICATE_API_TOKEN=<your token>

Congratulations, your todo extractor API will go online at https://<app-name>.fly.dev/docs, where <app-name> is the name you chose during fly launch.

You're finished testing it out and want to take it down?

flyctl scale count 0

Installation

Development

To develop your own applications, start with:

poetry install

Alternatively, you can rely on trusty venv:

python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt

Set up Replicate & fly.io

  • Register on Replicate.com with your GitHub account.
  • Copy your API token (left-click your username, then click API tokens)
  • Paste the API token into your .secrets file
    • REPLICATE_API_TOKEN=r8_***
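
How the token gets from .secrets into the client is up to you; a minimal sketch, assuming python-dotenv is used as the loader (the repository may well use a different mechanism):

from dotenv import load_dotenv

# Load REPLICATE_API_TOKEN from the .secrets file into the environment,
# where the replicate client and the LangChain wrapper will pick it up.
load_dotenv(".secrets")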

PS

Why not AWS, GCP, Azure, HuggingFace?

As outlined above, Replicate is very easy to use and doesn't require a credit card to sign up; an active GitHub account seems to be enough to get access to limited free compute. On top of that, Replicate seems to have one of the fastest communities when it comes to picking up SOTA models and making them available for others to use. Their cog container template may be partially responsible for that.

  • AWS, GCP, and Azure were not selected because they take significantly more learning & effort to pick up and (usually) require a credit card.
  • Hugging Face is a strong contender, and ultimately probably the better choice for a product that you expect to scale: cost-wise, integration-wise, and maturity-wise it's the more optimal choice. However, you cannot use GPUs without a credit card, the learning curve is significantly higher, and its strongly community-supported but ultimately bloated ecosystem makes adoption of SOTA models somewhat slower and harder to maintain.
  • OpenAI: In practice probably the easiest and cheapest option to build an application with at present. Not selected because it's trivial to use (and already demonstrated in llm-api-starterkit), because it requires a credit card, and because we want to demonstrate other options.

Hidden agenda

I'm building on different, larger products that could benefit from a reference repository that explains how to design and deploy LLM webapps. Okay, I want LLM-powered agents to build stuff for me.
