Fly Autoscaler

fly-autoscaler is a metrics-based autoscaler for Fly.io. It polls metrics from a Prometheus instance and computes the target number of machines based on those metrics.

How it works

The Fly Autoscaler works by performing a reconciliation loop on a regular interval. By default, it runs every 15 seconds.

  1. Collect metrics from external systems (e.g. Prometheus).

  2. Compute the target number of machines based on a user-provided expression.

  3. Fetch a list of all Fly Machines for your application.

  4. If the target number of machines is greater than the number of started machines, use the Fly Machines API to start additional machines.

                                     ┌────────────────────┐
fly-autoscaler ──────────┐           │                    │
│ ┌────────────────────┐ │    ┌──────│     Prometheus     │
│ │                    │ │    │      │                    │
│ │  Metric Collector  │◀┼────┘      └────────────────────┘
│ │                    │ │                                 
│ └──────┬─────────────┘ │                                 
│        │     △         │                                 
│        ▽     │         │                                 
│ ┌────────────┴───────┐ │                                 
│ │                    │ │                                 
│ │     Reconciler     │◀┼────┐                            
│ │                    │ │    │      ┌────────────────────┐
│ └────────────────────┘ │    │      │                    │
└────────────────────────┘    └─────▶│  Fly Machines API  │
                                     │                    │
                                     └────────────────────┘
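The reconciliation steps above can be sketched in Go (the project's language). This is a simplified illustration, not the project's actual code: `Machine`, `reconcile`, and the capacity of 10 items per machine are stand-ins.

```go
package main

import (
	"fmt"
	"math"
)

// Machine is a stand-in for a Fly Machine; only the started flag matters here.
type Machine struct{ Started bool }

// reconcile performs one pass of the loop: given a collected metric value and
// a per-machine capacity, it returns how many additional machines to start.
// The target is capped at the fleet size, since the autoscaler only starts
// machines that already exist.
func reconcile(queueDepth, perMachine float64, fleet []Machine) int {
	target := int(math.Ceil(queueDepth / perMachine)) // step 2: evaluate expression
	if target > len(fleet) {
		target = len(fleet)
	}
	started := 0
	for _, m := range fleet { // step 3: count currently started machines
		if m.Started {
			started++
		}
	}
	if target > started { // step 4: scale up only
		return target - started
	}
	return 0
}

func main() {
	fleet := []Machine{{Started: true}, {Started: false}, {Started: false}}
	// target ceil(42/10)=5, capped to fleet size 3; 1 already started → start 2
	fmt.Println(reconcile(42, 10, fleet))
}
```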

Expressions

The autoscaler uses the Expr language to define the target number of machines. See the Expr Language Definition for syntax and a full list of built-in functions. The expression can utilize any named metrics that you collect and it should always return a number.

For example, if you poll for queue depth and each machine can handle 10 queue items at a time, you can compute the number of machines as:

ceil(queue_depth / 10)

The autoscaler can only start machines, so it will never exceed the number of machines already created for a Fly app.
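To see how that expression behaves at the boundaries, here is the same `ceil(queue_depth / 10)` computed in plain Go. This is an illustration only; the autoscaler itself evaluates the Expr expression, not Go code.

```go
package main

import (
	"fmt"
	"math"
)

// machinesFor mirrors the expression ceil(queue_depth / 10) in plain Go.
func machinesFor(queueDepth float64) int {
	return int(math.Ceil(queueDepth / 10))
}

func main() {
	// 0 items need 0 machines, 10 items fit on 1 machine, 11 items need 2.
	for _, d := range []float64{0, 1, 10, 11, 95} {
		fmt.Printf("queue_depth=%v -> %d machine(s)\n", d, machinesFor(d))
	}
}
```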

Usage

Create an app for your autoscaler

First, create an app for your autoscaler:

$ fly apps create my-autoscaler

Then create a fly.toml for the deployment. Replace TARGET_APP_NAME with the name of the app that you want to scale, and MY_ORG with the organization where your Prometheus metrics live.

app = "my-autoscaler"

[build]
image = "flyio/fly-autoscaler:0.2"

[env]
FAS_APP_NAME = "TARGET_APP_NAME"
FAS_STARTED_MACHINE_COUNT = "ceil(queue_depth / 10)"
FAS_PROMETHEUS_ADDRESS = "https://api.fly.io/prometheus/MY_ORG"
FAS_PROMETHEUS_METRIC_NAME = "queue_depth"
FAS_PROMETHEUS_QUERY = "sum(queue_depth)"

[metrics]
port = 9090
path = "/metrics"

Create a deploy token

Next, set up a new deploy token for the application you want to scale:

$ fly tokens create deploy -a TARGET_APP_NAME

Set the token as a secret on your autoscaler application:

$ fly secrets set FAS_API_TOKEN="FlyV1 ..."

Create a read-only token

Create a token for reading your Prometheus data:

$ fly tokens create readonly

Set the token as a secret on your autoscaler application:

$ fly secrets set FAS_PROMETHEUS_TOKEN="FlyV1 ..."

Deploy the server

Finally, deploy your autoscaler application:

$ fly deploy

This should create a new machine and start it with the fly-autoscaler server running.

Testing your metrics & expression

You can perform a one-time run of metrics collection and expression evaluation for testing or debugging purposes by using the eval command. This command does not perform any scaling of Fly Machines; it only prints the result of the evaluated expression based on current metric values.

$ fly-autoscaler eval

You can change the evaluated expression by setting an environment variable:

$ FAS_STARTED_MACHINE_COUNT=queue_depth fly-autoscaler eval

Configuration

You can also configure fly-autoscaler with a YAML config file if you don't want to use environment variables or if you want to configure more than one metric collector.

Please see the reference fly-autoscaler.yml for more details.

Contributors

benbjohnson

Issues

Gracefully handle missing metrics

The reconciler has code to check if a metric is missing or NaN and avoid scaling. However, we need to investigate if lagging metrics will adversely affect scaling in any way.

Handle machines in "stopping" state

The autoscaler currently supports downscaling machines, but stopping a machine can take time. To allow the reconciler to make progress while machines are waiting to stop, the autoscaler should introduce a pseudo "stopping" status and handle the stops in a separate goroutine.

Scale multiple apps within an organization

Currently, the autoscaler only works for a single app. However, many customers run hundreds or thousands of apps, for example when each app serves one of the customer's tenants. It would be useful to allow scaling multiple apps with a single instance of the autoscaler.

/cc @hbagdi

Downscaling

Currently, the autoscaler only scales up and relies on the target application shutting down when it doesn't have work to perform. This works when there is low job parallelism within a single worker (e.g. a worker can handle 1 or 2 jobs at a time) but doesn't work well as it scales (e.g. a worker can handle tens of jobs at a time).

The reconciler should probably accept a min_machine_count and a max_machine_count expression instead of just the max (via expr).
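A sketch of how such a clamp could look once both expressions are evaluated. The names mirror the proposed min_machine_count / max_machine_count; the function itself is hypothetical, not existing behavior.

```go
package main

import "fmt"

// clamp bounds the evaluated target between the results of the proposed
// min_machine_count and max_machine_count expressions.
func clamp(target, minCount, maxCount int) int {
	if target < minCount {
		return minCount
	}
	if target > maxCount {
		return maxCount
	}
	return target
}

func main() {
	// With min=1 and max=5: a zero target floors at 1, a burst caps at 5.
	fmt.Println(clamp(0, 1, 5), clamp(9, 1, 5), clamp(3, 1, 5))
}
```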

Blue green upgrades

When running a cluster of machines (say, 3), you need to control not only the scale but also upgrades, so that there is no downtime.

The YAML config could be extended to support this with basic HTTP calls that put a machine into drain mode and then upgrade one machine at a time.

This is a pretty standard pattern, but there are others. https://gofr.dev/docs/advanced-guide/publishing-custom-metrics shows these standard patterns on a few of its pages.

This would broaden the range of systems that can be run on Fly.

https://github.com/gofr-dev/gofr is the code that matches the docs.

Kafka metric collector

We should add support for Kafka as a metric collector using the Kafka Go Client. We should be able to determine the queue depth based on the consumer group commit & offset.
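The lag arithmetic itself is simple: per partition, lag is the high watermark minus the committed offset. A sketch in plain Go, without a real Kafka client (a collector would first fetch these values via the client; the function and map shapes below are illustrative):

```go
package main

import "fmt"

// groupLag sums per-partition consumer-group lag: the difference between
// each partition's high watermark and its committed offset. A partition
// with no committed offset is treated as starting from offset 0.
func groupLag(highWatermarks, committed map[int32]int64) int64 {
	var lag int64
	for p, hw := range highWatermarks {
		c := committed[p] // zero value when no commit exists
		if d := hw - c; d > 0 {
			lag += d
		}
	}
	return lag
}

func main() {
	hw := map[int32]int64{0: 100, 1: 250}
	co := map[int32]int64{0: 90, 1: 250}
	fmt.Println(groupLag(hw, co)) // partition 0 lags by 10, partition 1 by 0
}
```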

Support scaling by creating/destroying machines

Currently, the autoscaler only scales existing machines because they are much faster to start/stop than creating new machines. However, it'd be great to support the creation of new machines ahead of possible scaling.

We should support min_created_machine_count and max_created_machine_count expressions so we can create additional machines as needed.

Laravel Queues

This seems like a tool that could potentially auto-scale Laravel Queue workers. Fly.io, more specifically @fideloper, has written about auto-scaling Laravel queues on Fly.io and has done some work to make this possible, but it would be neat to be able to use this for scaling and take the work off the Laravel job scheduler itself.

I don't have any answers yet on how to determine the current queue length, but I wanted to create the issue and see if it was something that you'd be interested in including?

Throttling

Currently, the autoscaler will immediately scale up to the target number of machines. However, we may want to add an optional throttle so that the number of started machines only grows by a percentage of the currently started machines.
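One possible shape for such a throttle, sketched in Go. The percentage step and the minimum growth of one machine per pass are assumptions for illustration, not implemented behavior.

```go
package main

import (
	"fmt"
	"math"
)

// throttledTarget limits a scale-up to growing by at most pct of the
// currently started machines per reconciliation pass, always allowing at
// least one new machine so scaling can begin from a small fleet.
func throttledTarget(current, target int, pct float64) int {
	if target <= current {
		return target // no throttle needed when not scaling up
	}
	step := int(math.Ceil(float64(current) * pct))
	if step < 1 {
		step = 1
	}
	if current+step < target {
		return current + step
	}
	return target
}

func main() {
	// 10 started, target 100, 50% throttle: grow to 15 this pass.
	fmt.Println(throttledTarget(10, 100, 0.5))
}
```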
