

License: Mozilla Public License 2.0


sherpa's Introduction

Sherpa


Sherpa is a highly available, fast, and flexible horizontal job scaler for HashiCorp Nomad. It is capable of running in a number of different modes to suit different requirements, and can scale based on Nomad resource metrics or external sources.

Features

  • Scale jobs based on Nomad resource consumption and external metrics: The Sherpa autoscaler can use a mixture of Nomad resource checks, and external metric values to make scaling decisions. Both are optional to provide flexibility. Jobs can also be scaled via the CLI and API in either a manual manner, or by using webhooks sent from external applications such as Prometheus Alertmanager.
  • Highly available and fault tolerant: Sherpa performs leadership locking and quick fail-over, allowing multiple instances to run safely. During availability issues or deployments, Sherpa servers gracefully handle leadership changes, resulting in uninterrupted scaling.
  • Operator friendly: Sherpa is designed to be easy to understand and work with as an operator. Scaling state in particular can contain metadata, providing insights into exactly why a scaling activity took place. A simple UI is also available to provide an easy method of checking scaling activities.

Download & Install

  • The Sherpa binary can be downloaded from the GitHub releases page using curl -L https://github.com/jrasell/sherpa/releases/download/v0.4.2/sherpa_0.4.2_linux_amd64 -o sherpa

  • A Docker image can be found on Docker Hub; the latest version can be pulled using docker pull jrasell/sherpa.

  • Sherpa can be built from source by cloning the repository (git clone github.com/jrasell/sherpa.git) and then running the make build command.

Documentation

Please refer to the documentation directory for guides to help with deploying and using Sherpa in your Nomad setup.

Contributing

Contributions to Sherpa are very welcome! Please reach out if you have any questions.

Contributors

Thanks to everyone who has contributed to this project.

@jvineet @josegonzalez @pmcatominey @numiralofe @commarla @hobochili


sherpa's Issues

metrics: add prometheus format to metrics endpoint

Is your feature request related to a problem? Please describe.
Currently Sherpa does not expose a metrics endpoint which is consumable via Prometheus. Prometheus works in a pull/scrape manner and requires metrics to be formatted in a certain manner at the metrics endpoint.

Describe the solution you'd like.
The metrics endpoint should be updated to allow Prometheus to successfully scrape data. The CLI should have a flag to enable this functionality, in a similar manner to the other telemetry endpoints.
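If this were implemented, a minimal sketch of exposing the existing metrics path in Prometheus format could look like the following, using the standard client_golang handler and a hypothetical opt-in flag (illustrative only, not Sherpa's actual code):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Hypothetical opt-in flag; Sherpa's real CLI/flag handling may differ.
    prometheusEnabled := true

    mux := http.NewServeMux()
    if prometheusEnabled {
        // promhttp.Handler serves registered metrics in the Prometheus text
        // exposition format, which a Prometheus server can scrape.
        mux.Handle("/v1/system/metrics", promhttp.Handler())
    }
    log.Fatal(http.ListenAndServe("127.0.0.1:8000", mux))
}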

leadership: if cluster state exists do not overwrite cluster info

Describe the bug
If existing cluster state exists in the backend store, a new Sherpa server which gains leadership will generate a new cluster ID and overwrite the state.

To reproduce

  • run a single Sherpa server using Consul backend
  • stop the Sherpa server
  • observe the cluster state info
  • start a new Sherpa server
  • observe the cluster ID gets updated

Expected behavior
A server which joins the state of an existing cluster should not overwrite the ID or name, but should respect what is originally found.
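A sketch of the expected behaviour, assuming a hypothetical storage interface rather than Sherpa's actual backend code: the newly elected leader only generates and writes cluster info when none already exists.

package main

import (
    "fmt"

    "github.com/google/uuid"
)

// ClusterInfo is a hypothetical representation of the persisted cluster state.
type ClusterInfo struct {
    ID   string
    Name string
}

// Backend is a hypothetical storage interface; the real backend differs.
type Backend interface {
    GetClusterInfo() (*ClusterInfo, error)
    PutClusterInfo(*ClusterInfo) error
}

// ensureClusterInfo only generates and stores new cluster info when the
// backend holds none, so a newly elected leader respects existing state.
func ensureClusterInfo(b Backend, name string) (*ClusterInfo, error) {
    existing, err := b.GetClusterInfo()
    if err != nil {
        return nil, err
    }
    if existing != nil {
        return existing, nil // keep the ID and name already stored
    }
    info := &ClusterInfo{ID: uuid.New().String(), Name: name}
    return info, b.PutClusterInfo(info)
}

func main() {
    fmt.Println("ensureClusterInfo would be called once leadership is gained")
}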

Environment:

sherpa version v0.2.1+dev
	Date:   2019-10-14 09:27:50.896254 +0000 UTC
	Commit: be871e9
	Branch: master
	State:  dirty

cmd: add latest flag to scale status to list only latest events

Describe the solution you'd like.
Sherpa internally has easy methods to list the latest scaling events for job groups without having to perform any additional iterations. It would be helpful to expose this to the user, so they are able to get a quick overview of the latest activities to have occurred without having to iterate over and analyse the standard scaling list output.

cmd: scale status list should sort the events by time

Is your feature request related to a problem? Please describe.
Currently the list output from scale status is not sorted, which makes looking through the events difficult and time-consuming.

Describe the solution you'd like.
The output from scale status should be ordered by time, newest -> oldest to make it much easier for users to interact with.
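A minimal sketch of the desired newest-to-oldest ordering, using a hypothetical event type rather than Sherpa's actual state types:

package main

import (
    "fmt"
    "sort"
    "time"
)

// ScalingEvent is a hypothetical stand-in for a stored scaling event.
type ScalingEvent struct {
    Time time.Time
    Job  string
}

func main() {
    events := []ScalingEvent{
        {Time: time.Now().Add(-2 * time.Hour), Job: "cache"},
        {Time: time.Now(), Job: "api"},
        {Time: time.Now().Add(-1 * time.Hour), Job: "worker"},
    }

    // Sort newest -> oldest so the most recent activity is listed first.
    sort.Slice(events, func(i, j int) bool {
        return events[i].Time.After(events[j].Time)
    })

    for _, e := range events {
        fmt.Println(e.Time.Format(time.RFC3339), e.Job)
    }
}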

[feature] add pprof functionality to allow better analysis of Sherpa

Is your feature request related to a problem? Please describe.
Currently only the Sherpa runtime metrics and log messages provide insight into Sherpa server performance. It would be beneficial to have a method which allows better insight into the server's performance and behaviour.

Describe the solution you'd like.
It would be ideal to have an option to enable pprof for visualization and analysis of a running Sherpa server. This will help identify bottlenecks or edge cases as well as help optimize the server.

stdlib link - https://golang.org/pkg/net/http/pprof/
Google Github link - https://github.com/google/pprof
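A sketch of how pprof could be exposed behind an opt-in option using the stdlib package linked above (the flag handling here is hypothetical):

package main

import (
    "log"
    "net/http"
    "net/http/pprof"
)

func main() {
    // Hypothetical opt-in; Sherpa's real CLI flag handling may differ.
    pprofEnabled := true

    mux := http.NewServeMux()
    if pprofEnabled {
        // Registering the handlers explicitly keeps pprof off the mux when
        // disabled, instead of relying on the side-effect-only blank import.
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
        mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    }
    log.Fatal(http.ListenAndServe("127.0.0.1:8000", mux))
}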

Discussion: External metric query comparison optimization

Is your feature request related to a problem? Please describe.
In working with the external query providers, for a CPU/Memory scale-in/out check, the exact same query is executed twice with the current model.

I think Sherpa can execute the query once and then run it against multiple Action comparisons. I haven't looked to see how much of a change this would be, but I just wanted to propose the idea.

Describe the solution you'd like.

What if the external check model looked something like below?

{
  "Enabled": true,
  "MaxCount": 16,
  "MinCount": 2,
  "ScaleOutCount": 1,
  "ScaleInCount": 1,
  "ExternalChecks": {
    "check_name": {
      "Enabled": true,
      "Provider": "provider-type",
      "Query": "fancy provider metric query",
      "Actions": [
        {
          "ComparisonOperator": "less-than",
          "ComparisonValue": 30,
          "Action": "scale-in"
        },
        {
          "ComparisonOperator": "greater-than",
          "ComparisonValue": 80,
          "Action": "scale-out"
        }
      ]
    }
  }
}
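A rough sketch of the proposed evaluation flow: execute the provider query once, then compare the single result against each configured action. Type and field names are illustrative, not Sherpa's:

package main

import "fmt"

// Action mirrors the proposed model: one comparison and the scaling
// direction it should trigger.
type Action struct {
    ComparisonOperator string // "less-than" or "greater-than"
    ComparisonValue    float64
    Direction          string // "scale-in" or "scale-out"
}

// decide runs the comparisons against a single query result, so the
// external provider is only queried once per check.
func decide(value float64, actions []Action) string {
    for _, a := range actions {
        switch a.ComparisonOperator {
        case "less-than":
            if value < a.ComparisonValue {
                return a.Direction
            }
        case "greater-than":
            if value > a.ComparisonValue {
                return a.Direction
            }
        }
    }
    return "none"
}

func main() {
    actions := []Action{
        {ComparisonOperator: "less-than", ComparisonValue: 30, Direction: "scale-in"},
        {ComparisonOperator: "greater-than", ComparisonValue: 80, Direction: "scale-out"},
    }
    fmt.Println(decide(85, actions)) // scale-out
}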

Describe alternatives you've considered.
N/A

Explain any additional use-cases.
N/A

Additional context.
N/A

Support highly available deployments

Is your feature request related to a problem? Please describe.
In its current state Sherpa has no awareness of other potential Sherpa servers running on the same Nomad cluster. This can result in multiple servers performing scaling actions, which is problematic.

Describe the solution you'd like.
Sherpa should have a way to perform clustering or leadership locking of some kind. This would mean only one server at a time can be responsible for performing scaling requests. Seeing as Consul can be used as a persistence layer, initial thoughts are to use it in a manner similar to Vault to perform clustering and leadership activities.

Initially, mutual exclusion is the highest-priority feature to achieve, along with some way of identifying which instance holds the lock and which do not. Stretch or additional tasks could include request forwarding from a "follower" instance to the "active" instance.
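A hedged sketch of Consul-based leadership locking using the official Consul API client's session lock; the key name and surrounding wiring are illustrative, not a description of how Sherpa would necessarily implement it:

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // LockKey creates a lock handle backed by a Consul session; only one
    // server can hold the lock at a time.
    lock, err := client.LockKey("sherpa/leadership")
    if err != nil {
        log.Fatal(err)
    }

    // Lock blocks until leadership is acquired; the returned channel is
    // closed if the lock is later lost, signalling the server to step down.
    lostCh, err := lock.Lock(nil)
    if err != nil {
        log.Fatal(err)
    }
    log.Println("acquired leadership; starting scaling activities")

    <-lostCh
    log.Println("leadership lost; stopping scaling activities")
}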

Docker Hub wrong version - 0.4.0

Describe the bug

The latest version pushed to Docker Hub does not seem to be the right one.

The interface displays version 0.3.0 instead of 0.4.0, and the digest (a73972f6f7bd) is the same on Docker Hub for both versions.

The build also shows version 0.3.0.

It seems like you forgot to bump the version in the Dockerfile.

To reproduce

docker run -it --rm -p 8000:8000 jrasell/sherpa:0.4.0 server --bind-addr 0.0.0.0 --ui

Hit http://127.0.0.1:8000/ui, version 0.3.0 is displayed.

Expected behavior

Sherpa at version 0.4.0.

Add cooldown functionality to job group scaling

Describe the solution you'd like.
Cooldown is a feature of autoscaling which helps ensure previous scaling activities have had a chance to impact the load on an application before another scaling event is triggered. The recent addition of scaling event state tracking now allows Sherpa to include cooldowns within its scaling decision tree.

AWS - https://docs.aws.amazon.com/autoscaling/ec2/userguide/Cooldown.html
Google - https://cloud.google.com/compute/docs/autoscaler/
Microsoft - https://docs.microsoft.com/en-us/azure/azure-monitor/platform/autoscale-virtual-machine-scale-sets?toc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fvirtual-machine-scale-sets%2FTOC.json&bc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fbread%2Ftoc.json
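A minimal sketch of a cooldown check built on top of the scaling event state mentioned above; the function and values are illustrative only:

package main

import (
    "fmt"
    "time"
)

// inCooldown reports whether a new scaling action should be skipped because
// the last event for the group happened within the cooldown window.
func inCooldown(lastEvent time.Time, cooldown time.Duration) bool {
    return time.Since(lastEvent) < cooldown
}

func main() {
    last := time.Now().Add(-90 * time.Second)
    cooldown := 180 * time.Second

    if inCooldown(last, cooldown) {
        fmt.Println("skipping scaling: group is in cooldown")
    } else {
        fmt.Println("cooldown expired: scaling may proceed")
    }
}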

Add task group count reconcile event handler

Is your feature request related to a problem? Please describe.
There are situations where the scaling policy limits for a group may fall outside of the current running count. This could be caused by misconfiguration, or by introducing Sherpa onto a system where counts have not been adjusted to match the agreed-upon policy. This can be visualised as follows:

running count = 3
policy min count = 4
policy max count = 8

Describe the solution you'd like.
It could be beneficial to have a reconcile handler which runs on a loop and can take action against job groups whose count falls outside the bounds of an associated scaling policy. This feature could be optional, with a CLI flag to control whether it is enabled or not. The feature could also only work when the internal policy engine is used.
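A sketch of the clamping logic such a reconcile handler might apply on each loop iteration (function and parameter names are illustrative):

package main

import "fmt"

// reconcileCount returns the count a job group should be set to so that it
// falls within the policy bounds; returning the running count unchanged
// means no action is needed.
func reconcileCount(running, min, max int) int {
    if running < min {
        return min
    }
    if running > max {
        return max
    }
    return running
}

func main() {
    // Example from the issue: running 3, policy bounds [4, 8].
    fmt.Println(reconcileCount(3, 4, 8)) // 4 -> scale out to the minimum
}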

Scale in instead of out

Describe the bug
I am seeing strange behaviour: a scale-in happens where I would expect a scale-out.

To reproduce
My config is the following (I use Nomad meta):

 "Meta": {
        "sherpa_max_count": "15",
        "sherpa_cooldown": "180",
        "timestamp": "2019-10-30T13:23:10Z",
        "sherpa_scale_out_memory_percentage_threshold": "70",
        "sherpa_scale_out_cpu_percentage_threshold": "70",
        "sherpa_enabled": "1",
        "sherpa_min_count": "1",
        "sherpa_scale_out_count": "1",
        "sherpa_scale_in_cpu_percentage_threshold": "30",
        "sherpa_scale_in_memory_percentage_threshold": "30",
        "sherpa_scale_in_count": "1"
      },

In the log I have

Oct 31 08:56:31 admin-10-32-152-182 docker[23243]: {"level":"debug","job":"my-app","group":"my-app-main-spot","mem-usage-percentage":26,"cpu-usage-percentage":120,"time":"2019-10-31T07:56:31.222253696Z","message":"resource utilisation calculation"}
Oct 31 08:56:31 admin-10-32-152-182 docker[23243]: {"level":"debug","job":"my-app","scaling-req":{"direction":"in","count":1,"group":"my-app-main-spot"},"time":"2019-10-31T07:56:31.222325293Z","message":"added group scaling request"}
Oct 31 08:56:31 admin-10-32-152-182 docker[23243]: {"level":"info","job":"my-app","id":"35eafbc8-9946-4a1c-bcfe-6f1ec7394528","evaluation-id":"06d82281-1ba2-59ab-c9d5-10bd47fa527c","time":"2019-10-31T07:56:31.262742873Z","message":"successfully triggered autoscaling of job"}

With a CPU usage of 120% I should see a scale-out, not a scale-in.
Is there a conflict with my memory usage, which is under 30%?

Expected behavior
A scale out

Environment:

  • Sherpa server information (retrieve with sherpa system info):
/usr/bin # sherpa system info
Nomad Address                http://xxxxxxxx.eu-central-1.elb.amazonaws.com:4646
Policy Engine                Nomad Job Group Meta
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
    docker image jrasell/sherpa:0.2.1
/usr/bin # sherpa --version
sherpa version v0.2.1
        Date:   2019-10-31 08:07:37.880961463 +0000 UTC
        Commit: be871e9
        Branch: v0.2.1
        State:  v0.2.1
  • Server Operating System/Architecture:
    Docker 19.03.2
    Debian strech 9.11
    Linux sherpa 4.19.0-0.bpo.6-amd64 SMP Debian 4.19.67-2+deb10u1~bpo9+1 (2019-09-30) x86_64 Linux

  • Sherpa server configuration parameters:

SHERPA_AUTOSCALER_ENABLED=true
SHERPA_AUTOSCALER_EVALUATION_INTERVAL=60
SHERPA_AUTOSCALER_NUM_THREADS=3
SHERPA_BIND_ADDR=0.0.0.0
SHERPA_BIND_PORT=8000
SHERPA_CLUSTER_ADVERTISE_ADDR=http://127.0.0.1:8000
SHERPA_CLUSTER_NAME=prod-main-admin
SHERPA_LOG_FORMAT=auto
SHERPA_LOG_LEVEL=debug
SHERPA_LOG_USE_COLOR=true
SHERPA_POLICY_ENGINE_API_ENABLED=false
SHERPA_POLICY_ENGINE_NOMAD_META_ENABLED=true
SHERPA_POLICY_ENGINE_STRICT_CHECKING_ENABLED=true
SHERPA_STORAGE_CONSUL_ENABLED=true
SHERPA_STORAGE_CONSUL_PATH=sherpa/
SHERPA_TELEMETRY_STATSD_ADDRESS=
SHERPA_TELEMETRY_STATSITE_ADDRESS=
SHERPA_TLS_CERT_KEY_PATH=
SHERPA_TLS_CERT_PATH=
SHERPA_UI=false
  • Nomad client configuration parameters (if any):
    There is nothing specific in my nomad config
  • Consul client configuration parameters (if any):
    There is nothing specific in my consul config

scale-out event erases volume definition

Describe the bug
On a Nomad job that has volumes defined, a scale-out via the API causes the volume configuration to disappear.

To reproduce
Create a job with volumes, scale out, and see the volume definition change.

Expected behavior
The scale-out event should replicate the volume information.

Environment:

  • Sherpa 0.4.2

scheduled scaling of task groups

Is your feature request related to a problem? Please describe.
Tasks that have clear and observable load patterns could benefit from scheduled autoscaling. This feature would allow the task in question to run at minimal cost during quiet hours, and then increase its count to handle known traffic as required. A clear use case would be CI/CD build agents, which would be scaled down outside of office hours and then scheduled to scale up just before the start of the work day.

In order to ensure this feature is most useful, it would likely be a requirement to remove the policy min count merge default and allow setting of zero.

Describe the solution you'd like.
The scaling policy structure would need a new parameter which allows for the definition of scheduled actions. Internally, the autoscaler would need to run a new thread, which would periodically trigger scaling actions based on the desired state.

AWS - https://docs.aws.amazon.com/autoscaling/ec2/userguide/schedule_time.html
Azure - https://docs.microsoft.com/en-us/azure/azure-monitor/learn/tutorial-autoscale-performance-schedule
GCP - https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule
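A rough sketch of what a schedule-driven desired count could look like, using simple UTC time-of-day windows; the policy fields shown are hypothetical and not part of Sherpa's current policy document:

package main

import (
    "fmt"
    "time"
)

// ScheduledAction is a hypothetical addition to the scaling policy: the
// group runs at OnCount inside the window and at OffCount outside it.
type ScheduledAction struct {
    StartHour int // UTC hour the window opens
    EndHour   int // UTC hour the window closes
    OnCount   int
    OffCount  int
}

// desiredCount returns the count a scheduled-scaling thread would submit
// for the given time.
func desiredCount(a ScheduledAction, now time.Time) int {
    h := now.UTC().Hour()
    if h >= a.StartHour && h < a.EndHour {
        return a.OnCount
    }
    return a.OffCount
}

func main() {
    // Example: CI/CD build agents scaled up for the working day only.
    action := ScheduledAction{StartHour: 7, EndHour: 19, OnCount: 6, OffCount: 0}
    fmt.Println("desired count now:", desiredCount(action, time.Now()))
}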

Incorrect response code and error when API scaling breaks limits

Describe the bug
When attempting to scale a job by a count which will break the configured policy thresholds, the returned error is empty and the response code is 304. The error should include details of what problem occurred. The response code should be 403, to indicate that the request contained valid parameters but was rejected by the server.

Do not scale job groups which are in deployment

Describe the solution you'd like.
Initially, Sherpa should decline to perform scaling on a job group which is classified as being in deployment by Nomad. Application deployments cause instability and can cause inaccurate resource usage calculations. In the future functionality could be considered which allows the scaling of job groups if they are under certain deployment types, but that is out of scope for this iteration.

Calling the Nomad API for every scaling trigger would be costly; therefore it would be preferable to process updates via a blocking query on the Nomad deployments API. Internally, Sherpa could keep track of deployments within a map, allowing quick lookups and reducing pressure on Nomad.
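A hedged sketch of the blocking-query approach using the Nomad Go API client: track which jobs currently have a running deployment in a map so the autoscaler can skip them (the wiring around the loop is illustrative):

package main

import (
    "log"
    "time"

    nomad "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := nomad.NewClient(nomad.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    active := make(map[string]bool) // job ID -> has a running deployment
    var index uint64

    for {
        // Blocking query: only returns once the deployments list changes past
        // the supplied index, which keeps pressure on the Nomad servers low.
        deployments, meta, err := client.Deployments().List(&nomad.QueryOptions{WaitIndex: index})
        if err != nil {
            log.Println("listing deployments:", err)
            time.Sleep(5 * time.Second)
            continue
        }
        index = meta.LastIndex

        for k := range active {
            delete(active, k)
        }
        for _, d := range deployments {
            if d.Status == "running" {
                active[d.JobID] = true
            }
        }
        // The autoscaler can consult `active` before scaling a job group.
        log.Printf("%d job(s) currently in deployment", len(active))
    }
}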

twitter poll reference: https://twitter.com/jrasell/status/1172817834663141379

[feature] improve health check to test consul/nomad endpoints

Describe the solution you'd like.
The health check is currently a simple alive check and does not perform any checks on Sherpa dependencies such as the Nomad or Consul APIs. The health check should be extended to perform checks which assess the health of these dependencies, to provide better operator feedback.
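A hedged sketch of an extended health handler that also probes the Nomad and Consul APIs; the specific probes chosen here are illustrative, not a statement of what Sherpa should call:

package main

import (
    "log"
    "net/http"

    consul "github.com/hashicorp/consul/api"
    nomad "github.com/hashicorp/nomad/api"
)

func healthHandler(n *nomad.Client, c *consul.Client) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Probe Nomad by asking the agent about itself.
        if _, err := n.Agent().Self(); err != nil {
            http.Error(w, "nomad unreachable: "+err.Error(), http.StatusServiceUnavailable)
            return
        }
        // Probe Consul by asking for the current cluster leader.
        if _, err := c.Status().Leader(); err != nil {
            http.Error(w, "consul unreachable: "+err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok"))
    }
}

func main() {
    n, _ := nomad.NewClient(nomad.DefaultConfig())
    c, _ := consul.NewClient(consul.DefaultConfig())
    http.HandleFunc("/v1/system/health", healthHandler(n, c))
    log.Fatal(http.ListenAndServe("127.0.0.1:8000", nil))
}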

Rounding error occurs during % resource utilization calculation for autoscaling.

Describe the bug
The current order of computation for calculating cpuUsage and memUsage for autoscaling does the division first and then the multiplication by 100, so the percentage values always get rounded down to multiples of 100 (0% for usage < 100%, 100% for usage between 100%-200%, 200% for usage between 200%-300%, and so on), since resourceUsage[group].cpu and resourceInfo[group].cpu both appear to be integer values.

To reproduce
Example logs for autoscaling engine:

12:54PM DBG resource utilisation calculation cpu-usage-percentage=0 mem-usage-percentage=0
12:54PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:54PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway
12:55PM DBG resource utilisation calculation cpu-usage-percentage=100 mem-usage-percentage=0
12:55PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:55PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway
12:56PM DBG resource utilisation calculation cpu-usage-percentage=0 mem-usage-percentage=0
12:56PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:56PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway
12:57PM DBG resource utilisation calculation cpu-usage-percentage=0 mem-usage-percentage=0
12:57PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:57PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway
12:58PM DBG resource utilisation calculation cpu-usage-percentage=0 mem-usage-percentage=0
12:58PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:58PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway
12:59PM DBG resource utilisation calculation cpu-usage-percentage=200 mem-usage-percentage=0
12:59PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"in","group":"pushgateway"}
12:59PM DBG scaling action will break job group minimum threshold group=pushgateway job=pushgateway

Expected behavior
The cpu-usage-percentage should have been around 40% for the most part except intermittently when it went above 100%.
The mem-usage-percentage should have been around 30%
(these are the values observed from the Resource Utilization charts in Nomad UI)

Environment:

  • Sherpa server information (retrieve with sherpa system info):
Nomad Address                http://loco:4646
Policy Engine                Nomad Job Group Meta
Policy Storage Backend       In Memory
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
built locally from github source, in order to test for and narrow down on the issue (built with go version go1.12.7)
  • Server Operating System/Architecture:
macOS Mojave Version 10.14.5 / 64 bit
  • Sherpa server configuration parameters:
sherpa server --autoscaler-enabled --policy-engine-nomad-meta-enabled --policy-engine-api-enabled=false --log-level=debug
  • Nomad client configuration parameters (if any):
client {
  enabled = true
  options {
    "docker.auth.config" = "/etc/nomad/.docker/config.json"
  }
}
  • Consul client configuration parameters (if any):
Not using the Consul backend for the meta policy engine

Additional context

https://github.com/jrasell/sherpa/blob/master/pkg/autoscale/autoscale.go#L54-L55
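A small illustration of the rounding behaviour described above, and one way to avoid it by converting to floating point (or multiplying before dividing); the variable names are made up, not the actual Sherpa code:

package main

import "fmt"

func main() {
    usage, allocated := 403, 1000 // e.g. CPU MHz used vs. allocated

    // Integer division first: 403/1000 == 0, so the percentage collapses
    // to a multiple of 100 (here 0).
    broken := usage / allocated * 100

    // Convert to float (or multiply before dividing) to keep precision.
    fixed := float64(usage) / float64(allocated) * 100

    fmt.Println(broken) // 0
    fmt.Println(fixed)  // 40.3
}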

Autoscaler concurrency limiter to avoid overloading

Is your feature request related to a problem? Please describe.
Currently the internal autoscaler loops over the stored policies and triggers an autoscaling run without any limit on the number of concurrent processes. In environments with a decent number of running and scalable jobs, this can put pressure on the Nomad servers due to the number of API calls required.

Describe the solution you'd like.
The autoscaler should have some method to limit the number of concurrent autoscaling activities that can take place. This would ideally have a sensible default, but be overridable by the operator. The Nomad num_schedulers option provides an ideal reference.
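A sketch of one common way to implement such a limit, a buffered-channel semaphore with a configurable size; the default of 3 used here is arbitrary and not a Sherpa setting:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    policies := []string{"api", "cache", "worker", "ingest", "report"}

    const maxConcurrent = 3 // hypothetical operator-overridable default
    sem := make(chan struct{}, maxConcurrent)

    var wg sync.WaitGroup
    for _, job := range policies {
        wg.Add(1)
        sem <- struct{}{} // blocks while maxConcurrent evaluations are running
        go func(job string) {
            defer wg.Done()
            defer func() { <-sem }()
            fmt.Println("evaluating", job)
            time.Sleep(100 * time.Millisecond) // stand-in for Nomad API calls
        }(job)
    }
    wg.Wait()
}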

The nomad-meta-policy-engine is unable to load the policy tags from the job group meta stanza

Describe the bug

The nomad-meta-policy-engine is unable to load the policy tags from the meta stanza within the job-group stanza. This is probably happening because jobs[i].ModifyIndex is always <= maxFound (meta.LastIndex), so m.indexHasChange(jobs[i].ModifyIndex, maxFound) always returns false and prevents m.readJobMeta(jobs[i].ID) from loading the policy from the meta tags into memory. (Observed with Nomad version v0.9.2; not tested against older Nomad versions.)

https://github.com/jrasell/sherpa/blob/master/pkg/policy/watcher/watcher.go#L41

To reproduce

Run the sherpa server with meta-policy-engine enabled

sherpa server --autoscaler-enabled --policy-engine-nomad-meta-enabled --policy-engine-api-enabled=false --log-level=debug

and deploy a Nomad job where the policy is defined in the group stanza, like the following:

meta {
  sherpa_enabled = true
  sherpa_max_count = 4
  sherpa_min_count = 1
  sherpa_scale_out_count = 1
  sherpa_scale_in_count = 1
  sherpa_scale_out_cpu_percentage_threshold = 40
  sherpa_scale_out_memory_percentage_threshold = 30
  sherpa_scale_in_cpu_percentage_threshold = 20
  sherpa_scale_in_memory_percentage_threshold = 5
}

The server logs show that Sherpa isn't able to read the policies from the meta tags:

6:20PM INF starting HTTP server addr=127.0.0.1:8000
6:20PM INF Sherpa server configuration server={"autoscaler-enabled":true,"autoscaler-evaluation-interval":60,"bind-addr":"127.0.0.1","bind-port":8000,"policy-engine-api-enabled":false,"policy-engine-nomad-meta-enabled":true,"policy-engine-strict-checking-enabled":true,"storage-consul-enabled":false,"storage-consul-path":"sherpa/policies/"} telemetry={"telemetry-statsd-address":"","telemetry-statsite-address":""} tls={"tls-cert-key-path":"","tls-cert-path":""}
6:20PM DBG setting up Nomad client
6:20PM DBG setting up in-memory storage backend
6:20PM DBG setting up Sherpa internal auto-scaling engine
6:20PM DBG setting up HTTP server routes
6:20PM INF mounting route endpoint GetSystemHealth method=GET path=/v1/system/health
6:20PM INF mounting route endpoint GetSystemInfo method=GET path=/v1/system/info
6:20PM INF mounting route endpoint GetSystemMetrics method=GET path=/v1/system/metrics
6:20PM INF mounting route endpoint ScaleOutJobGroup method=PUT path=/v1/scale/out/{job_id}/{group}
6:20PM INF mounting route endpoint ScaleInJobGroup method=PUT path=/v1/scale/in/{job_id}/{group}
6:20PM INF mounting route endpoint GetJobScalingPolicies method=GET path=/v1/policies
6:20PM INF mounting route endpoint GetJobScalingPolicy method=GET path=/v1/policy/{job_id}
6:20PM INF mounting route endpoint GetJobGroupScalingPolicy method=GET path=/v1/policy/{job_id}/{group}
6:20PM INF HTTP server successfully listening addr=127.0.0.1:8000
6:20PM INF starting Sherpa Nomad meta policy engine
6:20PM INF starting Sherpa internal auto-scaling engine
6:20PM DBG meta watcher last index has changed
6:21PM DBG no scaling policies found in storage backend
6:22PM DBG no scaling policies found in storage backend
6:23PM DBG no scaling policies found in storage backend
6:24PM DBG no scaling policies found in storage backend

Also, $ sherpa policy list doesn't list anything

Expected behavior

The autoscale policy should be read from the meta stanza within the job group, and running

$ sherpa policy list

should list all the policies defined that way.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
Nomad Address                http://loco:4646
Policy Engine                Nomad Job Group Meta
Policy Storage Backend       In Memory
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
v0.0.1
  • Server Operating System/Architecture:
macOS Mojave Version 10.14.5 / 64 bit
  • Sherpa server configuration parameters:
sherpa server --autoscaler-enabled --policy-engine-nomad-meta-enabled --policy-engine-api-enabled=false --log-level=debug
  • Nomad client configuration parameters (if any):
(Nomad v0.9.2)
client {
  enabled = true
  options {
    "docker.auth.config" = "/etc/nomad/.docker/config.json"
  }
}
  • Consul client configuration parameters (if any):
Not using the Consul backend for the meta policy engine


Feature: External metric provider support for InfluxDB

Is your feature request related to a problem? Please describe.
Currently the only external metrics provider is Prometheus. It would be nice if additional providers were added to the project.

Describe the solution you'd like.
Add support for querying metrics out of InfluxDB

Describe alternatives you've considered.
No other alternatives

Explain any additional use-cases.
The Sherpa API could be used in conjunction with Kapacitor, but having native support in Sherpa to query InfluxDB would be very convenient.

Additional context.
N/A

Scaling doesn't work when the initial task count is outside the minimum and maximum values

Describe the bug
When applying a policy with interval [MinCount, MaxCount] to an existing job whose count is lower than MinCount, scaling out does not occur.

To reproduce
Nomad job example:

job "test1" {
	region = "us-west-2"
        datacenters = ["us-west-2"]
	type = "service"
	group "api" {
		count = 1 < -- -- -- -- -- -- -- -- -- -- --not in interval[2, 4]
		task "test" {
			driver = "docker"
			config {
				image = "nginx:latest"
				}
			resources {
				cpu = 300# MHz
				memory = 256# MB
				}
			}
		}
	}

Scaling policy:

{
  "Enabled": true,
  "Cooldown": 180,
  "MinCount": 2,
  "MaxCount": 4,
  "ScaleOutCount": 1,
  "ScaleInCount": 1
}

Expected behavior
The desired number of tasks should be scaled out to 2.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
Nomad Address                http://http.nomad.service.my-dev:4646
Policy Engine                Sherpa API
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
sherpa version v0.4.1
 Date:   2020-01-17 20:02:32.946360009 +0000 UTC
Commit: d215c85
Branch: v0.4.1
State:  v0.4.1
  • Server Operating System/Architecture: Docker on Ubuntu 18
  • Sherpa server configuration parameters:
docker run -e "CONSUL_HTTP_ADDR=consul.service.my-dev:8500" -e "NOMAD_ADDR=http://http.nomad.service.my-dev:4646" -it -p 8000:8000 jrasell/sherpa server --cluster-name sherpa-test-akar --ui --bind-addr "" --storage-consul-enabled --autoscaler-enabled --policy-engine-api-enabled --log-level=debug

pulumi-provider: build Pulumi provider to manage policies

This feature would be outside the Sherpa repo; but having a Pulumi provider available to manage the configuration of scaling policies would allow for easy, standardized, and codified Sherpa deployments when using the API policy engine.

Sherpa UI wrong timestamp format (month is incorrect)

Describe the bug
On the UI the timestamp has the wrong format - month is incorrect.

To reproduce
I just run Sherpa and in the UI I see:
Completed | 2020-0-30 10:55:57.594 +0000 UTC

Expected behavior
Completed | 2020-01-30 10:55:57.594 +0000 UTC

Environment:
Sherpa version: 0.4.1

Scaling not working after scaling in to 1

Describe the bug
Sometimes when one of my task groups is scaled in to 1 allocation, Sherpa seems to forget the policies and doesn't scale anymore.

To reproduce
I am using nomad meta.

sherpa_cooldown:"180"
sherpa_enabled:"1"
sherpa_max_count:"5"
sherpa_min_count:"1"
sherpa_scale_in_count:"1"
sherpa_scale_in_cpu_percentage_threshold:"30"
sherpa_scale_in_memory_percentage_threshold:"30"
sherpa_scale_out_count:"1"
sherpa_scale_out_cpu_percentage_threshold:"70"
sherpa_scale_out_memory_percentage_threshold:"70"

The job scales in to 1 and then never scales out.


I just need to manually increase the count of the task group from 1 to 2 and Sherpa continues to scale out normally.


In this example the service was stuck for 15+ days.
In 0.3.0 I have the following logs:

Jan 02 14:37:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:37:56.421144110Z","message":"no task groups in job have scaling policies enabled"}
Jan 02 14:38:56 admin-10-32-152-182 docker[7065]: {"level":"debug","job":"my-service","time":"2020-01-02T13:38:56.422893887Z","message":"no task groups in job have scaling policies enabled"}

After upgrading to 0.4.0

Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","time":"2020-01-02T13:57:21.298110900Z","message":"triggering autoscaling job evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"error","job":"my-service","error":"no allocations found to match task group with scaling policy","time":"2020-01-02T13:57:21.306032109Z","message":"failed to collect Nomad metrics, skipping Nomad based checks"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"debug","job":"my-service","group":"my-service-service-spot","time":"2020-01-02T13:57:21.306043091Z","message":"triggering autoscaling job group evaluation"}
Jan 02 14:57:21 admin-10-32-152-182 docker[29652]: {"level":"info","job":"my-service","time":"2020-01-02T13:57:21.306154810Z","message":"scaling evaluation completed and no scaling required"}

Expected behavior
A scaling out event.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
/usr/bin # sherpa system info
Nomad Address                http://nomad.eu-central-1.elb.amazonaws.com:4646
Policy Engine                Nomad Job Group Meta
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
/usr/bin # sherpa --version
sherpa version v0.4.0
        Date:   2020-01-02 14:26:03.176156495 +0000 UTC
        Commit: 212ca6e
        Branch: v0.4.0
        State:  v0.4.0
  • Server Operating System/Architecture: Debian 9.11
  • Sherpa server configuration parameters:
SHERPA_AUTOSCALER_ENABLED=true
SHERPA_AUTOSCALER_EVALUATION_INTERVAL=60
SHERPA_AUTOSCALER_NUM_THREADS=20
SHERPA_BIND_ADDR=0.0.0.0
SHERPA_BIND_PORT=8000
SHERPA_CLUSTER_ADVERTISE_ADDR=http://127.0.0.1:8000
SHERPA_CLUSTER_NAME=prod-main-admin
SHERPA_LOG_FORMAT=auto
SHERPA_LOG_LEVEL=debug
SHERPA_LOG_USE_COLOR=true
SHERPA_POLICY_ENGINE_API_ENABLED=false
SHERPA_POLICY_ENGINE_NOMAD_META_ENABLED=true
SHERPA_POLICY_ENGINE_STRICT_CHECKING_ENABLED=true
SHERPA_STORAGE_CONSUL_ENABLED=true
SHERPA_STORAGE_CONSUL_PATH=sherpa/
SHERPA_TELEMETRY_PROMETHEUS=true
SHERPA_UI=true

Failed and Completed allocs are being used in computations for job group scaling requirements

Describe the bug

For Nomad version v0.9.2 (not tested against older Nomad versions), the autoscaler triggers scaling behaviour unexpectedly, since it adds the resource allocations and the last captured resource utilisation (RSS memory and CPU ticks) from allocations that have already completed or failed. As we don't want such allocations affecting the computation used to determine the elasticity of a Nomad job group, we should perhaps only look at allocations that are in the running or pending state when determining the current state of the job group resources.
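A sketch of the suggested filter using the Nomad API client, keeping only running or pending allocations before any resource stats are gathered; the surrounding wiring is illustrative:

package main

import (
    "log"

    nomad "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := nomad.NewClient(nomad.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    allocs, _, err := client.Jobs().Allocations("pushgateway", false, nil)
    if err != nil {
        log.Fatal(err)
    }

    // Ignore completed/failed allocations that Nomad has not yet garbage
    // collected so they do not skew the utilisation calculation.
    var live []*nomad.AllocationListStub
    for _, a := range allocs {
        if a.ClientStatus == "running" || a.ClientStatus == "pending" {
            live = append(live, a)
        }
    }
    log.Printf("using %d of %d allocations for resource calculations", len(live), len(allocs))
}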

To reproduce

We added a debug statement in the getJobResourceUsage function right after a.nomad.Allocations().Stats(allocs[i], nil) is called:

a.logger.Debug().
    Int("cpu", int(stats.ResourceUsage.CpuStats.TotalTicks)).
    Int("memory", int(stats.ResourceUsage.MemoryStats.RSS/1024/1024)).
    Msg("Usage")
 DBG Reading policy from meta
2:34PM DBG Read policy from meta
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=0 memory=0
2:35PM DBG Usage cpu=23 memory=7
2:35PM DBG Usage cpu=112 memory=8
2:35PM DBG Usage cpu=5 memory=7
2:35PM DBG resource utilisation calculation cpu-usage-percentage=77 mem-usage-percentage=10
2:35PM INF added group scaling request job=pushgateway scaling-req={"count":1,"direction":"out","group":"pushgateway"}
2:35PM DBG scaling action will break job group maximum threshold group=pushgateway job=pushgateway

Expected behavior

We expected to see only one DBG Usage cpu=23 memory=7 line, as there was only one actively running allocation at the time, but we saw a line for every allocation, including those marked failed or completed. The Nomad GC doesn't collect these allocations quickly; they stick around for some time and can affect the resource utilisation calculations.

Environment:

  • Sherpa server information (retrieve with sherpa system info):
Nomad Address                http://loco:4646
Policy Engine                Nomad Job Group Meta
Policy Storage Backend       In Memory
Internal AutoScaling Engine  true
Strict Policy Checking       true
  • Sherpa CLI version (retrieve with sherpa --version):
built locally from github source, in order to test for and narrow down on the issue (built with go version go1.12.7)
  • Server Operating System/Architecture:
macOS Mojave Version 10.14.5 / 64 bit
  • Sherpa server configuration parameters:
sherpa server --autoscaler-enabled --policy-engine-nomad-meta-enabled --policy-engine-api-enabled=false --log-level=debug
  • Nomad client configuration parameters (if any):
(Nomad v0.9.2)
client {
  enabled = true
  options {
    "docker.auth.config" = "/etc/nomad/.docker/config.json"
  }
}
  • Consul client configuration parameters (if any):
Not using the Consul backend for the meta policy engine


job with multiple groups problem

hi All,

Env: sherpa: 0.2.0 / nomad 0.9.5

Problem: I have a Nomad job with multiple groups defined, some with Sherpa meta to enable scaling policies and other groups without any scaling policy; the Sherpa policies are defined inside the group block.

Deployed
Task Group  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
dynamic     true         1        1       0        1          2019-10-11T10:24:38Z
static      true         1        1       0        1          2019-10-11T10:24:38Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
82adab39  7144d68b  dynamic     0        run      running  2m44s ago  3s ago
f2b5b883  8efd65cc  static      0        run      running  2m44s ago  3s ago

If I curl the API I can see that the scaling policy for the dynamic group is properly set:

curl localhost:9000/v1/policies
{"jobcheckout":{"dynamic":{"Enabled":true,"MinCount":1,"MaxCount":4,"ScaleOutCount":1,"ScaleInCount":1,"ScaleOutCPUPercentageThreshold":85,"ScaleOutMemoryPercentageThreshold":85,"ScaleInCPUPercentageThreshold":30,"ScaleInMemoryPercentageThreshold":30}}}

but after putting some load on the job, I am getting the following in the Sherpa logs:

{"time":"2019-10-11T10:28:36.350750688Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:28:36.350815881Z","message":"worker with func exits from panic: goroutine 575 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x0, 0x405585, 0x42db9c, 0xc0001435f8, 0x0, 0xc0001f09a0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001887e0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc0001ca600)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"level":"debug","time":"2019-10-11T10:29:14.365681892Z","message":"meta watcher last index has not changed"}
{"level":"debug","time":"2019-10-11T10:29:17.735299209Z","message":"deployment watcher last index has not changed"}
{"time":"2019-10-11T10:29:36.351848672Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:29:36.351949427Z","message":"worker with func exits from panic: goroutine 612 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x405585, 0x42db9c, 0xc00013f5f8, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0002f4d80)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:30:36.352331770Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}
{"time":"2019-10-11T10:30:36.352406052Z","message":"worker with func exits from panic: goroutine 639 [running]:\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1.1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:59 +0x13e\npanic(0x9a5b40, 0xf1f150)\n\t/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).getJobAllocations(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500, 0x2, 0x0, 0x0, 0x0, 0x0, 0xb13ee0)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:116 +0x174\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).autoscaleJob(0xc0001305a0, 0xc0002a4d05, 0xb, 0xc0002a9500)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:11 +0x6a\ngithub.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9474c0, 0xc0001e1580)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:185 +0xe8\ngithub.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run.func1(0xc000148720)\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:71 +0xb3\ncreated by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants.(*goWorkerWithFunc).run\n\t/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/worker_func.go:49 +0x4d\n"}
{"time":"2019-10-11T10:31:36.350696901Z","message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"}

Add telemetry measurements to key functionality

Is your feature request related to a problem? Please describe.
Sherpa currently ships with the ability to send metrics to a number of backends; however, the metrics currently emitted are only the default Go runtime metrics. It would aid operators to have further metrics which provide greater insight into the server's health and the actions which have been taken.

Initial thoughts on where to implement (a rough sketch follows the list):

  • storage backend call times
  • scale in/out success failure counts
  • autoscaler run time
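A hedged sketch of how such measurements might be emitted with the go-metrics library commonly used in the HashiCorp ecosystem; the metric key names are made up for illustration:

package main

import (
    "time"

    metrics "github.com/armon/go-metrics"
)

// scaleOut is a stand-in for the real scaling trigger.
func scaleOut(job, group string) error { return nil }

func timedScaleOut(job, group string) {
    // Record how long the scaling call took.
    defer metrics.MeasureSince([]string{"sherpa", "scale", "out", "duration"}, time.Now())

    if err := scaleOut(job, group); err != nil {
        metrics.IncrCounter([]string{"sherpa", "scale", "out", "failure"}, 1)
        return
    }
    metrics.IncrCounter([]string{"sherpa", "scale", "out", "success"}, 1)
}

func main() {
    timedScaleOut("example", "cache")
}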

Track scaling activities and retain for time period

Is your feature request related to a problem? Please describe.
Currently when a scaling action is triggered, nothing is stored internally to detail items such as the timestamp or the associated Nomad evaluation ID. In order to provide functionality such as scaling cooldowns, having internal state to reference is essential. The state can also help operators understand and debug the system, as Nomad does not provide any indication as to why a deployment was triggered.

Describe the solution you'd like.
Sherpa should have a simple (initially in-memory) state store which tracks basic information about the triggered event. The information should be available via the API and CLI for iteration, and should be periodically cleaned via some garbage collection method.

At a minimum, an individual scaling state entry should include the following (see the sketch after the list):

  • timestamp: a timestamp of when the event took place
  • scaling ID: an internally used UUID to track the scaling event
  • evaluation ID: the Nomad evaluation ID which resulted from the scaling event
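A minimal in-memory sketch of such a state store; the struct and method names are illustrative only:

package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/google/uuid"
)

// ScalingEvent captures the minimum detail proposed above.
type ScalingEvent struct {
    Time         time.Time
    ScalingID    string
    EvaluationID string
}

// MemoryState is a simple in-memory store; a garbage collection loop could
// later prune entries older than the retention period.
type MemoryState struct {
    mu     sync.Mutex
    events []ScalingEvent
}

// Record stores a new event with a generated scaling ID.
func (s *MemoryState) Record(evalID string) ScalingEvent {
    s.mu.Lock()
    defer s.mu.Unlock()
    e := ScalingEvent{Time: time.Now(), ScalingID: uuid.New().String(), EvaluationID: evalID}
    s.events = append(s.events, e)
    return e
}

// List returns a copy of the stored events for the API and CLI to iterate.
func (s *MemoryState) List() []ScalingEvent {
    s.mu.Lock()
    defer s.mu.Unlock()
    return append([]ScalingEvent(nil), s.events...)
}

func main() {
    state := &MemoryState{}
    state.Record("example-nomad-evaluation-id")
    fmt.Println(len(state.List()), "event(s) recorded")
}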

Autoscaling engine?

Hello, I love this name btw! The Sherpa helps the nomad go up and down the mountain! I was wondering: the documentation specifically mentions Prometheus Alertmanager, but what about Datadog? How would Datadog be hooked up to the autoscaling engine?

Currently we are supporting about 12 different Nomad clusters in our organization and we are unified on Datadog as our monitoring source; how would that integrate with this as a scaling solution?

Add simple UI to provide scaling event overview

Is your feature request related to a problem? Please describe.
Running CLI or API commands is sometimes not the most efficient way to get an overview of the system state. UIs allow operators to quickly view particular metrics to get an idea of the system, and can even be displayed on wall monitors if desired.

Describe the solution you'd like.
A small UI to show scaling events in a manner similar to Fabio's route table display.

Repository Archived

Early this year I joined HashiCorp to work directly on the Nomad ecosystem. This work has resulted in the creation of nomad-autoscaler which provides horizontal app and cluster scaling. The nomad-autoscaler is the officially supported autoscaler for Nomad and has a dedicated team working on it. It therefore makes sense to now archive this repository. Thanks to everyone who helped build this project.

terraform-provider: build TF provider to manage policies

This feature would be outside the Sherpa repo; but having a Terraform provider available to manage the configuration of scaling policies would allow for easy, standardized, and codified Sherpa deployments when using the API policy engine.

External scaling runtime error when group name differs from job name

Describe the bug
Sherpa autoscale panics upon job evaluation when external scaling is configured for a group with a different name than the job.

To reproduce
Run a job with a group named differently than the job itself, such as:

job "foo" {
  region = "global"

  datacenters = ["dc1"]

  type = "service"

  group "bar" {
    count = 1

    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }

    task "tail" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/tail"
        args = ["-f", "/dev/null"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}

Create a scaling policy for the group with external checks, such as:

{
  "Enabled": true,
  "MaxCount": 16,
  "MinCount": 1,
  "ScaleOutCount": 1,
  "ScaleInCount": 1,
  "ExternalChecks": {
    "prometheus_memory_in": {
      "Enabled": true,
      "Provider": "prometheus",
      "Query": "sum(nomad_client_allocs_memory_usage{task_group='cache'})/sum(nomad_client_allocs_memory_allocated{task_group='cache'})*100",
      "ComparisonOperator": "less-than",
      "ComparisonValue": 30,
      "Action": "scale-in"
    },
    "prometheus_memory_out": {
      "Enabled": true,
      "Provider": "prometheus",
      "Query": "sum(nomad_client_allocs_memory_usage{task_group='cache'})/sum(nomad_client_allocs_memory_allocated{task_group='cache'})*100",
      "ComparisonOperator": "greater-than",
      "ComparisonValue": 80,
      "Action": "scale-out"
    }
  }
}

Expected behavior
No runtime error

Environment

$ sherpa system info
Nomad Address                http://127.0.0.1:4646
Policy Engine                Sherpa API
Storage Backend              Consul
Internal AutoScaling Engine  true
Strict Policy Checking       true

Additional context
Pardon the json logs

{
     "time":"2020-01-09T16:22:26.788224800Z",
     "message":"worker with func exits from a panic: runtime error: invalid memory address or nil pointer dereference"
}
{
     "time":"2020-01-09T16:22:26.788402500Z",
     "message":"worker with func exits from panic: goroutine 72 [running]:
github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2.(*goWorkerWithFunc).run.func1.1(0xc0000ff230)
	/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2/worker_func.go:58 +0x123
panic(0xa5ec80, 0x10b0ff0)
	/home/travis/.gimme/versions/go1.12.linux.amd64/src/runtime/panic.go:522 +0x1b5
github.com/jrasell/sherpa/pkg/autoscale.(*autoscaleEvaluation).choseCorrectDecision(0xc000181680, 0xc0002eff00, 0x6, 0xc000283838, 0xc00011c568)
	/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/decision.go:146 +0x232
github.com/jrasell/sherpa/pkg/autoscale.(*autoscaleEvaluation).calculateExternalScalingDecision(0xc000181680, 0xc0002eff00, 0x6, 0xc0001c1f20, 0xa)
	/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/decision.go:90 +0x344
github.com/jrasell/sherpa/pkg/autoscale.(*autoscaleEvaluation).evaluateJob(0xc000181680)
	/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/autoscale.go:88 +0x603
github.com/jrasell/sherpa/pkg/autoscale.(*AutoScale).workerPoolFunc.func1(0x9f6260, 0xc0000ff200)
	/home/travis/gopath/src/github.com/jrasell/sherpa/pkg/autoscale/handler.go:258 +0x2c3
github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2.(*goWorkerWithFunc).run.func1(0xc0000ff230)
	/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2/worker_func.go:69 +0xb3
created by github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2.(*goWorkerWithFunc).run
	/home/travis/gopath/src/github.com/jrasell/sherpa/vendor/github.com/panjf2000/ants/v2/worker_func.go:49 +0x4d
"}

The policy list command while using storage-consul backend fails to list policies

Describe the bug
I have a local dev instance of consul and nomad running. Latest 0.0.2+dev build of sherpa. When using the storage-consul backend for policies, the "sherpa policy list" command will fail to list the policies.

To reproduce
Start up local consul -dev and nomad -dev instances. Start sherpa locally using the storage-consul-enabled flag.

$./sherpa policy init http-echo > http-echo.policy

$ jq . http-echo.policy

{
  "Enabled": true,
  "MaxCount": 16,
  "MinCount": 4,
  "ScaleOutCount": 2,
  "ScaleInCount": 2,
  "ScaleOutCPUPercentageThreshold": 75,
  "ScaleOutMemoryPercentageThreshold": 75,
  "ScaleInCPUPercentageThreshold": 30,
  "ScaleInMemoryPercentageThreshold": 30
}

$./sherpa policy write --policy-group-name="http-echo" http-echo http-echo.policy
Successfully wrote job group scaling policy

$./sherpa policy read http-echo
Group Enabled MinCount MaxCount ScaleInCount ScaleOutCount
http-echo true 4 16 2 2

$./sherpa policy list
Error querying policy list: unexpected response code 500: failed to unmarshal Consul KV value: unexpected end of JSON input

Expected behavior
I expect to see the output of all policies defined within the storage-consul backend

Environment:

  • Sherpa server information (retrieve with sherpa system info):
    Nomad Address http://127.0.0.1:4646
    Policy Engine Sherpa API
    Policy Storage Backend Consul
    Internal AutoScaling Engine false
    Strict Policy Checking true

  • Sherpa CLI version (retrieve with sherpa --version):
    sherpa version v0.0.2+dev
    Date: 2019-08-25 18:27:10.534241 +0000 UTC
    Commit: ba37aaa
    Branch: master
    State: clean

  • Server Operating System/Architecture:
    MacOS 10.14 / Darwin

  • Sherpa server configuration parameters:
    $./sherpa server --storage-consul-enabled --log-enable-dev --log-level=debug

  • Nomad client configuration parameters (if any):
    $./nomad agent -dev

  • Consul client configuration parameters (if any):
    $./consul agent -dev -enable-script-checks -node=web -ui

Additional context

Sherpa debug log:

12:11PM INF pkg/server/server.go:49 > starting HTTP server addr=127.0.0.1:8000
12:11PM INF pkg/server/server.go:65 > Sherpa server configuration server={"autoscaler-enabled":false,"autoscaler-evaluation-interval":60,"bind-addr":"127.0.0.1","bind-port":8000,"policy-engine-api-enabled":true,"policy-engine-nomad-meta-enabled":false,"policy-engine-strict-checking-enabled":true,"storage-consul-enabled":true,"storage-consul-path":"sherpa/policies/"} telemetry={"telemetry-statsd-address":"","telemetry-statsite-address":""} tls={"tls-cert-key-path":"","tls-cert-path":""}
12:11PM DBG pkg/server/server.go:137 > setting up Nomad client
12:11PM DBG pkg/server/server.go:128 > setting up Consul storage backend
12:11PM DBG pkg/server/routes.go:20 > setting up HTTP server routes
12:11PM INF pkg/server/routes.go:90 > starting Sherpa API policy engine
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetSystemHealth method=GET path=/v1/system/health
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetSystemInfo method=GET path=/v1/system/info
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetSystemMetrics method=GET path=/v1/system/metrics
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint ScaleOutJobGroup method=PUT path=/v1/scale/out/{job_id}/{group}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint ScaleInJobGroup method=PUT path=/v1/scale/in/{job_id}/{group}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetJobScalingPolicies method=GET path=/v1/policies
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetJobScalingPolicy method=GET path=/v1/policy/{job_id}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint GetJobGroupScalingPolicy method=GET path=/v1/policy/{job_id}/{group}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint PostJobScalingPolicy method=POST path=/v1/policy/{job_id}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint PostJobGroupScalingPolicy method=POST path=/v1/policy/{job_id}/{group}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint DeleteJobGroupScalingPolicy method=DELETE path=/v1/policy/{job_id}/{group}
12:11PM INF pkg/server/router/router.go:29 > mounting route endpoint DeleteJobScalingPolicy method=DELETE path=/v1/policy/{job_id}
12:11PM INF pkg/server/server.go:103 > HTTP server successfully listening addr=127.0.0.1:8000
12:11PM INF pkg/server/middleware.go:25 > server responded to request method=POST path=/v1/policy/http-echo/http-echo remote-addr=127.0.0.1:63953 response-code=201
12:11PM ERR pkg/policy/v1/policies.go:28 > failed to call policy backend error="failed to unmarshal Consul KV value: unexpected end of JSON input"
12:11PM INF pkg/server/middleware.go:25 > server responded to request method=GET path=/v1/policies remote-addr=127.0.0.1:63971 response-code=500
12:13PM INF pkg/server/middleware.go:25 > server responded to request method=POST path=/v1/policy/http-echo/http-echo remote-addr=127.0.0.1:64028 response-code=201
12:13PM INF pkg/server/middleware.go:25 > server responded to request method=GET path=/v1/policy/http-echo remote-addr=127.0.0.1:64039 response-code=200
12:13PM ERR pkg/policy/v1/policies.go:28 > failed to call policy backend error="failed to unmarshal Consul KV value: unexpected end of JSON input"
12:13PM INF pkg/server/middleware.go:25 > server responded to request method=GET path=/v1/policies remote-addr=127.0.0.1:64045 response-code=500
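One plausible explanation (only a guess at the failure mode, not an analysis of Sherpa's code) is that a Consul KV prefix listing can include keys with empty values, which fail JSON unmarshalling with exactly this error. A defensive sketch using the Consul API client that skips such entries:

package main

import (
    "encoding/json"
    "log"

    "github.com/hashicorp/consul/api"
)

// policy is a trimmed stand-in for the scaling policy document.
type policy struct {
    Enabled  bool
    MinCount int
    MaxCount int
}

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    pairs, _, err := client.KV().List("sherpa/policies/", nil)
    if err != nil {
        log.Fatal(err)
    }

    for _, p := range pairs {
        // A prefix listing may return keys with empty values (for example a
        // "directory" key); unmarshalling those would fail with
        // "unexpected end of JSON input", so skip them.
        if len(p.Value) == 0 {
            continue
        }
        var pol policy
        if err := json.Unmarshal(p.Value, &pol); err != nil {
            log.Printf("skipping %s: %v", p.Key, err)
            continue
        }
        log.Printf("%s: %+v", p.Key, pol)
    }
}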

scaling: allow for the submission of key/value metadata on scaling trigger

Is your feature request related to a problem? Please describe.
Scaling state includes details about the job, group, count, direction, and which portion of the application made the request. This gives a nice overview, but it is lacking in certain areas, particularly when using webhooks via the API.

Describe the solution you'd like.
When triggering a scaling action, an optional list of key/values can be passed along with the request. These can be free-form strings, which allow operators to send useful information such as the metric which caused the action, the value at the time it was triggered, and even the provider. These can then be stored in the state for inspection at a later date, providing a much better operator experience.
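A small sketch of what a request payload carrying optional metadata might look like; the field names are illustrative:

package main

import (
    "encoding/json"
    "fmt"
)

// ScaleRequest is a hypothetical API payload carrying free-form metadata
// alongside the scaling trigger.
type ScaleRequest struct {
    Count int               `json:"Count,omitempty"`
    Meta  map[string]string `json:"Meta,omitempty"`
}

func main() {
    req := ScaleRequest{
        Count: 1,
        Meta: map[string]string{
            "provider": "prometheus",
            "metric":   "nomad_client_allocs_memory_usage",
            "value":    "87.5",
        },
    }
    b, _ := json.MarshalIndent(req, "", "  ")
    fmt.Println(string(b))
}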

Meta policy watcher should be leader protected and update backend

Currently the policy meta engine will not run if the Consul backend is enabled, due to the legacy way in which the storage was built. In addition, the meta policy watcher is not leader protected, so multiple Sherpa servers will process job meta changes.

The meta policy engine should be updated to allow it to run in conjunction with the Consul storage backend. The meta policy engine should also be leader protected, so that only one Sherpa server processes updates.

the consul client should be reused across backends

Currently a new Consul client is created whenever a backend process needs access to Consul. This is neither dangerous nor particularly inefficient, but it would be better if a single client were created and then passed to each sub-process that requires it.

autoscaler: allow use of custom metric sources in autoscaler evaluation

Is your feature request related to a problem? Please describe.
In order to scale on custom metrics, the metric store (such as Prometheus) must be configured to send webhook requests to the Sherpa API when a metric is found to violate a policy. This can involve setting up and configuring additional infrastructure components and, depending on the operator's experience with those applications, can be somewhat time consuming.

Describe the solution you'd like.
The autoscaler and therefore the scaling policy document should be updated to allow the use of metrics from external providers. These values can then be used to make scaling decisions alongside using the standard memory and CPU metrics which are calculated from Nomad.
