leanercloud / autospotting

Saves up to 90% of AWS EC2 costs by automating the use of spot instances on existing AutoScaling groups. Installs in minutes using CloudFormation or Terraform. Convenient to deploy at scale using StackSets. Uses tagging to avoid launch configuration changes. Automated spot termination handling. Reliable fallback to on-demand instances.

Home Page: https://autospotting.io

License: Open Software License 3.0

Makefile 1.35% Go 98.43% Dockerfile 0.11% HTML 0.11%

aws autoscaling-groups aws-lambda spot-instances cost ec2 amazon-web-services terraform-module infrastructure aws-autoscaling


autospotting's Issues

Write tests for autoscaling.go

Working with Elastic Beanstalk

Just tested it with Elastic Beanstalk and it works great. Just add the tags when creating a new environment.

Maybe add to README?

Roey

Consider network performance attributes when choosing new instance types

Check the following attributes, only offering instance types whose characteristics are superior to those of the original instance:

Network Performance rating
Enhanced networking
EBS optimized IOPS
EBS Optimized Throughput
EBS Optimized MaxBandwidth

Temporarily blacklist spot instance types after multiple launch failures in a given AZ

This is an idea I got while discussing #25.

We should have a way to track instance launch failures for a given instance type/AZ combination, and somehow temporarily blacklist them for a few hours if spot instance requests fail to be fulfilled over multiple runs in a row.
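A minimal Go sketch of such a tracker follows. The type and method names, the failure threshold, and the ban duration are all hypothetical and not taken from the autospotting code. Since Lambda invocations do not share memory, a real implementation would also need to persist this state somewhere between runs, which the sketch leaves out.

```go
package main

import (
	"fmt"
	"time"
)

// pool identifies an instance type / availability zone combination.
type pool struct {
	InstanceType string
	AZ           string
}

// failureTracker counts consecutive launch failures per pool and bans a
// pool for banFor once the count reaches threshold. Both settings are
// hypothetical, not existing autospotting options.
type failureTracker struct {
	threshold   int
	banFor      time.Duration
	failures    map[pool]int
	bannedUntil map[pool]time.Time
}

func newFailureTracker(threshold int, banFor time.Duration) *failureTracker {
	return &failureTracker{
		threshold:   threshold,
		banFor:      banFor,
		failures:    make(map[pool]int),
		bannedUntil: make(map[pool]time.Time),
	}
}

// RecordFailure notes a failed launch and starts a ban once the number of
// consecutive failures reaches the threshold.
func (f *failureTracker) RecordFailure(p pool, now time.Time) {
	f.failures[p]++
	if f.failures[p] >= f.threshold {
		f.bannedUntil[p] = now.Add(f.banFor)
		delete(f.failures, p)
	}
}

// RecordSuccess resets the consecutive failure count for the pool.
func (f *failureTracker) RecordSuccess(p pool) {
	delete(f.failures, p)
}

// Blacklisted reports whether the pool is still banned at the given time.
func (f *failureTracker) Blacklisted(p pool, now time.Time) bool {
	until, ok := f.bannedUntil[p]
	return ok && now.Before(until)
}

func main() {
	tracker := newFailureTracker(3, 4*time.Hour)
	p := pool{InstanceType: "cg1.4xlarge", AZ: "us-east-1d"}
	now := time.Now()
	for i := 0; i < 3; i++ {
		tracker.RecordFailure(p, now)
	}
	fmt.Println(tracker.Blacklisted(p, now.Add(time.Hour)))   // still banned
	fmt.Println(tracker.Blacklisted(p, now.Add(5*time.Hour))) // ban expired
}
```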

Implement global configuration options when executed from Lambda

Currently the autospotting CLI tool supports a number of flags when executed manually, but those are completely ignored when executed from Lambda, which is not yet configurable.

This should be implemented by customizing our CloudWatch events.

The CloudFormation stack should have configuration parameters corresponding to each of the global command-line flags
A custom CloudWatch event should be generated by the CloudFormation template based on those stack parameters
The event data should be handled by the autospotting Lambda handler to generate the configuration data structure, just as is done for the command-line flags
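The last step could look roughly like the Go sketch below, which overlays a JSON event payload on top of default values. The option names (`regions`, `min_on_demand_number`) and the `config` struct are assumptions made for illustration only; the real flags and event format are not specified in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config holds a hypothetical subset of the global options. The field and
// JSON key names are illustrative assumptions.
type config struct {
	Regions           string `json:"regions"`
	MinOnDemandNumber int    `json:"min_on_demand_number"`
}

// defaultConfig returns the values used when the event sets nothing.
func defaultConfig() config {
	return config{Regions: "", MinOnDemandNumber: 0}
}

// configFromEvent overlays the JSON payload of the scheduled event on the
// defaults. Keys missing from the payload keep their default values.
func configFromEvent(payload []byte) (config, error) {
	cfg := defaultConfig()
	if len(payload) == 0 {
		return cfg, nil
	}
	if err := json.Unmarshal(payload, &cfg); err != nil {
		return defaultConfig(), fmt.Errorf("parsing event payload: %w", err)
	}
	return cfg, nil
}

func main() {
	event := []byte(`{"regions": "eu-west-1", "min_on_demand_number": 2}`)
	cfg, err := configFromEvent(event)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Starting from the defaults keeps the Lambda path and the command-line path consistent when a stack parameter is left unset.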

Add new region: Canada-central

This depends on vantage-sh/ec2instances.info#214

Investigate and implement a better way of logging

The idea is to have different logging levels, allowing cleaner, shorter output by default and more detailed output when a bug needs to be investigated.

Use the average spot price over the last day/week instead of the current one

This is related to #3

Failures with AutoScaling groups defined on non-default VPC

Allow keeping some residual on-demand capacity when configured as such

This depends on #5, but once that's in place we can have a way to configure the algorithm to keep some on-demand capacity.

Some spot instances are created without tags

Some spot requests are fulfilled, but no 'launched-for-asg' tag is assigned to the resulting instance.

03:08:27 autoscaling.go:443: production-ecs-cluster Created spot instance request sir-dbqr5wjj
03:08:27 autoscaling.go:483: production-ecs-cluster Failed to create tags for the spot instance request InvalidSpotInstanceRequestID.NotFound: The spot instance request ID 'sir-dbqr5wjj' does not exist
03:08:28 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:28 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:33 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:33 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:38 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:38 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:43 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:43 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:48 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:48 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:53 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:53 SpotInstanceRequestId: "sir-dbqr5wjj",
03:08:58 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:08:58 SpotInstanceRequestId: "sir-dbqr5wjj",
03:09:04 autoscaling.go:335: production-ecs-cluster Refreshed details for sir-dbqr5wjj {
03:09:04 SpotInstanceRequestId: "sir-dbqr5wjj",

Write tests for instance.go

Write tests for spot_instance_request.go

Use eawsy/aws-lambda-go for packaging

This has a couple of benefits

simpler, less custom build scripts
build in Docker container, so it doesn't need anything but docker installed for development instead of a full Golang toolchain
remove the Python wrapper entirely
slightly faster execution

Possible issues

the shipping of the instance information would need some refactoring, maybe we could be using go-bindata instead of including a blob
local execution would need some investigation

Investigate group capacity set to 0 instances

Investigate weighting of instance types

This may allow us to replace a number of smaller instances with a single bigger instance, as long as the price is proportionally lower.

Also depends on #5

Fix handling of instance storage mapping

There is a bug in instance store handling.

If the launch configuration has more device mappings than are available on the specified instance type, some of them are ignored when launching the instances, so the instance actually has fewer ephemeral instance store volumes than specified in the launch configuration.

Since we compare storage in the instance information with the number of ephemeral devices in the launch configuration, this causes the storage comparison to fail for a lot of otherwise compatible and closely priced instances, potentially leaving us only with much more expensive instances likely to fail when compared by price.

The storage comparison should instead consider the minimum between the number of ephemeral devices specified in the launch configuration and the number of devices available for that instance type.
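The proposed comparison can be sketched in Go as follows. This is one reading of the fix, with hypothetical function names: the required volume count is the minimum of the launch configuration's ephemeral mappings and the original instance type's volumes, and a candidate is compatible if it offers at least that many.

```go
package main

import "fmt"

// effectiveVolumes returns how many ephemeral volumes an instance actually
// gets: mappings beyond what the instance type offers are ignored.
func effectiveVolumes(lcMappings, typeVolumes int) int {
	if lcMappings < typeVolumes {
		return lcMappings
	}
	return typeVolumes
}

// storageCompatible reports whether a candidate instance type provides at
// least as many ephemeral volumes as the original instance effectively has.
func storageCompatible(lcMappings, originalVolumes, candidateVolumes int) bool {
	required := effectiveVolumes(lcMappings, originalVolumes)
	return candidateVolumes >= required
}

func main() {
	// Launch configuration with 4 mappings, original type with 1 volume:
	// only 1 volume is really in use, so a 2-volume candidate qualifies.
	fmt.Println(storageCompatible(4, 1, 2))
	// A candidate without instance storage does not qualify.
	fmt.Println(storageCompatible(4, 1, 0))
}
```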

undefined: Asset

I'm trying to test this out but I'm having trouble even getting it running locally. Following the SETUP.md, I get this:

$ go get github.com/cristim/autospotting
# github.com/cristim/autospotting
src/github.com/cristim/autospotting/autospotting.go:53: undefined: Asset
src/github.com/cristim/autospotting/autospotting.go:58: undefined: Asset

Configuration option for the number of instances to be allowed per instance-type/AZ combination

At the moment this number is hardcoded to 20% in the autospotting instance replacement logic, which should be changed.

we need a configuration option added to the logic that would allow an arbitrary number (the related logic may also need some clean-ups).
the new option should be exposed through a new command-line flag, defaulting to the current hardcoded value.
the new flag needs to be exposed by the CloudFormation stack as a parameter, also with the same default value.
it needs to be configurable on a group level override using tags
the changed code needs to have unit tests
the new option needs to be documented in the README, perhaps in multiple sections if applicable.

Update regional support for R3 and D2 instance types

AWS announced increased regional availability of R3 and D2 instance types: https://aws.amazon.com/about-aws/whats-new/2016/10/announcing-regional-expansion-of-amazon-ec2-instances/

Pick the replaced on-demand instances based on the uptime

The instances should be replaced in a way that minimizes the wasted runtime hours that the user gets charged. We should pick the on-demand instance which will be closer to a full instance hour when eventually terminated.

Max((uptime_minutes + grace_period_minutes + 15 ) % 60)

The 15min is a buffer that allows us to launch a spot instance and replace the on-demand one. It could also be parameterized, although I don't think it's worth the effort.
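As a rough illustration of the formula, the Go sketch below scores each on-demand instance by how far into its current billing hour it would be at replacement time and picks the highest score. The struct and function names are hypothetical.

```go
package main

import "fmt"

// onDemandInstance is a hypothetical, minimal view of a running instance.
type onDemandInstance struct {
	ID            string
	UptimeMinutes int
}

// replacementBufferMinutes is the 15 minute buffer from the formula above.
const replacementBufferMinutes = 15

// minutesIntoHour estimates how far into its billing hour the instance will
// be once the grace period and the replacement buffer have elapsed.
func minutesIntoHour(uptimeMinutes, graceMinutes int) int {
	return (uptimeMinutes + graceMinutes + replacementBufferMinutes) % 60
}

// pickInstanceToReplace returns the instance with the highest score, that
// is, the one closest to completing a full instance hour when terminated.
func pickInstanceToReplace(instances []onDemandInstance, graceMinutes int) (onDemandInstance, bool) {
	var best onDemandInstance
	bestScore := -1
	for _, inst := range instances {
		if s := minutesIntoHour(inst.UptimeMinutes, graceMinutes); s > bestScore {
			best, bestScore = inst, s
		}
	}
	return best, bestScore >= 0
}

func main() {
	instances := []onDemandInstance{
		{ID: "i-a", UptimeMinutes: 10}, // score 25
		{ID: "i-b", UptimeMinutes: 40}, // score 55
		{ID: "i-c", UptimeMinutes: 50}, // score 5, wraps into the next hour
	}
	best, ok := pickInstanceToReplace(instances, 0)
	fmt.Println(best.ID, ok)
}
```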

Make it self-contained, and disable auto-updates

Fork the code into a self-contained implementation which is packaged entirely into the Lambda function's code, without external runtime dependencies.

This may be done through a port to one of the Lambda-based frameworks such as apex, serverless or sparta, after investigating which one of those is the best for us.

Agent Fault during execution

Getting the following error:

`autoscaling.go:732: memory compatible, continuing evaluation
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:22 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:23 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:23 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:23 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:23 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:24 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:24 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:24 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:24 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:24 Request body type has been overwritten. May cause race conditions
2016/10/05 09:08:26 Request body type has been overwritten. May cause race conditions
autoscaling.go:489: Throttling: Rate exceeded
status code: 400, request id: 467a74c5-8adb-11e6-b8f6-5b4ea7c39369
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x478404]

goroutine 151 [running]:
panic(0x84c520, 0xc420014040)
/home/travis/.gimme/versions/go1.7.linux.amd64/src/runtime/panic.go:500 +0x1a1
github.com/cristim/autospotting/core.(*autoScalingGroup).countAttachedInstanceStoreVolumes(0xc420252740, 0xc42066ee40)
/home/travis/gopath/src/github.com/cristim/autospotting/core/autoscaling.go:842 +0x34
github.com/cristim/autospotting/core.(*autoScalingGroup).getCompatibleSpotInstanceTypes(0xc420252740, 0xc4203f5660, 0xa, 0xc4202ef800, 0xc420641ce0, 0x38, 0x0)
/home/travis/gopath/src/github.com/cristim/autospotting/core/autoscaling.go:747 +0xa16
github.com/cristim/autospotting/core.(*autoScalingGroup).getCheapestCompatibleSpotInstanceType(0xc420252740, 0xc4203f5660, 0xa, 0xc4202ef800, 0xc420641ca0)
/home/travis/gopath/src/github.com/cristim/autospotting/core/autoscaling.go:665 +0x22f
github.com/cristim/autospotting/core.(*autoScalingGroup).launchCheapestSpotInstance(0xc420252740, 0xc42032b338)
/home/travis/gopath/src/github.com/cristim/autospotting/core/autoscaling.go:361 +0x2f1
github.com/cristim/autospotting/core.(*autoScalingGroup).process(0xc420252740)
/home/travis/gopath/src/github.com/cristim/autospotting/core/autoscaling.go:63 +0x55b
github.com/cristim/autospotting/core.(*region).processEnabledAutoScalingGroups.func1(0xc4204ec420, 0xc42022cce0, 0x16, 0xc4204ec420, 0xc42025a280, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/cristim/autospotting/core/region.go:195 +0x74
created by github.com/cristim/autospotting/core.(*region).processEnabledAutoScalingGroups
/home/travis/gopath/src/github.com/cristim/autospotting/core/region.go:197 +0x112
END RequestId: 8e105fb4-8ada-11e6-83d3-57bdd7783738
REPORT RequestId: 8e105fb4-8ada-11e6-83d3-57bdd7783738 Duration: 9197.08 ms Billed Duration: 9200 ms Memory Size: 128 MB Max Memory Used: 15 MB `

Choose more reliable instance types

Use the Spot Bid Advisor information about the likelihood of instances being terminated, in order to prefer instances unlikely to be terminated in the near future.

Refactoring to split autoscaling.go in multiple components.

Increase the automated test coverage to an acceptable value

I'd like to see it achieve somewhere around 80%

Summary of files:

autoscaling (ref https://github.com/cristim/autospotting/issues/41)
connections (no ref)
instance (ref https://github.com/cristim/autospotting/issues/42)
launch_configuration (ref https://github.com/cristim/autospotting/issues/43)
main (ref https://github.com/cristim/autospotting/issues/272)
region (ref https://github.com/cristim/autospotting/issues/44)
spot_instance_request (ref https://github.com/cristim/autospotting/issues/45)
spot_price (no ref)

React on the 2min termination notice

Each of the instances about to be terminated can poll a metadata entry which will specify when the termination is imminent.

This should be handled in some way that may be specific to the application, so the user should be allowed to run some code that cleanly takes that instance out of the pool.
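A minimal Go sketch of such a check follows, assuming the instance metadata path `spot/termination-time`, which returns 404 until a termination is scheduled and an RFC 3339 timestamp afterwards. The function names are hypothetical, the sketch performs a single check rather than a polling loop, and instances that enforce session tokens for metadata access would need an extra token request that is not shown here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// terminationURL is the metadata entry assumed to carry the notice.
const terminationURL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

// parseTerminationNotice interprets a metadata response: 404 means no
// termination is scheduled, 200 carries the termination timestamp.
func parseTerminationNotice(status int, body string) (time.Time, bool, error) {
	switch status {
	case http.StatusNotFound:
		return time.Time{}, false, nil
	case http.StatusOK:
		when, err := time.Parse(time.RFC3339, strings.TrimSpace(body))
		if err != nil {
			return time.Time{}, false, err
		}
		return when, true, nil
	default:
		return time.Time{}, false, fmt.Errorf("unexpected status %d", status)
	}
}

// checkTermination queries the metadata entry once.
func checkTermination(client *http.Client, url string) (time.Time, bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return time.Time{}, false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return time.Time{}, false, err
	}
	return parseTerminationNotice(resp.StatusCode, string(body))
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	when, pending, err := checkTermination(client, terminationURL)
	switch {
	case err != nil:
		fmt.Println("metadata check failed:", err)
	case pending:
		// This is where a user-supplied hook would drain the instance.
		fmt.Println("termination scheduled for", when)
	default:
		fmt.Println("no termination scheduled")
	}
}
```

In practice this check would run every few seconds on the instance itself, and the user-defined cleanup code would be invoked the first time a termination is reported.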

Spot requests fail when no SSH key is configured

Hi!

While playing around with autospotting, I configured a Launch Configuration that did not define an SSH key. In this setup, the spot requests failed with this error message:

failed: Invalid value '' for keyPairNames. It should not be blank (400 response code)
bad-parameters: Your Spot request failed due to bad parameters.

It would be nice if autospotting supported this setup.

Paginate all API calls where it is applicable

This issue was discovered in #36, it may become a problem on AWS accounts with lots of resources.
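The general shape of a paginated call is a loop that keeps requesting pages until no continuation token comes back. The Go sketch below shows that loop against a stand-in fetch function; `pageFetcher` and `fetchAll` are hypothetical names, and a real implementation would wrap the relevant AWS SDK call, or use the SDK's own pagination helpers where they exist.

```go
package main

import "fmt"

// pageFetcher returns one page of results and the token for the next page.
// An empty next token means there are no more pages.
type pageFetcher func(token string) (items []string, next string, err error)

// fetchAll follows the continuation tokens until every page is collected.
func fetchAll(fetch pageFetcher) ([]string, error) {
	var all []string
	token := ""
	for {
		items, next, err := fetch(token)
		if err != nil {
			return nil, err
		}
		all = append(all, items...)
		if next == "" {
			return all, nil
		}
		token = next
	}
}

func main() {
	// Fake two-page API standing in for a real AWS call.
	pages := map[string][]string{"": {"asg-1", "asg-2"}, "page2": {"asg-3"}}
	nextToken := map[string]string{"": "page2", "page2": ""}

	groups, err := fetchAll(func(token string) ([]string, string, error) {
		return pages[token], nextToken[token], nil
	})
	fmt.Println(groups, err)
}
```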

Failed to tag instance

This has happened before. The waitFor function should be enough, but it looks like the request itself doesn't exist.

The instance was launched successfully with the right user data.

autoscaling.go:313: test-ecs-cluster Waiting for spot instance for spot instance request sir-7tag5hag
autoscaling.go:468: awseb-e-sku5tnq3xu-stack-AWSEBAutoScalingGroup-1F01KRYAGNGAD Failed to create tags for the spot instance request InvalidSpotInstanceRequestID.NotFound: The spot instance request ID 'sir-r1vi76vg' does not exist
status code: 400, request id: d00b11c4-ba27-4934-ab5e-5a189b6b1c64

Consider availability zone when picking an instance type

Our instances all run in the us-east-1d availability zone. When the autospotting process runs, it finds the cg1.4xlarge instance type as the cheapest option and tries to use that to request spot instances. Unfortunately, the cg1.4xlarge instance type isn't available in the us-east-1d AZ, only in us-east-1c. We get this error on the spot request: "capacity-not-available: There is no Spot capacity available that matches your request."

For use cases like ours that are limited to a specific AZ, it would be really helpful to consider the AZ when retrieving spot pricing info, and only pick an instance type if it's available in the AZ.

When this happens, it continues to request a new spot instance each time the process runs, which fairly quickly uses up the maximum number of open spot instance requests that AWS allows, preventing other spot instance requests.
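One way to approach this, sketched in Go below, is to keep the availability zone on each spot price entry and only consider entries for the zone the group runs in. The `spotPrice` struct and `cheapestInAZ` function are hypothetical, and the prices are made up for the example. Having a price entry for a zone does not guarantee capacity there, so this narrows the problem rather than eliminating it.

```go
package main

import "fmt"

// spotPrice is a hypothetical, minimal view of one spot price entry.
type spotPrice struct {
	InstanceType string
	AZ           string
	Price        float64
}

// cheapestInAZ returns the lowest-priced entry for the given availability
// zone. Instance types without an entry in that zone are never considered.
func cheapestInAZ(prices []spotPrice, az string) (spotPrice, bool) {
	var best spotPrice
	found := false
	for _, p := range prices {
		if p.AZ != az {
			continue
		}
		if !found || p.Price < best.Price {
			best, found = p, true
		}
	}
	return best, found
}

func main() {
	prices := []spotPrice{
		{InstanceType: "cg1.4xlarge", AZ: "us-east-1c", Price: 0.05},
		{InstanceType: "m3.medium", AZ: "us-east-1d", Price: 0.08},
		{InstanceType: "c3.large", AZ: "us-east-1d", Price: 0.07},
	}
	best, ok := cheapestInAZ(prices, "us-east-1d")
	fmt.Println(best.InstanceType, ok)
}
```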

Make the algorithm configurable

This could be achieved using some additional metadata specified in the tag set on the AutoScaling group.

Try it out on EC2 classic and fix any issues

The current code was recently changed in order to work on VPC, and was only tested on VPC and DefaultVPC environments, so it may have regressions on EC2 Classic.

Those need to be checked and ironed out if found.

Properly handle SecurityGroups in DefaultVPC environments

When using a stack created for EC2 classic, the groups are created by name.

In VPC (including DefaultVPC) they need to be given by ID.

The code may need to query the groups by name and return their ID, and pass them by ID on VPC environments.

Write tests for launch_configuration.go

Support the new Ohio region and recently released instance types

Fixed in a2be373, available for installation as of the TravisCI build 63

Write tests for region.go

Support more Spot market products, not only Linux/UNIX

Add support for the new London region

Review all merged non-approved pull requests.

All the code that was already merged needs to be reviewed after the fact.

All identified problems would need to be raised as new issues and fixed in new pull requests.

Improve the instance replacement logic

The code needs to be cleaned up at least as per some of the code review comments posted on #46.

Any function that is changed needs to have its unit tests updated or created if missing.

Take into account the current reserved instance usage before launching spot instances

Add support for Windows, and use the correct spot product for Linux

It should also work for Windows instances, not just Linux/Unix.

On Linux/Unix, also take into account the Suse and VPC/Classic spot price variations, since at the moment just Linux/Unix is used, which may sometimes cause problems.

Make Agent Run for all ASG in each iteration, packaging discussions

It looks like each iteration of the Lambda function runs for only one ASG, which makes the replacement process very long in environments with multiple ASGs.

CodeDeploy support

My current autoscaling groups use CodeDeploy to deploy the web applications. CodeDeploy uses Hooks in the AutoScaling Group for that, is this supported out of the box?

Keep at least one (or more) reserved instance running

I have a web application, and it is essential to keep a number of instances always running so that the application stays accessible even if all the spot instances are shut down.

Is it possible to configure Autospotting to keep one (or a number of) reserved instances without replacement while replacing the rest of the instances with spot ones?

Whitelist/blacklist certain instance types via configuration

I saw forks that hard-coded some specific instance types they need in their environment. Other people complained that some instances are problematic in certain availability zones and had them hardcoded out.

This could be configurable using CloudFormation stack parameters, passed as variables to the Lambda function now that Lambda supports this feature. We could add two new CloudFormation stack options:

WhiteListedInstanceTypes: m3.medium,c4.large
BlackListedInstanceTypes: c3.large

On the other hand, I think such a global setting may not be desirable in some cases, so it may be better to also allow applying it on a per-group basis, using additional tags set on the AutoScaling group:

AutoSpotting_WhiteListedInstanceTypes: c3.large,c4.large
AutoSpotting_BlackListedInstanceTypes: c3.xlarge
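The filtering itself could be as simple as the Go sketch below, which parses the comma-separated values shown above and applies them to a list of candidate instance types. The function names are hypothetical. How the per-group tags should take precedence over the global stack parameters is left open here, as the issue does not settle it.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTypeList turns a comma-separated value such as "m3.medium,c4.large"
// into a set, ignoring surrounding whitespace and empty entries.
func parseTypeList(value string) map[string]bool {
	set := make(map[string]bool)
	for _, item := range strings.Split(value, ",") {
		if t := strings.TrimSpace(item); t != "" {
			set[t] = true
		}
	}
	return set
}

// filterTypes keeps the candidates that appear in the whitelist (when it is
// non-empty) and do not appear in the blacklist.
func filterTypes(candidates []string, whitelist, blacklist string) []string {
	allowed, denied := parseTypeList(whitelist), parseTypeList(blacklist)
	var kept []string
	for _, c := range candidates {
		if len(allowed) > 0 && !allowed[c] {
			continue
		}
		if denied[c] {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

func main() {
	candidates := []string{"m3.medium", "c3.large", "c3.xlarge", "c4.large"}
	fmt.Println(filterTypes(candidates, "c3.large,c4.large", " c3.xlarge"))
	fmt.Println(filterTypes(candidates, "", "c3.large"))
}
```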

Failed to describe AutoScaling tags in <region> AccessDenied

I'm getting the following error:

region.go:248: Failed to describe AutoScaling tags in us-east-1 AccessDenied: User: arn:aws:sts::308824460317:assumed-role/AutoSpotting-LambdaExecutionRole-17GHN91Z6W9OE/AutoSpotting-LambdaFunction-V9LHZ91FKJK4 is not authorized to perform: autoscaling:DescribeTags

I worked around it by adding the "AmazonEC2FullAccess" policy to the autospotting role.

I'm using a very early version of the stack, so maybe the role needs to be updated.

Currently this is the structure of the role:

{
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:AttachInstances",
                "autoscaling:DetachInstances",
                "ec2:CreateTags",
                "ec2:DescribeInstances",
                "ec2:DescribeRegions",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeSpotPriceHistory",
                "ec2:RequestSpotInstances",
                "ec2:TerminateInstances",
                "iam:PassRole",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Document self-hosting binaries

This was achieved by @nmeierpolys while trying to work around #25, and it would be nice to have it documented in a new Markdown file, which would be referenced from the main readme.

Do not process AutoScaling groups while the CloudFormation stack that created them is in progress

We use cfn init to publish a success signal. It would be cool if autospotting could read that and ensure that a replacement of an instance resulted in a successful deployment.