keikoproj / minion-manager

Intelligent use of Spot Instances in Kubernetes

License: Apache License 2.0

Makefile 1.24% Python 97.97% Dockerfile 0.56% Shell 0.24%

autoscaling-groups cost-effectiveness kubernetes spot-instances

minion-manager's Issues

Bug: LaunchTemplate causes controller to crash

It looks like tagging an ASG that uses a LaunchTemplate with the minion-manager tag causes the controller to crash.
Regardless of whether we would like to support launch templates in the future, we should avoid crashing when the LaunchConfigurationName field is missing.

Traceback (most recent call last):
  File "./minion_manager.py", line 62, in <module>
    run()
  File "./minion_manager.py", line 58, in run
    minion_manager.run()
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 727, in run
    str(ex))
Exception: Failed to discover/populate current ASG info: LaunchConfigurationName
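
One way to avoid the crash could look like the sketch below. It assumes ASG descriptions arrive as dictionaries shaped like the AutoScalingGroups entries of a DescribeAutoScalingGroups response. The helper name split_by_launch_config is hypothetical and is not taken from the minion-manager code.

```python
import logging

logger = logging.getLogger("minion-manager-sketch")


def split_by_launch_config(asgs):
    """Separate ASGs that have a launch configuration from those that do not.

    `asgs` is assumed to be a list of dicts shaped like the entries of
    `AutoScalingGroups` in a DescribeAutoScalingGroups response.
    """
    supported, skipped = [], []
    for asg in asgs:
        # .get() returns None instead of raising KeyError when the ASG is
        # backed by a launch template and the key is absent.
        if asg.get("LaunchConfigurationName"):
            supported.append(asg)
        else:
            logger.warning("Skipping ASG %s: no LaunchConfigurationName",
                           asg.get("AutoScalingGroupName", "<unknown>"))
            skipped.append(asg)
    return supported, skipped
```

The controller could then operate only on the first list and keep running when a launch-template ASG carries the tag.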

Spot instances can be terminated without price change (and new ones gotten after long time)

We noticed that spot instances were terminated without any bid or spot price changes. It seems AWS can terminate them and not provide replacements immediately. We may need to switch to on-demand when this happens. It will be a little hard to decide when to do so, but we need some mechanism.

Handle Cloudwatch Event for Spot instance termination

Minion manager should react to the spot instance termination event and switch to On Demand to catch very rapid bursts in Spot Price.
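
A minimal sketch of recognizing the event, assuming it is delivered as a dictionary in the usual CloudWatch Events shape (source "aws.ec2", detail-type "EC2 Spot Instance Interruption Warning", instance ID under detail). The function name is hypothetical, and how the event reaches minion-manager is left open.

```python
SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_id(event):
    """Return the instance ID from a spot interruption warning, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != SPOT_WARNING:
        return None
    # The warning carries the affected instance under detail/instance-id.
    return event.get("detail", {}).get("instance-id")
```

A caller could map the returned instance ID back to its ASG and trigger the switch to on-demand for that group.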

Cordon and drain nodes before termination

When the minion-manager switches from on-demand to spot instances, it currently just terminates the nodes. It would be good if the termination were preceded by cordoning and draining the node so that the pods on that node can move to a different node. This might also reduce any downtime that apps face because of the switch.
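
As a rough sketch of the idea, the helper below shells out to kubectl to cordon and then drain a node before it is terminated. It assumes kubectl is on the PATH and configured for the cluster. The function name, the timeout default, and the injectable runner are illustrative choices, not part of minion-manager.

```python
import subprocess


def cordon_and_drain(node_name, timeout_seconds=120,
                     runner=subprocess.check_call):
    """Cordon a node, then drain it, by running kubectl commands in order."""
    commands = [
        # Mark the node unschedulable so no new pods land on it.
        ["kubectl", "cordon", node_name],
        # Evict pods; DaemonSet-managed pods are left in place.
        ["kubectl", "drain", node_name, "--ignore-daemonsets",
         "--timeout={}s".format(timeout_seconds)],
    ]
    for command in commands:
        runner(command)
    return commands
```

Passing a different runner makes the helper testable without a cluster; with the default, a failed kubectl call raises subprocess.CalledProcessError and the node would not be terminated.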

Remove python2 compatibility code

Python 2 reached end of life earlier this year, so it may be worth switching to Python 3 to avoid potential security issues.

https://www.python.org/doc/sunset-python-2/

Move container image under argoproj

The current minion-manager docker image is under a personal docker-hub account. It should be moved under argoproj with the rest of the images.

Remove dependency on aws credentials for running unit tests

Currently, running make runs the unit tests, which require valid AWS credentials. This is because the unit tests invoke boto APIs that make real AWS API calls. Instead, the appropriate calls should be mocked with mocker or moto. This will also reduce the time the tests take to run.
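
The sketch below illustrates the mocking approach using unittest.mock from the standard library rather than mocker or moto, so it runs without extra dependencies. The function asg_names is a hypothetical stand-in for code that calls the autoscaling API; it is not taken from the repository.

```python
from unittest import mock


def asg_names(autoscaling_client):
    """Return the names of all ASGs reported by the given client."""
    response = autoscaling_client.describe_auto_scaling_groups()
    return [g["AutoScalingGroupName"] for g in response["AutoScalingGroups"]]


def test_asg_names_without_credentials():
    # The mock stands in for a boto autoscaling client, so no AWS call
    # is made and no credentials are needed.
    fake_client = mock.Mock()
    fake_client.describe_auto_scaling_groups.return_value = {
        "AutoScalingGroups": [
            {"AutoScalingGroupName": "nodes-a"},
            {"AutoScalingGroupName": "nodes-b"},
        ]
    }
    assert asg_names(fake_client) == ["nodes-a", "nodes-b"]
    fake_client.describe_auto_scaling_groups.assert_called_once_with()
```

Passing the client in as an argument is what makes the substitution easy; code that builds its own client internally would need patching instead.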

Switch to on-demand when capacity unavailable

Good Afternoon,

I was just wondering if it would be possible to add a feature to switch to on-demand instances when capacity is unavailable and the request cannot be fulfilled?

AWS can return multiple values per instance-type per region

We recently found that the minion-manager was not switching from on-demand to spot instances because it saw the spot-instance price as greater than the on-demand price. The on-demand instance price was seen to be 0.000 :-).

It turns out that this was happening because the AWS on-demand pricing endpoint was returning multiple values for that instance type in the same region. The current mechanism for gathering the on-demand instance price simply takes the price of the last entry for that instance type, and that price can be 0.00.

This is what we got in one case:

{'SKU': '2N2QH6UEJZ5GUPT8'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0464000000'	'PriceDescription': '$0.0464 per On Demand Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': ''	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'Used'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-BoxUsage:t2.medium'	'RateCode': '2N2QH6UEJZ5GUPT8.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking Supported': ''	'serviceCode': 'AmazonEC2'	
'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}
{'SKU': 'QT7848TA4YHDW5JE'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0464000000'	'PriceDescription': '$0.0464 per Unused Reservation Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': '2N2QH6UEJZ5GUPT8'	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'UnusedCapacityReservation'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-UnusedBox:t2.medium'	'RateCode': 'QT7848TA4YHDW5JE.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking 
Supported': ''	'serviceCode': 'AmazonEC2'	'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}
{'SKU': 'PRCADQFUQ6HZKBHK'	'OfferingClass': ''	'Group': ''	'Instance Capacity - xlarge': ''	'Instance Capacity - 16xlarge': ''	'PricePerUnit': '0.0000000000'	'PriceDescription': '$0.00 per Reservation Linux t2.medium Instance Hour'	'Storage': 'EBS only'	'Pre Installed S/W': 'NA'	'Instance': ''	'Normalization Size Factor': '2'	'Location': 'US East (Ohio)'	'Memory': '4 GiB'	'Physical Processor': 'Intel Xeon Family'	'operation': 'RunInstances'	'Dedicated EBS Throughput': ''	'Instance Capacity - 10xlarge': ''	'Instance Capacity - 4xlarge': ''	'To Location': ''	'From Location': ''	'Operating System': 'Linux'	'Product Family': 'Compute Instance'	'GPU': ''	'Intel Turbo Available': ''	'Intel AVX Available': ''	'Max IOPS Burst Performance': ''	'Instance Capacity - 32xlarge': ''	'ECU': 'Variable'	'Tenancy': 'Shared'	'Instance Capacity - 18xlarge': ''	'OfferTermCode': 'JRTCKXETXF'	'Instance Capacity - 9xlarge': ''	'Instance Capacity - 8xlarge': ''	'Processor Architecture': '32-bit or 64-bit'	'EBS Optimized': ''	'Group Description': ''	'Provisioned': ''	'Location Type': 'AWS Region'	'EffectiveDate': '2018-10-01'	'License Model': 'No License required'	'vCPU': '2'	'TermType': 'OnDemand'	'instanceSKU': '2N2QH6UEJZ5GUPT8'	'PurchaseOption': ''	'Instance Type': 't2.medium'	'Instance Capacity - 2xlarge': ''	'LeaseContractLength': ''	'Instance Capacity - large': ''	'StartingRange': '0'	'Max IOPS/volume': ''	'Max throughput/volume': ''	'To Location Type': ''	'Processor Features': 'Intel AVX; Intel Turbo'	'Intel AVX2 Available': ''	'GPU Memory': ''	'serviceName': 'Amazon Elastic Compute Cloud'	'Network Performance': 'Low to Moderate'	'Max Volume Size': ''	'CapacityStatus': 'AllocatedCapacityReservation'	'Instance Capacity - 12xlarge': ''	'Transfer Type': ''	'Elastic GPU Type': ''	'usageType': 'USE2-Reservation:t2.medium'	'RateCode': 'PRCADQFUQ6HZKBHK.JRTCKXETXF.6YS6EN2CT7'	'Instance Capacity - 24xlarge': ''	'Instance Family': 'General purpose'	'Currency': 'USD'	'Enhanced Networking 
Supported': ''	'serviceCode': 'AmazonEC2'	'Physical Cores': ''	'Instance Capacity - medium': ''	'Volume Type': ''	'Storage Media': ''	'EndingRange': 'Inf'	'Clock Speed': 'Up to 3.3 GHz'	'From Location Type': ''	'Unit': 'Hrs'	'Current Generation': 'Yes'}

The three entries differ in their price description:

'PriceDescription': '$0.0464 per On Demand Linux t2.medium Instance Hour'
'PriceDescription': '$0.0464 per Unused Reservation Linux t2.medium Instance Hour'
'PriceDescription': '$0.00 per Reservation Linux t2.medium Instance Hour'

Basically, minion-manager currently only supports on-demand instances (it does not support Reserved Instances). Therefore, only the "On Demand" price description above is relevant. But the current implementation of the price querying API does not take this into account.

To start with:

- It would be good to look specifically for "On Demand " in the price description and only consider that price.
- Add warnings if there are duplicates or if a price is being overwritten.
- Ensure that the price is not set to 0. If so... log LOUDLY!!
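
The three points above could be sketched roughly as follows. The function assumes each pricing row is a dictionary with the keys seen in the dump ('Instance Type', 'PricePerUnit', 'PriceDescription'). The function name and the choice to keep the first price on a conflict are assumptions for illustration, not the existing implementation.

```python
import logging

logger = logging.getLogger("minion-manager-sketch")


def on_demand_prices(rows):
    """Map instance type to its on-demand hourly price."""
    prices = {}
    for row in rows:
        # Skip "Reservation" and "Unused Reservation" entries.
        if "On Demand " not in row.get("PriceDescription", ""):
            continue
        instance_type = row["Instance Type"]
        price = float(row["PricePerUnit"])
        if price <= 0:
            logger.error("ZERO ON-DEMAND PRICE for %s, ignoring row",
                         instance_type)
            continue
        if instance_type in prices:
            logger.warning("Duplicate on-demand price for %s: keeping %s, "
                           "ignoring %s", instance_type,
                           prices[instance_type], price)
            continue
        prices[instance_type] = price
    return prices
```

With the three t2.medium rows shown in this issue, only the first row passes the filter, so the zero-priced reservation entry can no longer overwrite the real price.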

Use AWS tags to discover ASGs instead of command line arguments

The minion-manager currently uses the --scaling-groups command line argument to find the list of ASGs on which to operate. Every time the list has to be updated, the minion-manager deployment has to be updated and restarted. This is cumbersome and error-prone.

Instead, the minion-manager should take an AWS tag name and tag value pair as argument and "discover" the ASGs to operate upon. If the user wants to disable use of spot-instances, the user can simply modify the tags in AWS and the minion-manager pod should factor that in.

This is similar to the way the cluster-autoscaler pod runs.
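
A minimal sketch of tag-based discovery, assuming ASG descriptions are dictionaries shaped like DescribeAutoScalingGroups entries, each with a Tags list of Key/Value pairs. The function name and the tag key and value used in the usage note are hypothetical.

```python
def discover_asgs(asgs, tag_key, tag_value):
    """Return names of ASGs carrying the given tag key/value pair."""
    matched = []
    for asg in asgs:
        # Flatten the list of {"Key": ..., "Value": ...} dicts for lookup.
        tags = {t["Key"]: t["Value"] for t in asg.get("Tags", [])}
        if tags.get(tag_key) == tag_value:
            matched.append(asg["AutoScalingGroupName"])
    return matched
```

Calling discover_asgs(asgs, "k8s-minion-manager", "use-spot") on every loop iteration would pick up tag changes without restarting the deployment.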

On-demand instances get terminated all together

Currently, all on-demand instances in an ASG get terminated together when the minion-manager decides to use spot instances. This has its pros and cons. The benefit is that the new instances all come up in parallel, shortening the time for which on-demand instances run (and therefore keeping costs low). However, this can lead to service disruption.

Ideally, it should be possible to choose which termination strategy is used.

Maybe, add another tag?
k8s-minion-manager/num-simultaneous-terminations: 1 will terminate one instance at a time.
k8s-minion-manager/num-simultaneous-terminations: all will terminate all instances together.
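
A sketch of how the proposed tag value might be turned into termination batches. It assumes the value is either "all" or a positive integer written as a string; the function name is hypothetical.

```python
def termination_batches(instance_ids, setting):
    """Split instance IDs into batches according to the proposed tag value.

    `setting` is the tag value: "all" or a positive integer such as "1".
    """
    ids = list(instance_ids)
    if not ids:
        return []
    if setting == "all":
        return [ids]
    size = int(setting)
    if size < 1:
        raise ValueError("batch size must be at least 1")
    # Consecutive slices of `size` instances; the last may be shorter.
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

The caller would terminate one batch, wait for replacements to join, and then move to the next batch.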

Lots of DescribeSpotPriceHistory calls when there is an exception

When an exception occurs during DescribeSpotPriceHistory, there is no backoff, so minion-manager makes a lot of AWS calls.
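
One possible shape for the missing backoff is sketched below with only the standard library. The function name, default delays, and attempt limit are assumptions for illustration; the wrapped callable would be whatever issues the DescribeSpotPriceHistory request.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # capped at max_delay.
            sleep(min(max_delay, base_delay * (2 ** attempt)))
```

The sleep parameter exists so tests can record the delays instead of waiting.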

Will `schedule_instance_termination()` terminate instance just created?

While reading schedule_instance_termination() in aws_minion_manager.py, I found that it terminates instances that do not match the ASG's k8s-minion-manager value.
But I could not find where update_scaling_group() updates the ASG's k8s-minion-manager value.
When the spot price rises above the on-demand price, minion-manager updates the launch config to use on-demand, and updates lc_info and bid_info.
So I worry that it will keep terminating instances that were just launched after the price rose above the on-demand price.

BUG: Spot price is not updated based on LaunchConfig

When a LaunchConfig is changed, the spot price stays the same and, depending on the size of the instance type being switched to, this can prevent instances from launching.

The spot price should be updated based on the instance type specified in the LaunchConfig. It may also be necessary to maintain the previous spot pricing until all new instances have joined the ASG.

minion manager should support events only mode

Minion Manager should support a Kubernetes events-only mode:

- Publish structured JSON with the IG name, bid price, AWS region, etc.
- Minion Manager will only publish recommendations and will not take any action

Handle AZ-isolated capacity issues

Occasionally, an AZ may run out of spot capacity. When this happens, an ASG will temporarily spin up instances in other AZs if possible, and later attempt to rebalance instances across all AZs. If an AZ is still out of capacity or close to being out, AWS will still attempt to spin up instances in this AZ. We've noticed a lot of node churn when this happens: nodes are spun up before being yanked by AWS for instance-terminated-no-capacity. It would be nice if minion manager were able to suspend AZRebalance in these cases to avoid further churn.
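
A sketch of the suspend/resume step, assuming the client passed in is a boto3 autoscaling client, whose suspend_processes and resume_processes calls accept an ASG name and a list of scaling processes. The helper name is hypothetical, and deciding when an AZ counts as capacity-constrained is left out.

```python
def set_az_rebalance(autoscaling_client, asg_name, enabled):
    """Suspend or resume the AZRebalance process for one ASG."""
    kwargs = {
        "AutoScalingGroupName": asg_name,
        "ScalingProcesses": ["AZRebalance"],
    }
    if enabled:
        autoscaling_client.resume_processes(**kwargs)
    else:
        autoscaling_client.suspend_processes(**kwargs)
```

Suspending would happen when no-capacity terminations are observed, and resuming once the AZ has been stable for some period.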

on-demand price is 0.0000

I installed Kubernetes 1.10 and started minion-manager using the yaml file from the deploy folder. I tagged the ASG with "KubernetesCluster"="my-cluster-name" and "minion-manager"="on-spot". After some time the log shows:

INFO aws.minion-manager.bid-advisor MainThread: Using spot_instance price 0.013900, on-demand price 0.000000 for instance type: t2.medium, zones: ['us-east-1a', 'us-east-1b']

Why is that? There were no errors in the log.

How to change log to debug?

I am trying to do a POC with this tool but can't get it to work (I set the tags).

MM fails to discover/populate ASG, using weird endpoint for AWS API

Minion-Manager seems to be using a peculiar endpoint when talking to AWS Autoscaling API.
I've got the following setup:

K8s cluster: dev.rnd.pw
K8s nodes ASG: nodes.dev.rnd.pw
Route 53 Zone Record: *.dev.rnd.pw
- api.dev.rnd.pw -> CNAME -> K8s API ELB
- *.dev.rnd.pw -> CNAME -> Ingress ELB
- other records, not relevant to this issue
Ingress ELB with rnd.pw SSL Cert with additional alternative names: *.rnd.pw, *.dev.rnd.pw

When launching Minion-Manager, it seems to attempt to talk to something in *.dev.rnd.pw, according to the error message mentioning the certificate. I've no idea how it would resolve autoscaling.us-east-2.amazonaws.com via the *.dev.rnd.pw wildcard CNAME record.

Using shrinand/k8s-minion-manager:v0.2-dev

$ kubectl logs -f minion-manager-695dd4596f-nl5wd
2018-07-25T10:20:04 INFO minion_manager MainThread: Starting minion-manager for cluster: dev.rnd.pw, in region us-east-2 for cloud provider aws
2018-07-25T10:20:05 INFO aws_minion_manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
  File "./minion_manager.py", line 61, in <module>
    run()
  File "./minion_manager.py", line 57, in run
    minion_manager.run()
  File "/cloud_provider/aws/aws_minion_manager.py", line 495, in run
    str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'

Or using argoproj/minion-manager

$ kubectl logs -f minion-manager-dep-575cb9d695-7vrhl
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting ...
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: us-east-2
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: ['nodes.dev.rnd.pw']
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting minion-manager for scaling groups: ['nodes.dev.rnd.pw'], in region us-east-2 for cloud provider aws
2018-07-25T10:30:05 INFO aws.minion-manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
  File "/ax/bin/minion_manager", line 89, in <module>
    run()
  File "/ax/bin/minion_manager", line 79, in run
    minion_manager.run()
  File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 550, in run
    self.start()
  File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 152, in start
    str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'

As soon as I delete *.dev.rnd.pw DNS record, the problem disappears, and Minion-Manager discovers ASG just fine.

Master branch broken because of incorrect use of variable

Switching to on-demand instances from spot-instances is currently broken because of the following:

2019-03-26T05:38:29 ERROR aws_minion_manager MainThread: Failed while checking instances in ASG: global name 'spot_price' is not defined
Traceback (most recent call last):
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 593, in minion_manager_work
    self.update_scaling_group(asg_meta, new_bid_info)
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 330, in update_scaling_group
    self.create_lc_on_demand(new_lc_name, launch_config)
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/src/cloud_provider/aws/aws_minion_manager.py", line 270, in create_lc_on_demand
    SpotPrice=spot_price,
NameError: global name 'spot_price' is not defined

Upgrade the `pyca / cryptography` library to newer version

Bid threshold should be a parameter

Hi, I've done some testing on minion-manager and was really satisfied with the result, and I am thinking about deploying it to production 😄. However, I think a configurable threshold should be added, so we can determine how aggressive we want to be about spot prices relative to on-demand prices.

Something like:

parser.add_argument("--threshold", default=80, help="Max percentage to pay over OnDemand price")

Looking at the code, IMO it's a very simple change and I wonder if you think it's valid or not?
I could submit a PR if you're positive.

Thanks.
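
A sketch of how such a threshold could be wired up. It interprets the value as the maximum percentage of the on-demand price that may be paid for spot, which is an assumption since the proposal does not pin the semantics down. The function names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the proposed --threshold option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--threshold", type=int, default=80,
        help="Max percentage of the on-demand price to pay for spot")
    return parser


def should_use_spot(spot_price, on_demand_price, threshold_percent):
    """True when the spot price is within the threshold of on-demand."""
    if on_demand_price <= 0:
        # A zero on-demand price means the pricing data is unusable.
        return False
    return spot_price <= on_demand_price * threshold_percent / 100.0
```

With the default of 80 and an on-demand price of 0.0464, spot would be used only while its price stays at or below 0.03712.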

Setup PR builds for minion-manager

Switch to on-demand pricing per region

Example: ( us-west-2 ) URL
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.csv
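
Building the per-region URL from the example above could be as simple as the following sketch. The constant and function names are hypothetical; the URL pattern is the one shown in this issue with the region substituted.

```python
PRICING_URL = ("https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/"
               "AmazonEC2/current/{region}/index.csv")


def regional_pricing_url(region):
    """Return the on-demand pricing CSV URL for a single region."""
    if not region:
        raise ValueError("region must be non-empty")
    return PRICING_URL.format(region=region)
```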

Certain instance types have an on-demand price of 0.0000

I saw that these two issues below are similar to mine and were closed.
#15
#10

The bug still seems to be in the code and also affects m5a.2xlarge instances

I have a fix. What is the best way to share it? I am getting a 403 when I try to push up my branch.

Thanks!

Dockerfile Vulnerability

I don't like the unknown nature of the Dockerfile. Why is this not based on https://hub.docker.com/_/python/?tab=tags ?

SpotRecommendation events should include IG name

Current:

28m         Normal   SpotRecommendationGiven   SpotPriceInfo   {"apiVersion":"v1alpha1","spotPrice":"", "useSpot": false}

Expected:

28m         Normal   SpotRecommendationGiven   SpotPriceInfo   {"apiVersion":"v1alpha1","spotPrice":"0.90", "useSpot": false, "instanceGroup": "node123"}
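
A small sketch of producing the expected message body. The field names are taken from the example above; the function name is hypothetical, and how the message is attached to a Kubernetes event is not shown.

```python
import json


def spot_recommendation_message(instance_group, spot_price, use_spot):
    """Build the JSON body for a SpotRecommendationGiven event."""
    return json.dumps({
        "apiVersion": "v1alpha1",
        "spotPrice": spot_price,      # kept as a string, as in the example
        "useSpot": use_spot,
        "instanceGroup": instance_group,
    })
```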

Does minion-manager support mixed instance?

I have a cluster that uses an ASG with mixed instances (on-demand + spot instances). I want to use minion-manager only to autoscale the spot instances, without touching the on-demand ones. Is this possible with the current minion-manager?

Config-file with config options

The minion-manager has a few configuration options and could use a few more, e.g.:

- Name of the cluster
- Region
- Number of instances to terminate in parallel
- Time to sleep between terminating instances

It would be better to have these options in a config file and make the minion-manager use that file instead of command line args.
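
A minimal sketch of loading such a file, assuming a JSON format. The option names and defaults below are hypothetical and only mirror the four options listed above.

```python
import json

DEFAULTS = {
    "cluster_name": None,
    "region": None,
    "parallel_terminations": 1,
    "sleep_between_terminations_seconds": 0,
}


def load_config(text):
    """Parse JSON config text and fill in defaults for missing options."""
    loaded = json.loads(text)
    unknown = set(loaded) - set(DEFAULTS)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise ValueError("unknown config options: {}".format(sorted(unknown)))
    config = dict(DEFAULTS)
    config.update(loaded)
    return config
```

Rejecting unknown keys is a design choice here; it surfaces misspelled option names at startup.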

k8s-minion-manager should show money spent/saved

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

What happened:
k8s-minion-manager can intelligently switch between spot and on-demand instances. However, it doesn't provide information about how much money has been saved because of it. It will be good if the addon can provide that information.

What you expected to happen:
There should be an easy way of seeing the money spent and saved on a per IG basis.
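
The arithmetic behind such a report might look like the sketch below. It assumes usage records of (IG name, instance hours, spot price, on-demand price) are available from somewhere; the record shape and function name are hypothetical.

```python
def savings_per_group(usage):
    """Summarize money spent and saved per instance group.

    `usage` is an iterable of tuples:
    (ig_name, instance_hours, spot_price, on_demand_price).
    """
    report = {}
    for ig_name, hours, spot_price, on_demand_price in usage:
        entry = report.setdefault(ig_name, {"spent": 0.0, "saved": 0.0})
        entry["spent"] += hours * spot_price
        # Savings are measured against running the same hours on-demand.
        entry["saved"] += hours * (on_demand_price - spot_price)
    return report
```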
