keikoproj / minion-manager
Intelligent use of Spot Instances in Kubernetes
License: Apache License 2.0
Looks like if we tag an ASG using a LaunchTemplate with the minion manager tag, it causes the controller to crash.
Regardless of whether we would like to support launch templates in the future, we should probably avoid crashing if the LaunchConfigurationName field is nil.
Traceback (most recent call last):
File "./minion_manager.py", line 62, in <module>
run()
File "./minion_manager.py", line 58, in run
minion_manager.run()
File "/src/cloud_provider/aws/aws_minion_manager.py", line 727, in run
str(ex))
Exception: Failed to discover/populate current ASG info: LaunchConfigurationName
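One way to avoid the crash, sketched under assumptions: the ASG descriptions are plain dicts shaped like the `AutoScalingGroups` entries returned by the AWS API, and the helper name `split_asgs_by_launch_source` is hypothetical (it is not part of minion-manager). ASGs without a `LaunchConfigurationName` are skipped rather than indexed into.

```python
def split_asgs_by_launch_source(asgs):
    """Partition ASG descriptions into (supported, skipped_names).

    Hypothetical helper: ASGs that use a launch template carry no
    LaunchConfigurationName key, so .get() is used instead of indexing.
    """
    supported, skipped_names = [], []
    for asg in asgs:
        if asg.get("LaunchConfigurationName"):
            supported.append(asg)
        else:
            # Skip (and let the caller log) instead of raising KeyError.
            skipped_names.append(asg["AutoScalingGroupName"])
    return supported, skipped_names
```

The caller could log the skipped names as a warning and carry on with the supported groups.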
We noticed that spot instances were terminated without any bid or spot price changes. It seems AWS can terminate them and not provide new ones immediately. We may need to switch to on-demand when this happens. It will be a little hard to decide when, but we need some mechanism.
Minion manager should react to the spot instance termination event and switch to On Demand to catch very rapid bursts in Spot Price.
When the minion-manager switches between on-demand and spot instances, it currently simply terminates the nodes. It would be good if the termination were preceded by cordoning and draining the node so that the pods on that node can move to a different node. This might also reduce the downtime (if any) that the apps might face.
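A minimal sketch of the cordon-then-drain step, assuming `kubectl` is available on the PATH and configured for the cluster. The function name and the injectable `run` parameter are hypothetical, not part of minion-manager.

```python
import subprocess


def cordon_and_drain(node_name, run=subprocess.check_call, timeout_seconds=300):
    """Cordon a node, then drain it, before the instance is terminated.

    `run` is injectable so the function can be exercised without a cluster.
    """
    # Mark the node unschedulable. `kubectl drain` also cordons, but the
    # explicit step mirrors the order described in the issue.
    run(["kubectl", "cordon", node_name])
    # Evict pods; DaemonSet-managed pods are ignored because they cannot
    # be rescheduled onto another node.
    run([
        "kubectl", "drain", node_name,
        "--ignore-daemonsets",
        "--timeout={}s".format(timeout_seconds),
    ])
```

Termination of the instance would follow only after the drain returns successfully.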
Python2 was EOL'ed earlier this year and thus it may be worth switching to Python3 to avoid potential security issues.
The current minion-manager docker image is under a personal docker-hub account. It should be moved under argoproj with the rest of the images.
Currently, running make runs the unit tests, which require valid AWS credentials. This is because the unit tests invoke the boto APIs that actually make AWS API calls. Instead, the appropriate calls should be mocked with mocker or moto. This will also reduce the time required for the tests to run.
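As an illustration of the idea, here is a small sketch that uses the standard library's `unittest.mock` (swapped in for mocker or moto to keep the example dependency-free). The function `list_asg_names` is hypothetical; the only assumption about AWS is the response shape of `describe_auto_scaling_groups`.

```python
from unittest import mock


def list_asg_names(autoscaling_client):
    """Return the names of all ASGs reported by the given client."""
    response = autoscaling_client.describe_auto_scaling_groups()
    return [group["AutoScalingGroupName"] for group in response["AutoScalingGroups"]]


def test_list_asg_names_without_aws():
    # The fake client stands in for a boto client, so no credentials
    # and no network access are needed.
    fake_client = mock.Mock()
    fake_client.describe_auto_scaling_groups.return_value = {
        "AutoScalingGroups": [
            {"AutoScalingGroupName": "nodes-a"},
            {"AutoScalingGroupName": "nodes-b"},
        ]
    }
    assert list_asg_names(fake_client) == ["nodes-a", "nodes-b"]
    fake_client.describe_auto_scaling_groups.assert_called_once_with()
```

Passing the client in as an argument is what makes the substitution possible; code that creates its client internally would need patching instead.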
Good Afternoon,
I was just wondering if it was possible to add a feature to switch to on-demand instances when capacity is unavailable and the request could not be fulfilled?
We recently found that the minion-manager was not switching from on-demand to spot instances because it saw that the spot-instance price was greater than the on-demand price. The on-demand instance price was seen to be 0.000 :-).
It turns out that this was happening because the AWS on-demand pricing endpoint was returning multiple values for that instance type in the same region. The current mechanism for gathering the on-demand instance price simply takes the price of the last entry for that instance type, and that price can be 0.00.
This is what we got in one case:
{'SKU': '2N2QH6UEJZ5GUPT8' | 'OfferingClass': '' | 'Group': '' | 'Instance Capacity - xlarge': '' | 'Instance Capacity - 16xlarge': '' | 'PricePerUnit': '0.0464000000' | 'PriceDescription': '$0.0464 per On Demand Linux t2.medium Instance Hour' | 'Storage': 'EBS only' | 'Pre Installed S/W': 'NA' | 'Instance': '' | 'Normalization Size Factor': '2' | 'Location': 'US East (Ohio)' | 'Memory': '4 GiB' | 'Physical Processor': 'Intel Xeon Family' | 'operation': 'RunInstances' | 'Dedicated EBS Throughput': '' | 'Instance Capacity - 10xlarge': '' | 'Instance Capacity - 4xlarge': '' | 'To Location': '' | 'From Location': '' | 'Operating System': 'Linux' | 'Product Family': 'Compute Instance' | 'GPU': '' | 'Intel Turbo Available': '' | 'Intel AVX Available': '' | 'Max IOPS Burst Performance': '' | 'Instance Capacity - 32xlarge': '' | 'ECU': 'Variable' | 'Tenancy': 'Shared' | 'Instance Capacity - 18xlarge': '' | 'OfferTermCode': 'JRTCKXETXF' | 'Instance Capacity - 9xlarge': '' | 'Instance Capacity - 8xlarge': '' | 'Processor Architecture': '32-bit or 64-bit' | 'EBS Optimized': '' | 'Group Description': '' | 'Provisioned': '' | 'Location Type': 'AWS Region' | 'EffectiveDate': '2018-10-01' | 'License Model': 'No License required' | 'vCPU': '2' | 'TermType': 'OnDemand' | 'instanceSKU': '' | 'PurchaseOption': '' | 'Instance Type': 't2.medium' | 'Instance Capacity - 2xlarge': '' | 'LeaseContractLength': '' | 'Instance Capacity - large': '' | 'StartingRange': '0' | 'Max IOPS/volume': '' | 'Max throughput/volume': '' | 'To Location Type': '' | 'Processor Features': 'Intel AVX; Intel Turbo' | 'Intel AVX2 Available': '' | 'GPU Memory': '' | 'serviceName': 'Amazon Elastic Compute Cloud' | 'Network Performance': 'Low to Moderate' | 'Max Volume Size': '' | 'CapacityStatus': 'Used' | 'Instance Capacity - 12xlarge': '' | 'Transfer Type': '' | 'Elastic GPU Type': '' | 'usageType': 'USE2-BoxUsage:t2.medium' | 'RateCode': '2N2QH6UEJZ5GUPT8.JRTCKXETXF.6YS6EN2CT7' | 'Instance Capacity - 
24xlarge': '' | 'Instance Family': 'General purpose' | 'Currency': 'USD' | 'Enhanced Networking Supported': '' | 'serviceCode': 'AmazonEC2' | 'Physical Cores': '' | 'Instance Capacity - medium': '' | 'Volume Type': '' | 'Storage Media': '' | 'EndingRange': 'Inf' | 'Clock Speed': 'Up to 3.3 GHz' | 'From Location Type': '' | 'Unit': 'Hrs' | 'Current Generation': 'Yes'} |
{'SKU': 'QT7848TA4YHDW5JE' | 'OfferingClass': '' | 'Group': '' | 'Instance Capacity - xlarge': '' | 'Instance Capacity - 16xlarge': '' | 'PricePerUnit': '0.0464000000' | 'PriceDescription': '$0.0464 per Unused Reservation Linux t2.medium Instance Hour' | 'Storage': 'EBS only' | 'Pre Installed S/W': 'NA' | 'Instance': '' | 'Normalization Size Factor': '2' | 'Location': 'US East (Ohio)' | 'Memory': '4 GiB' | 'Physical Processor': 'Intel Xeon Family' | 'operation': 'RunInstances' | 'Dedicated EBS Throughput': '' | 'Instance Capacity - 10xlarge': '' | 'Instance Capacity - 4xlarge': '' | 'To Location': '' | 'From Location': '' | 'Operating System': 'Linux' | 'Product Family': 'Compute Instance' | 'GPU': '' | 'Intel Turbo Available': '' | 'Intel AVX Available': '' | 'Max IOPS Burst Performance': '' | 'Instance Capacity - 32xlarge': '' | 'ECU': 'Variable' | 'Tenancy': 'Shared' | 'Instance Capacity - 18xlarge': '' | 'OfferTermCode': 'JRTCKXETXF' | 'Instance Capacity - 9xlarge': '' | 'Instance Capacity - 8xlarge': '' | 'Processor Architecture': '32-bit or 64-bit' | 'EBS Optimized': '' | 'Group Description': '' | 'Provisioned': '' | 'Location Type': 'AWS Region' | 'EffectiveDate': '2018-10-01' | 'License Model': 'No License required' | 'vCPU': '2' | 'TermType': 'OnDemand' | 'instanceSKU': '2N2QH6UEJZ5GUPT8' | 'PurchaseOption': '' | 'Instance Type': 't2.medium' | 'Instance Capacity - 2xlarge': '' | 'LeaseContractLength': '' | 'Instance Capacity - large': '' | 'StartingRange': '0' | 'Max IOPS/volume': '' | 'Max throughput/volume': '' | 'To Location Type': '' | 'Processor Features': 'Intel AVX; Intel Turbo' | 'Intel AVX2 Available': '' | 'GPU Memory': '' | 'serviceName': 'Amazon Elastic Compute Cloud' | 'Network Performance': 'Low to Moderate' | 'Max Volume Size': '' | 'CapacityStatus': 'UnusedCapacityReservation' | 'Instance Capacity - 12xlarge': '' | 'Transfer Type': '' | 'Elastic GPU Type': '' | 'usageType': 'USE2-UnusedBox:t2.medium' | 'RateCode': 
'QT7848TA4YHDW5JE.JRTCKXETXF.6YS6EN2CT7' | 'Instance Capacity - 24xlarge': '' | 'Instance Family': 'General purpose' | 'Currency': 'USD' | 'Enhanced Networking Supported': '' | 'serviceCode': 'AmazonEC2' | 'Physical Cores': '' | 'Instance Capacity - medium': '' | 'Volume Type': '' | 'Storage Media': '' | 'EndingRange': 'Inf' | 'Clock Speed': 'Up to 3.3 GHz' | 'From Location Type': '' | 'Unit': 'Hrs' | 'Current Generation': 'Yes'} |
{'SKU': 'PRCADQFUQ6HZKBHK' | 'OfferingClass': '' | 'Group': '' | 'Instance Capacity - xlarge': '' | 'Instance Capacity - 16xlarge': '' | 'PricePerUnit': '0.0000000000' | 'PriceDescription': '$0.00 per Reservation Linux t2.medium Instance Hour' | 'Storage': 'EBS only' | 'Pre Installed S/W': 'NA' | 'Instance': '' | 'Normalization Size Factor': '2' | 'Location': 'US East (Ohio)' | 'Memory': '4 GiB' | 'Physical Processor': 'Intel Xeon Family' | 'operation': 'RunInstances' | 'Dedicated EBS Throughput': '' | 'Instance Capacity - 10xlarge': '' | 'Instance Capacity - 4xlarge': '' | 'To Location': '' | 'From Location': '' | 'Operating System': 'Linux' | 'Product Family': 'Compute Instance' | 'GPU': '' | 'Intel Turbo Available': '' | 'Intel AVX Available': '' | 'Max IOPS Burst Performance': '' | 'Instance Capacity - 32xlarge': '' | 'ECU': 'Variable' | 'Tenancy': 'Shared' | 'Instance Capacity - 18xlarge': '' | 'OfferTermCode': 'JRTCKXETXF' | 'Instance Capacity - 9xlarge': '' | 'Instance Capacity - 8xlarge': '' | 'Processor Architecture': '32-bit or 64-bit' | 'EBS Optimized': '' | 'Group Description': '' | 'Provisioned': '' | 'Location Type': 'AWS Region' | 'EffectiveDate': '2018-10-01' | 'License Model': 'No License required' | 'vCPU': '2' | 'TermType': 'OnDemand' | 'instanceSKU': '2N2QH6UEJZ5GUPT8' | 'PurchaseOption': '' | 'Instance Type': 't2.medium' | 'Instance Capacity - 2xlarge': '' | 'LeaseContractLength': '' | 'Instance Capacity - large': '' | 'StartingRange': '0' | 'Max IOPS/volume': '' | 'Max throughput/volume': '' | 'To Location Type': '' | 'Processor Features': 'Intel AVX; Intel Turbo' | 'Intel AVX2 Available': '' | 'GPU Memory': '' | 'serviceName': 'Amazon Elastic Compute Cloud' | 'Network Performance': 'Low to Moderate' | 'Max Volume Size': '' | 'CapacityStatus': 'AllocatedCapacityReservation' | 'Instance Capacity - 12xlarge': '' | 'Transfer Type': '' | 'Elastic GPU Type': '' | 'usageType': 'USE2-Reservation:t2.medium' | 'RateCode': 
'PRCADQFUQ6HZKBHK.JRTCKXETXF.6YS6EN2CT7' | 'Instance Capacity - 24xlarge': '' | 'Instance Family': 'General purpose' | 'Currency': 'USD' | 'Enhanced Networking Supported': '' | 'serviceCode': 'AmazonEC2' | 'Physical Cores': '' | 'Instance Capacity - medium': '' | 'Volume Type': '' | 'Storage Media': '' | 'EndingRange': 'Inf' | 'Clock Speed': 'Up to 3.3 GHz' | 'From Location Type': '' | 'Unit': 'Hrs' | 'Current Generation': 'Yes'} |
The difference between the three entries is the price description.
Basically, minion-manager currently only supports on-demand instances (it does not support Reserved Instances). Therefore, only the "On Demand" price description from the above is relevant. But the current implementation of the price querying API does not factor this in.
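A possible way to pick the right entry, sketched with the field names visible in the rows above (`Instance Type`, `TermType`, `CapacityStatus`, `PriceDescription`, `PricePerUnit`). The function name `on_demand_price` is hypothetical and the filter conditions are an assumption about what distinguishes the plain on-demand row, based only on the three entries shown.

```python
def on_demand_price(rows, instance_type):
    """Return the plain on-demand hourly price, or None if no row matches.

    Rows for capacity reservations (including the $0.00 one) are skipped
    by requiring CapacityStatus == 'Used' and an 'On Demand' description.
    """
    for row in rows:
        if (
            row.get("Instance Type") == instance_type
            and row.get("TermType") == "OnDemand"
            and row.get("CapacityStatus") == "Used"
            and "On Demand" in row.get("PriceDescription", "")
        ):
            price = float(row["PricePerUnit"])
            if price > 0:
                return price
    return None


# Rows abbreviated from the three entries above.
rows = [
    {"Instance Type": "t2.medium", "TermType": "OnDemand", "CapacityStatus": "Used",
     "PricePerUnit": "0.0464000000",
     "PriceDescription": "$0.0464 per On Demand Linux t2.medium Instance Hour"},
    {"Instance Type": "t2.medium", "TermType": "OnDemand",
     "CapacityStatus": "UnusedCapacityReservation", "PricePerUnit": "0.0464000000",
     "PriceDescription": "$0.0464 per Unused Reservation Linux t2.medium Instance Hour"},
    {"Instance Type": "t2.medium", "TermType": "OnDemand",
     "CapacityStatus": "AllocatedCapacityReservation", "PricePerUnit": "0.0000000000",
     "PriceDescription": "$0.00 per Reservation Linux t2.medium Instance Hour"},
]
print(on_demand_price(rows, "t2.medium"))
```

Returning `None` instead of 0.0 when nothing matches lets the caller treat a missing price as "unknown" rather than "free". A real pricing file also contains rows for other operating systems and tenancies, which would need additional filters.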
To start with:
The minion-manager currently uses the --scaling-groups command line argument to find the list of ASGs on which to operate. Every time the list has to be updated, the minion-manager deployment has to be updated and restarted. This is cumbersome and error-prone.
Instead, the minion-manager should take an AWS tag name and tag value pair as arguments and "discover" the ASGs to operate upon. If the user wants to disable the use of spot instances, the user can simply modify the tags in AWS and the minion-manager pod should factor that in.
This is similar to the way the cluster-autoscaler pod runs.
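The discovery step could look roughly like the sketch below. It operates on a list of ASG descriptions shaped like the `AutoScalingGroups` entries of the AWS API (each with a `Tags` list of `Key`/`Value` dicts). The function name, and the tag key and value used in the usage example, are hypothetical.

```python
def discover_asgs(asgs, tag_key, tag_value):
    """Return the names of the ASGs whose tags contain tag_key=tag_value."""
    names = []
    for asg in asgs:
        # Flatten the AWS-style tag list into a plain dict for lookup.
        tags = {tag["Key"]: tag["Value"] for tag in asg.get("Tags", [])}
        if tags.get(tag_key) == tag_value:
            names.append(asg["AutoScalingGroupName"])
    return names


# Usage with hypothetical tag names. In a real run the list would come
# from the describe_auto_scaling_groups API, re-read on every loop so
# that tag changes in AWS take effect without a restart.
example_asgs = [
    {"AutoScalingGroupName": "nodes-a",
     "Tags": [{"Key": "example-minion-manager", "Value": "use-spot"}]},
    {"AutoScalingGroupName": "nodes-b",
     "Tags": [{"Key": "example-minion-manager", "Value": "no-spot"}]},
    {"AutoScalingGroupName": "nodes-c", "Tags": []},
]
print(discover_asgs(example_asgs, "example-minion-manager", "use-spot"))
```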
Currently, all on-demand instances in an ASG get terminated together when the minion-manager decides to use spot instances. This has its pros and cons. The benefit is that the new instances all come up in parallel, shortening the time for which on-demand instances run (and therefore keeping costs low). However, this can lead to service disruption.
Ideally, it should be possible to choose which termination strategy is used.
Maybe add another tag?
k8s-minion-manager/num-simultaneous-terminations: 1 will terminate one instance at a time.
k8s-minion-manager/num-simultaneous-terminations: all will terminate all instances together.
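A sketch of how the proposed tag value might be turned into termination batches. The function name is hypothetical, and accepting any positive integer (not only `1`) is an assumption that goes slightly beyond the two values suggested above.

```python
def termination_batches(instance_ids, tag_value):
    """Split instance IDs into batches according to the tag value.

    tag_value is the string stored in the proposed
    k8s-minion-manager/num-simultaneous-terminations tag:
    'all' or a positive integer.
    """
    ids = list(instance_ids)
    if not ids:
        return []
    if tag_value == "all":
        size = len(ids)
    else:
        size = int(tag_value)
        if size < 1:
            raise ValueError("batch size must be at least 1")
    # Consecutive slices of at most `size` instances each.
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

The caller would terminate one batch, wait for the replacements to join the cluster, and only then move on to the next batch.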
When an exception occurs during DescribeSpotPriceHistory, there is no backoff, so minion-manager makes a lot of AWS calls.
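A minimal sketch of exponential backoff with jitter, written in plain Python rather than with any particular retry library. The function name, defaults, and the injectable `sleep` parameter are assumptions for illustration.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    The upper bound of the wait doubles after every failure (capped at
    max_delay) and the actual wait is drawn uniformly below that bound.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            bound = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, bound))
```

Wrapping the spot price history call in such a helper would space out the retries instead of hitting the API in a tight loop.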
When I read schedule_instance_termination() in aws_minion_manager.py, I found that it will terminate instances if they do not match the ASG's k8s-minion-manager value.
But I didn't find that update_scaling_group() updates the ASG's k8s-minion-manager value.
When the spot price goes over the on-demand price, mm will update the launch config to use on-demand, and update lc_info and bid_info.
So I worry: will it keep terminating instances that were just launched after the price rose over the on-demand price?
When a LaunchConfig is changed, the spot price stays the same and, depending on the size of the instance type being switched to, this can prevent instances from launching.
The spot price should be updated based on the instance type specified in the LaunchConfig. You might also need to maintain the previous spot pricing until all new instances have joined the ASG.
Minion Manager should support Kubernetes events only mode
Occasionally, an AZ may run out of spot capacity. When this happens, an ASG will temporarily spin up instances in other AZs if possible, and later attempt to rebalance instances across all AZs. If an AZ is still out of capacity or close to being out, AWS will still attempt to spin up instances in this AZ. We've noticed a lot of node churn when this happens: nodes are spun up before being yanked by AWS for instance-terminated-no-capacity. It would be nice if minion manager were able to suspend AzRebalance in these cases to avoid further churn.
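For reference, suspending the rebalancing process through the Auto Scaling API might look like the sketch below. It assumes a boto3-style Auto Scaling client is passed in; `suspend_processes` and `resume_processes` are the API operations, and the process name in the API is spelled `AZRebalance`. The wrapper function names are hypothetical.

```python
AZ_REBALANCE = "AZRebalance"  # process name as spelled by the Auto Scaling API


def suspend_az_rebalance(autoscaling_client, asg_name):
    """Stop the ASG from rebalancing instances across AZs."""
    autoscaling_client.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=[AZ_REBALANCE],
    )


def resume_az_rebalance(autoscaling_client, asg_name):
    """Re-enable AZ rebalancing once spot capacity has recovered."""
    autoscaling_client.resume_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=[AZ_REBALANCE],
    )
```

Deciding when to suspend (for example after seeing repeated instance-terminated-no-capacity terminations) and when to resume is the harder part and is left open here.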
I installed Kubernetes 1.10 and started minion-manager using the yaml file from the deploy folder. I tagged the ASG with "KubernetesCluster"="my-cluster-name" and "minion-manager"="on-spot". After some time the log shows
INFO aws.minion-manager.bid-advisor MainThread: Using spot_instance price 0.013900, on-demand price 0.000000 for instance type: t2.medium, zones: ['us-east-1a', 'us-east-1b'].
Why is that? There were no errors in the log.
I tried to do a POC with this tool but can't make it work (I set the tags).
Minion-Manager seems to be using a peculiar endpoint when talking to AWS Autoscaling API.
I've got the following setup:
dev.rnd.pw
nodes.dev.rnd.pw
*.dev.rnd.pw
api.dev.rnd.pw -> CNAME -> K8s API ELB
*.dev.rnd.pw -> CNAME -> Ingress ELB
Ingress ELB with rnd.pw SSL Cert with additional alternative names: *.rnd.pw, *.dev.rnd.pw
When launching Minion-Manager, it seems to attempt to talk to something in *.dev.rnd.pw, according to the error message mentioning the certificate. I've no idea how it would resolve autoscaling.us-east-2.amazonaws.com via the *.dev.rnd.pw wildcard CNAME record.
Using shrinand/k8s-minion-manager:v0.2-dev
$ kubectl logs -f minion-manager-695dd4596f-nl5wd
2018-07-25T10:20:04 INFO minion_manager MainThread: Starting minion-manager for cluster: dev.rnd.pw, in region us-east-2 for cloud provider aws
2018-07-25T10:20:05 INFO aws_minion_manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
File "./minion_manager.py", line 61, in <module>
run()
File "./minion_manager.py", line 57, in run
minion_manager.run()
File "/cloud_provider/aws/aws_minion_manager.py", line 495, in run
str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'
Or using argoproj/minion-manager
$ kubectl logs -f minion-manager-dep-575cb9d695-7vrhl
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting ...
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: us-east-2
2018-07-25T10:30:05 INFO minion_manager MainThread: Using config from env: ['nodes.dev.rnd.pw']
2018-07-25T10:30:05 INFO minion_manager MainThread: Starting minion-manager for scaling groups: ['nodes.dev.rnd.pw'], in region us-east-2 for cloud provider aws
2018-07-25T10:30:05 INFO aws.minion-manager MainThread: Running AWS Minion Manager
Traceback (most recent call last):
File "/ax/bin/minion_manager", line 89, in <module>
run()
File "/ax/bin/minion_manager", line 79, in run
minion_manager.run()
File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 550, in run
self.start()
File "/ax/python/ax/platform/minion_manager/cloud_provider/aws/aws_minion_manager.py", line 152, in start
str(ex))
Exception: Failed to discover/populate current ASG info: hostname 'autoscaling.us-east-2.amazonaws.com' doesn't match either of 'rnd.pw', '*.rnd.pw', '*.dev.rnd.pw'
As soon as I delete the *.dev.rnd.pw DNS record, the problem disappears, and Minion-Manager discovers the ASG just fine.
Switching to on-demand instances from spot-instances is currently broken because of the following:
2019-03-26T05:38:29 ERROR aws_minion_manager MainThread: Failed while checking instances in ASG: global name 'spot_price' is not defined
Traceback (most recent call last):
File "/src/cloud_provider/aws/aws_minion_manager.py", line 593, in minion_manager_work
self.update_scaling_group(asg_meta, new_bid_info)
File "/src/cloud_provider/aws/aws_minion_manager.py", line 330, in update_scaling_group
self.create_lc_on_demand(new_lc_name, launch_config)
File "/usr/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python2.7/site-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/local/lib/python2.7/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python2.7/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/src/cloud_provider/aws/aws_minion_manager.py", line 270, in create_lc_on_demand
SpotPrice=spot_price,
NameError: global name 'spot_price' is not defined
Hi, I've done some testing on minion-manager and was really satisfied with the result. I'm thinking about putting it into production, but I think a configurable threshold should be added, so we can determine how aggressive we want to be about prices relative to OnDemand instances.
Something like:
parser.add_argument("--threshold", default=80, help="Max percentage to pay over OnDemand price")
Looking at the code, IMO it's a very simple change and I wonder if you think it's valid or not?
I could submit a PR if you're positive.
Thanks.
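To make the proposal concrete, here is one possible reading of it as a sketch: spot is used only while the spot price is at most `threshold` percent of the on-demand price. That interpretation, and the names `build_parser` and `should_use_spot`, are assumptions; only the `--threshold` argument comes from the suggestion above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # type=float so the value can be used directly in the comparison.
    parser.add_argument("--threshold", type=float, default=80,
                        help="Max percentage to pay over OnDemand price")
    return parser


def should_use_spot(spot_price, on_demand_price, threshold_percent):
    """True while the spot price is within the configured percentage
    of the on-demand price (assumed interpretation of the threshold)."""
    return spot_price <= on_demand_price * threshold_percent / 100.0
```

With a threshold of 50 and an on-demand price of 0.0464, a spot price of 0.02 would still qualify while 0.03 would not.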
Switch to on-demand pricing per region
Example: ( us-west-2 ) URL
https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/us-west-2/index.csv
I don't like the unknown nature of the Dockerfile. Why is this not based on https://hub.docker.com/_/python/?tab=tags?
SpotRecommendation events should include IG name
Current:
28m Normal SpotRecommendationGiven SpotPriceInfo {"apiVersion":"v1alpha1","spotPrice":"", "useSpot": false}
Expected:
28m Normal SpotRecommendationGiven SpotPriceInfo {"apiVersion":"v1alpha1","spotPrice":"0.90", "useSpot": false, "instanceGroup": "node123"}
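The expected payload could be assembled along these lines. This is only a sketch: the function name is hypothetical, and the field names are taken from the expected event shown above.

```python
import json


def spot_recommendation_message(spot_price, use_spot, instance_group):
    """Build the JSON message for a SpotRecommendationGiven event,
    including the instance group name."""
    return json.dumps({
        "apiVersion": "v1alpha1",
        "spotPrice": spot_price,          # kept as a string, as in the event above
        "useSpot": use_spot,
        "instanceGroup": instance_group,
    })
```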
Does minion-manager support mixed instances?
I have a cluster that uses an ASG with mixed instances (on-demand + spot instances, reference). I want to use minion-manager only to autoscale the spot instances, without touching the on-demand ones. Is this possible with the current minion-manager?
The minion-manager has a few configuration options and could use a few more. E.g.
It will be better to have these options in a config file and make the minion-manager use that file instead of some command line args.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
k8s-minion-manager can intelligently switch between spot and on-demand instances. However, it doesn't provide information about how much money has been saved because of it. It will be good if the addon can provide that information.
What you expected to happen:
There should be an easy way of seeing the money spent and saved on a per IG basis.