hellofresh / eks-rolling-update
EKS Rolling Update is a utility for updating the launch configuration of worker nodes in an EKS cluster.
License: Apache License 2.0
The documentation in README.md states that ASG_DESIRED_STATE_TAG, ASG_ORIG_CAPACITY_TAG, and ASG_ORIG_MAX_CAPACITY_TAG can be specified from the environment, but in https://github.com/hellofresh/eks-rolling-update/blob/master/eksrollup/config.py these variables are hardcoded.
Is there a reason for this?
Even though the tags are temporary, I'd like to customize them.
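A minimal sketch of what the requested change could look like in config.py: read each tag name from the environment and fall back to the current defaults (the default values below are taken from this tool's own log output; the exact structure of config.py is an assumption).

```python
import os

# Hedged sketch, not the project's actual config.py: allow the ASG tag names
# to be overridden from the environment, defaulting to the hardcoded values
# that appear in eks-rolling-update's logs.
ASG_DESIRED_STATE_TAG = os.environ.get(
    'ASG_DESIRED_STATE_TAG', 'eks-rolling-update:desired_capacity')
ASG_ORIG_CAPACITY_TAG = os.environ.get(
    'ASG_ORIG_CAPACITY_TAG', 'eks-rolling-update:original_capacity')
ASG_ORIG_MAX_CAPACITY_TAG = os.environ.get(
    'ASG_ORIG_MAX_CAPACITY_TAG', 'eks-rolling-update:original_max_capacity')
```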
Setting the env var K8S_AUTOSCALER_ENABLED to False will not try to scale down the autoscaler (expected), but the tool still tries to resume it at the end of the process (not expected), failing with this error:
2019-09-10 17:49:57,012 INFO Describing autoscaling groups...
2019-09-10 17:49:57,468 INFO
2019-09-10 17:49:57,468 INFO **** Starting rolling update for autoscaling group eks_tmp_cluster ****
2019-09-10 17:49:57,468 INFO Instance id i-060706ffa2de48e75 : OK
2019-09-10 17:49:57,468 INFO Instance id i-0bf549dc2949f1d67 : OK
2019-09-10 17:49:57,468 INFO Found 0 outdated instances
2019-09-10 17:49:57,469 INFO
2019-09-10 17:49:57,469 INFO **** Starting rolling update for autoscaling group eks_tmp_cluster-green ****
2019-09-10 17:49:57,469 INFO Instance id i-06f0d8f693e0a3340 : OK
2019-09-10 17:49:57,469 INFO Instance id i-0e5b6c0c6bae84131 : OK
2019-09-10 17:49:57,469 INFO Found 0 outdated instances
2019-09-10 17:49:57,469 INFO All asgs processed
2019-09-10 17:49:57,984 INFO Resuming k8s autoscaler...
2019-09-10 17:49:57,984 INFO Missing the required parameter `name` when calling `patch_namespaced_deployment`
2019-09-10 17:49:57,984 INFO *** Rolling update of asg has failed. Exiting ***
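A minimal sketch of the guard this report implies should exist: the resume step at the end of the run should be skipped when the autoscaler was never paused. The function and setting names (`finish_run`, `modify_k8s_autoscaler`, the config dict) are illustrative, not the project's exact code.

```python
# Hypothetical sketch of the expected behaviour: only resume the k8s
# autoscaler at the end of the run if it was actually paused at the start.
def finish_run(app_config, modify_k8s_autoscaler):
    if app_config['K8S_AUTOSCALER_ENABLED']:
        modify_k8s_autoscaler('resume')
    # otherwise there is nothing to resume, since nothing was paused
```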
Hi, this is Srinivasa. I created an EKS cluster in AWS using eksctl. By default it creates the cluster with a public API server endpoint, but I need to change this to private. After switching it to private from the AWS console, I can no longer access the cluster from the machine where I installed kubectl and eksctl; I get an error along the lines of `tcp <ip>:443 i/o timeout`. That machine is in a private subnet, and all my worker nodes are in private subnets too, but I don't know why I am getting this error. Please help me troubleshoot; I will provide any additional information needed.
EKS version: 1.15
Thank you
Even after increasing the values of GLOBAL_HEALTH_RETRY and/or GLOBAL_HEALTH_WAIT, I get the error below:
2021-01-29 03:00:21,729 INFO Found 3 outdated instances
2021-01-29 03:00:22,349 INFO Getting k8s nodes...
2021-01-29 03:00:22,743 INFO Current k8s node count is 3
2021-01-29 03:00:22,743 INFO Setting the scale of ASG dev-gvh-worker19940502290119910100001011 based on 3 outdated instances.
2021-01-29 03:00:22,743 INFO Modifying asg dev-gvh-worker19940502290119910100001011 autoscaling to resume ...
2021-01-29 03:00:22,976 INFO Found previous desired capacity value tag set on asg from a previous run.
2021-01-29 03:00:22,976 INFO Maintaining previous capacity of 3 to not overscale.
2021-01-29 03:00:22,976 INFO Waiting for 90 seconds before validating cluster health...
2021-01-29 03:01:52,981 INFO Checking asg dev-gvh-worker19940502290119910100001011 instance count...
2021-01-29 03:01:53,258 INFO Asg dev-gvh-worker19940502290119910100001011 does not have enough running instances to proceed
2021-01-29 03:01:53,258 INFO Actual instances: 3 Desired instances: 4
2021-01-29 03:01:53,258 INFO Validation failed for asg dev-gvh-worker19940502290119910100001011. Not enough instances online.
2021-01-29 03:01:53,258 INFO Exiting since ASG healthcheck failed after 1 attempts
2021-01-29 03:01:53,258 ERROR ASG healthcheck failed
2021-01-29 03:01:53,258 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-01-29 03:01:53,258 ERROR AWS Auto Scaling Group processes will need resuming manually
As discussed in kubernetes/kubernetes#65013, cordoning the nodes while rolling the node pool removes all of them from the load balancer, which might cause downtime. Instead, we can taint the nodes to make them non-schedulable.
Update: this might not cause downtime as the scale up of nodes in the ASG is done before cordoning of the nodes.
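The taint-based alternative could be sketched as below: rather than setting `spec.unschedulable` (which is what cordoning does), apply a NoSchedule taint via a node patch. The taint key is a made-up example, and wiring this into the tool via the Kubernetes Python client's `patch_node` is an assumption, not the project's current behaviour.

```python
# Sketch of tainting instead of cordoning. The taint key below is
# hypothetical; any unique key with effect NoSchedule would do.
def taint_patch(existing_taints=None):
    """Build a node patch body that appends a NoSchedule taint."""
    taints = list(existing_taints or [])
    taints.append({
        'key': 'eks-rolling-update/draining',  # hypothetical key
        'value': 'true',
        'effect': 'NoSchedule',
    })
    return {'spec': {'taints': taints}}

# Applied with the Kubernetes Python client this would look roughly like:
#   kubernetes.client.CoreV1Api().patch_node(node_name,
#                                            taint_patch(node.spec.taints))
```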
Hi!
I want to run the eks-rolling-update tool with least-privilege permissions in my environment. I created an IAM role with the required permissions listed in the README.md file and found a typo in one of them: ec2:DescribeInstances should have an "s" at the end.
If I am not wrong, the "autoscaling:DescribeLaunchConfigurations" permission is also required, in addition to those listed in the README, for this utility to work with launch configurations.
Did anyone actually test this tool with the permissions listed in the README file?
Also, I'm planning to grant least-privilege permissions on the Kubernetes side; I have created this ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks-rolling
rules:
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["delete", "get", "list", "patch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "replicasets"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["update", "get"]
Could anyone please confirm if the permissions listed in this ClusterRole are sufficient for this utility to work properly?
Many thanks
K8S_AUTOSCALER_ENABLED and DRY_RUN are used as bool values in the code, while CLUSTER_HEALTH_WAIT, GLOBAL_MAX_RETRY, GLOBAL_HEALTH_WAIT, BETWEEN_NODES_WAIT, and RUN_MODE are used as int values. Since environment variables always arrive as strings, the code will need to cast them appropriately.
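A small sketch of the casting this issue asks for. Note that a naive `bool(os.environ['DRY_RUN'])` is wrong, since any non-empty string (including `"false"`) is truthy; the helper names below are illustrative, not the project's actual code.

```python
import os

# Hedged sketch of env-var casting: parse booleans from common truthy
# spellings and integers via int(), with explicit defaults.
def env_bool(name, default=False):
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ('1', 'true', 'yes', 'on')

def env_int(name, default=0):
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default
```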
Currently, if you need some sort of proxy to access your K8s master/API server, eks-rolling-update doesn't seem to work. The kubectl bits work fine because kubectl supports the proxy environment variables out of the box. With a bit of digging, this appears to be because the Kubernetes Python client library does not. There is a suggested implementation at kubernetes-client/python#333 (comment) which could possibly be ported to eks-rolling-update to add support.
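The workaround could be sketched as below: read the standard proxy variables ourselves and hand the result to the client configuration (the Kubernetes Python client exposes a `proxy` field on its `Configuration` object but does not populate it from the environment). The exact wiring into eks-rolling-update is an assumption based on the upstream discussion.

```python
import os

# Sketch: resolve a proxy URL from the conventional env vars, which the
# kubernetes Python client ignores on its own.
def proxy_from_env(scheme='https'):
    for name in (scheme.upper() + '_PROXY', scheme.lower() + '_proxy'):
        value = os.environ.get(name)
        if value:
            return value
    return None

# Usage (assumption, following kubernetes-client/python#333):
#   configuration = kubernetes.client.Configuration.get_default_copy()
#   configuration.proxy = proxy_from_env()
```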
I'm currently setting up EKS clusters with worker groups based on launch templates (using terraform-aws-eks).
Gave this tool a spin, and so far all attempts aborted while waiting for the first node to detach from the ASG.
Now of course it's possible to increase GLOBAL_MAX_RETRY, and that works, typically around the 16th attempt or so. But it raises some questions. See the fragment of aws autoscaling describe-auto-scaling-groups output below: wouldn't checking for LifecycleState=Detaching suffice? It could speed up the rolling update process by a significant amount.
{
  "InstanceId": "i-nnnnnnnnn",
  "AvailabilityZone": "eu-west-1c",
  "LifecycleState": "Detaching",
  "HealthStatus": "Unhealthy",
  "LaunchTemplate": {
    "LaunchTemplateId": "lt-nnnnnnnnnn",
    "LaunchTemplateName": "apps-ng-annnnnnnnn",
    "Version": "2"
  },
  "ProtectedFromScaleIn": false
}
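The suggested check could be sketched as a predicate over the parsed describe-auto-scaling-groups output: treat an instance as already on its way out as soon as its lifecycle state indicates detachment or termination, instead of retrying until it disappears from the group. The set of states below follows the documented ASG lifecycle; the function name is illustrative.

```python
# Sketch: short-circuit the detach wait by inspecting LifecycleState in the
# instance dicts returned by describe-auto-scaling-groups.
LEAVING_STATES = {'Detaching', 'Detached', 'Terminating', 'Terminated'}

def instance_is_leaving(instance):
    """True if the ASG has already begun removing this instance."""
    return instance.get('LifecycleState') in LEAVING_STATES
```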
This isn't a bug or an issue.
What is the meaning of the log line below, which sometimes appears during a script run? Do we need to find out which request is being throttled?
I0209 21:27:49.835447 59139 request.go:645] Throttling request took 2.429459004s, request: GET:https://xxxxxxxxxxxxxxx.gr7.us-west-2.eks.amazonaws.com/api/v1/namespaces/xxxxx-yyyy/pods/aaa-bbb-ccc-dddd-6b4f44c7f8-vtsr2
I have installed the latest Python 3 and tried to install this module with both pip and pip3.
trial-1
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update
ERROR: Could not find a version that satisfies the requirement eks-rolling-update (from versions: none)
ERROR: No matching distribution found for eks-rolling-update
trial-2
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install "eks-rolling-update==1.0.96"
ERROR: Could not find a version that satisfies the requirement eks-rolling-update==1.0.96 (from versions: none)
ERROR: No matching distribution found for eks-rolling-update==1.0.96
trial-3 - downloaded the tar.gz file and tried installing through pip3 install
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update-1.0.96.tar.gz
Processing ./eks-rolling-update-1.0.96.tar.gz
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-n36f_y60/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-n36f_y60/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-4fa6hv6b
cwd: /tmp/pip-req-build-n36f_y60/
Complete output (32 lines):
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-req-build-n36f_y60/setup.py", line 7, in
setuptools.setup()
File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.6/distutils/core.py", line 121, in setup
dist.parse_config_files()
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 494, in parse_config_files
ignore_option_errors=ignore_option_errors)
trial-4
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update --trusted-host files.pythonhosted.org --trusted-host pypi.org --trusted-host pypi.python.org
ERROR: Could not find a version that satisfies the requirement eks-rolling-update (from versions: none)
ERROR: No matching distribution found for eks-rolling-update
I am not able to find any other way to install this module. I have the latest TLS (1.3); if you could help me install this module, it would be a great help.
Since kubernetes released v12.0.0 of the Python client library, I have this error:
null_resource.eks-rolling-update (local-exec): 2020-10-16 20:17:02,122 INFO Describing autoscaling groups...
null_resource.eks-rolling-update (local-exec): 2020-10-16 20:17:02,544 INFO Pausing k8s autoscaler...
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
null_resource.eks-rolling-update (local-exec): (self._dns_host, self.port), self.timeout, **extra_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
null_resource.eks-rolling-update (local-exec): raise err
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
null_resource.eks-rolling-update (local-exec): sock.connect(sa)
null_resource.eks-rolling-update (local-exec): ConnectionRefusedError: [Errno 111] Connection refused
null_resource.eks-rolling-update (local-exec): During handling of the above exception, another exception occurred:
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
null_resource.eks-rolling-update (local-exec): chunked=chunked,
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
null_resource.eks-rolling-update (local-exec): conn.request(method, url, **httplib_request_kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
null_resource.eks-rolling-update (local-exec): self._send_request(method, url, body, headers, encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
null_resource.eks-rolling-update (local-exec): self.endheaders(body, encode_chunked=encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
null_resource.eks-rolling-update (local-exec): self._send_output(message_body, encode_chunked=encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
null_resource.eks-rolling-update (local-exec): self.send(msg)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 966, in send
null_resource.eks-rolling-update (local-exec): self.connect()
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
null_resource.eks-rolling-update (local-exec): conn = self._new_conn()
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
null_resource.eks-rolling-update (local-exec): self, "Failed to establish a new connection: %s" % e
null_resource.eks-rolling-update (local-exec): urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc53687a3d0>: Failed to establish a new connection: [Errno 111] Connection refused
null_resource.eks-rolling-update (local-exec): During handling of the above exception, another exception occurred:
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/bin/eks_rolling_update.py", line 8, in <module>
null_resource.eks-rolling-update (local-exec): sys.exit(main())
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/eksrollup/cli.py", line 282, in main
null_resource.eks-rolling-update (local-exec): modify_k8s_autoscaler("pause")
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/eksrollup/lib/k8s.py", line 83, in modify_k8s_autoscaler
null_resource.eks-rolling-update (local-exec): body
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py", line 4511, in patch_namespaced_deployment
null_resource.eks-rolling-update (local-exec): return self.patch_namespaced_deployment_with_http_info(name, namespace, body, **kwargs) # noqa: E501
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py", line 4636, in patch_namespaced_deployment_with_http_info
null_resource.eks-rolling-update (local-exec): collection_formats=collection_formats)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
null_resource.eks-rolling-update (local-exec): _preload_content, _request_timeout, _host)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
null_resource.eks-rolling-update (local-exec): _request_timeout=_request_timeout)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 413, in request
null_resource.eks-rolling-update (local-exec): body=body)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 300, in PATCH
null_resource.eks-rolling-update (local-exec): body=body)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 172, in request
null_resource.eks-rolling-update (local-exec): headers=headers)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 80, in request
null_resource.eks-rolling-update (local-exec): method, url, fields=fields, headers=headers, **urlopen_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 171, in request_encode_body
null_resource.eks-rolling-update (local-exec): return self.urlopen(method, url, **extra_kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 330, in urlopen
null_resource.eks-rolling-update (local-exec): response = conn.urlopen(method, u.request_uri, **kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
null_resource.eks-rolling-update (local-exec): method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
null_resource.eks-rolling-update (local-exec): raise MaxRetryError(_pool, url, error or ResponseError(cause))
null_resource.eks-rolling-update (local-exec): urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /apis/apps/v1/namespaces/kube-system/deployments/cluster-autoscaler (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc53687a3d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Reverting manually to v11.0.0 works as expected.
During our node rotation, AZRebalance rebalanced the nodes; at that time both the desired and running node counts were 6.
The ASG added an extra node to balance instances across AZs, regardless of the desired count.
As soon as the new node started, the ASG found 7 nodes running against a desired count of 6, so it killed an old node (because of the OldestLaunchConfiguration termination policy) to match the desired count.
After the last worker node rotation, eks-rolling-update changed the ASG from 6 to 4, thinking the activity was complete, which caused one node to terminate abruptly.
So for the next cluster upgrade we updated the script locally to include AZRebalance in the suspended processes. It would be good if that fix were included here as well.
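The local fix described above could be sketched as follows. AZRebalance is a real AWS scaling-process name; which other processes eks-rolling-update suspends today is an assumption here, and the boto3 call is shown as a comment.

```python
# Sketch: suspend AZRebalance alongside the scaling processes so the ASG
# cannot terminate old nodes mid-rotation to rebalance across AZs.
# Exactly which processes the tool suspends today is an assumption.
PROCESSES_TO_SUSPEND = ['Launch', 'Terminate', 'AZRebalance']

# With boto3 this would be applied as:
#   boto3.client('autoscaling').suspend_processes(
#       AutoScalingGroupName=asg_name,
#       ScalingProcesses=PROCESSES_TO_SUSPEND,
#   )
```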
I am getting this issue, any clue?
I already removed the tags from the autoscaling group after the first failed run.
(venv) [centos@ip-10-16-35-7 eks-rolling-update]$ ./roll.sh
2021-02-02 19:05:45,792 INFO Describing autoscaling groups...
2021-02-02 19:05:46,067 INFO Pausing k8s autoscaler...
2021-02-02 19:05:46,164 INFO K8s autoscaler modified to replicas: 0
2021-02-02 19:05:46,164 INFO *** Checking for nodes older than 7 days in autoscaling group ms-dev-apps-general_purpose_xlarge20200617203556813100000014 ***
2021-02-02 19:05:46,353 INFO Instance id i-051b2437c0caa8d09 : OK
2021-02-02 19:05:46,729 INFO Instance id i-07f5e4c9aa84b8579 : OK
2021-02-02 19:05:46,860 INFO Instance id i-0867d2b14b2ef7b95 : OK
2021-02-02 19:05:46,900 ERROR list index out of range
2021-02-02 19:05:46,900 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-02 19:05:46,900 ERROR AWS Auto Scaling Group processes will need resuming manually
2021-02-02 19:05:46,900 ERROR Kubernetes Cluster Autoscaler will need resuming manually
This is my config
export ASG_NAMES="ms-dev-apps-general_purpose_xlarge20200617203556813100000014"
export K8S_AUTOSCALER_ENABLED=True
export K8S_AUTOSCALER_NAMESPACE="kube-system"
export K8S_AUTOSCALER_DEPLOYMENT="cluster-autoscaler-autodetect-aws-cluster-autoscaler"
export K8S_AUTOSCALER_REPLICAS=1
export EXTRA_DRAIN_ARGS="--delete-local-data=true --disable-eviction=true --force=true --grace-period=10 --ignore-daemonsets=true"
export MAX_ALLOWABLE_NODE_AGE=7
export RUN_MODE=4
I am using the latest version
Allow the user to explicitly select the context to be used from the kubeconfig. Given the number of tools that can alter the current-context
of the kubeconfig file, it isn't always safe to rely on it.
Description
If draining a node fails part way through (due to, say, a Kube API error, or a Pod which can't be evicted), the tool ends up in an unrecoverable state that requires manually editing the ASG to remove the tags.
Details
This happens because the tool loses track of the desired capacity as it gradually drains nodes one by one, rather than keeping track of where it got to using the tags.
Take a scenario like:
ASG original_capacity: 3
Outdated nodes: 3
The tool scales the ASG up and starts draining; after the first node has been drained and terminated, instances online = 5. When you re-run the tool it sees:
desired_capacity = 6 (from the ASG tag)
current node count = 5 (from the cluster)
and then fails per the below:
2020-01-26 16:44:52,264 INFO Current k8s node count is 5
2020-01-26 16:44:53,444 INFO Setting the scale of ASG test-asg based on 2 outdated instances.
2020-01-26 16:44:53,444 INFO Modifying asg test-asg autoscaling to resume ...
2020-01-26 16:44:53,559 INFO Found previous desired capacity value tag set on asg from a previous run.
2020-01-26 16:44:53,559 INFO Maintaining previous capacity of 5 to not overscale.
2020-01-26 16:44:53,559 INFO Describing autoscaling groups...
2020-01-26 16:44:53,828 INFO Current asg instance count in cluster is: 5. K8s node count should match this number
2020-01-26 16:44:53,828 INFO Checking asg test-asg instance count...
2020-01-26 16:44:53,901 INFO Asg test-asg does not have enough running instances to proceed
2020-01-26 16:44:53,901 INFO Actual instances: 5 Desired instances: 6
2020-01-26 16:44:53,901 INFO Validation failed for asg test-asg. Not enough instances online.
2020-01-26 16:44:53,901 INFO Exiting since ASG healthcheck failed
2020-01-26 16:44:53,901 INFO ASG healthcheck failed
2020-01-26 16:44:53,901 INFO *** Rolling update of asg has failed. Exiting ***
Expected Behaviour
In the case above, when the tool re-runs it should really see desired_capacity = 5, because the first node was successfully drained and terminated and is no longer expected to be healthy.
Resolving this somehow would mean re-running the tool would be able to resume where it left off.
Possible Solution
I would suggest that immediately prior to terminating each instance, the desired_capacity tag value should be decremented to represent the intended cluster state. I think this would mean that when/if the tool is re-run, it would be able to recover from where it left off and drain/terminate the remaining nodes.
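The proposed fix could be sketched as a pure transformation over the ASG's tag list, applied just before each termination. The tag key matches the one seen in this tool's logs; the helper itself is hypothetical.

```python
# Sketch of the proposed fix: decrement the desired-capacity tag before
# terminating each instance so a re-run resumes from the true state.
DESIRED_TAG = 'eks-rolling-update:desired_capacity'

def decremented_desired_tag(tags):
    """Return a copy of the tag list with the desired capacity reduced by 1."""
    updated = []
    for tag in tags:
        if tag['Key'] == DESIRED_TAG:
            tag = dict(tag, Value=str(int(tag['Value']) - 1))
        updated.append(tag)
    return updated
```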
I was looking for where #117 was released and discovered that the latest build of master failed: https://github.com/hellofresh/eks-rolling-update/runs/2716443115?check_suite_focus=true
Sadly logs are no longer present :(
Hi there,
I've been using this tool for a while, and recently the dry-run behaviour changed. Unfortunately, I can't say which version introduced this :-(
I have RUN_MODE set to 1, but when I run the tool with the -p CLI flag, it runs with RUN_MODE=4.
Expected behaviour:
dry run pays attention to the RUN_MODE variable
Running with the following
K8S_AUTOSCALER_ENABLED=false python3 ./eks_rolling_update.py -c sandbox
At the end it still tries to resume the autoscaler:
2019-10-23 10:37:47,698 INFO All asgs processed
2019-10-23 10:37:47,755 INFO Resuming k8s autoscaler...
2019-10-23 10:37:47,755 INFO Missing the required parameter `name` when calling `patch_namespaced_deployment`
2019-10-23 10:37:47,755 INFO *** Rolling update of asg has failed. Exiting ***
We struggle to update clusters at times because changing circumstances can require our cluster to scale up the number of nodes. If cluster-autoscaler wasn't shut down, we could allow our cluster to scale up for increases in demand during the process.
I'm not sure of the best way to handle this, but it'd be fantastic if we could.
I think the primary issue with leaving the autoscaler on is that it will prefer to shut down nodes if nothing has scheduled there. This means that the nodes that get spun up before rotating nodes will be shut down prematurely. To combat this, we could annotate those nodes with "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
. It might be hard to determine which nodes are "new", but disabling scale down on all nodes that match the new launch configuration should be a good heuristic. If new nodes join, they'll also need to be annotated, however. I believe it's also possible to disable scale-down entirely, but that would require modifying the autoscaler deployment so that's less attractive for that reason.
The next issue would be the desired count increasing while nodes are being rotated. This would mess with eks-rolling-update, as the original count will have diverged from where it was. eks-rolling-update could perhaps tolerate increases to that number and update the ASG tags to match. If the number went down unexpectedly, that would still be an issue causing the tool to abort.
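The annotation approach discussed above could be sketched as below. The annotation key is the real cluster-autoscaler one; wiring it into the tool via `patch_node` is an assumption about how this would be implemented.

```python
# Sketch: mark nodes matching the new launch configuration so
# cluster-autoscaler will not scale them down mid-rotation.
SCALE_DOWN_DISABLED = 'cluster-autoscaler.kubernetes.io/scale-down-disabled'

def scale_down_disabled_patch():
    """Build a node patch body carrying the scale-down-disabled annotation."""
    return {'metadata': {'annotations': {SCALE_DOWN_DISABLED: 'true'}}}

# e.g. kubernetes.client.CoreV1Api().patch_node(node_name,
#                                               scale_down_disabled_patch())
```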
I searched for related issues, but didn't see anything, so I apologize if this has already been noted.
It would be great to document what level of access eks-rolling-update needs to run successfully. This is useful when the script is run by automation with a specific IAM role.
Some of the nodes in an EKS cluster might not be controlled by an ASG, for instance when a few nodes are created via an ASG and others are managed by Spotinst.
eks-rolling-update should exclude those nodes from the rolling operation.
Ran into the error displayed below.
Command invoked:
GLOBAL_MAX_RETRY=30 python eks_rolling_update.py --cluster_name apps-1
Kind of odd, as a similar operation recently worked fine.
2019-12-12 06:54:58,337 INFO Found 2 outdated instances
2019-12-12 06:54:59,462 INFO Getting k8s nodes...
2019-12-12 06:55:00,790 INFO Current k8s node count is 2
2019-12-12 06:55:00,790 INFO Setting the scale of ASG apps-1-ng-a20191211120754245000000001 based on 2 outdated instances.
2019-12-12 06:55:00,791 INFO Modifying asg apps-1-ng-a20191211120754245000000001 autoscaling to resume ...
2019-12-12 06:55:01,076 INFO No previous capacity value tags set on ASG; setting tags.
2019-12-12 06:55:01,077 INFO Saving tag to asg key: eks-rolling-update:original_capacity, value : 2...
2019-12-12 06:55:01,374 INFO Saving tag to asg key: eks-rolling-update:desired_capacity, value : 4...
2019-12-12 06:55:01,601 INFO Saving tag to asg key: eks-rolling-update:original_max_capacity, value : 5...
2019-12-12 06:55:01,933 INFO Setting asg desired capacity from 2 to 4 and max size to 5...
2019-12-12 06:55:02,189 INFO Waiting for 90 seconds for ASG to scale before validating cluster health...
2019-12-12 06:56:32,191 INFO Describing autoscaling groups...
2019-12-12 06:56:33,206 INFO Current asg instance count in cluster is: 4. K8s node count should match this number
2019-12-12 06:56:33,207 INFO Checking asg apps-1-ng-a20191211120754245000000001 instance count...
2019-12-12 06:56:33,397 INFO Asg apps-1-ng-a20191211120754245000000001 scaled OK
2019-12-12 06:56:33,398 INFO Actual instances: 4 Desired instances: 4
2019-12-12 06:56:33,398 INFO '<' not supported between instances of 'int' and 'str'
2019-12-12 06:56:33,398 INFO *** Rolling update of asg has failed. Exiting ***
Edit: I use GLOBAL_MAX_RETRY because of #22. A quick scan through the codebase suggests that's the culprit, as the value is parsed as a str.
First of all, this project has helped me a lot in performing rolling upgrades of nodes on our Kubernetes cluster.
I have noticed that on Kubernetes Version 1.19 and above I get the following error message and the rolling upgrade process ends.
2021-04-22 15:35:16,761 INFO Checking k8s expected nodes are online after asg scaled up...
2021-04-22 15:35:16,787 ERROR 'NoneType' object is not iterable
I was able to get around this by upgrading the Kubernetes client library to the latest version on PyPI.
I am opening this issue to help others who might have been in a similar situation.
Currently, eks-rolling-update only supports launch configurations. Doing a plan (dry run) over a cluster created with a launch template instead (using eksctl) will throw this error:
2019-09-06 17:50:31,413 INFO Describing autoscaling groups...
2019-09-06 17:50:31,979 INFO *** Checking autoscaling group eksctl-nicolas-test-cluster-nodegroup-ng-1-NodeGroup-1476EK4LIRYXE ***
Traceback (most recent call last):
File "eks_rolling_update.py", line 152, in <module>
plan_asgs(filtered_asgs)
File "/app/lib/aws.py", line 248, in plan_asgs
asg_lc_name = asg['LaunchConfigurationName']
KeyError: 'LaunchConfigurationName'
Any plan on supporting the new and shiny launch templates? Thanks!
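A version-agnostic lookup could avoid the KeyError above by falling back to the LaunchTemplate key (or the MixedInstancesPolicy variant) when LaunchConfigurationName is absent. The field names below follow the AWS Auto Scaling API response shape; the function is illustrative, not the project's actual plan_asgs code.

```python
# Sketch: resolve either the launch configuration or the launch template
# identifier from a describe-auto-scaling-groups ASG dict.
def asg_launch_identifier(asg):
    if 'LaunchConfigurationName' in asg:
        return ('launch-configuration', asg['LaunchConfigurationName'])
    lt = asg.get('LaunchTemplate') or asg.get('MixedInstancesPolicy', {}) \
        .get('LaunchTemplate', {}).get('LaunchTemplateSpecification')
    if lt:
        return ('launch-template', lt['LaunchTemplateId'], lt.get('Version'))
    raise ValueError('ASG has neither a launch configuration nor a launch template')
```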
We have a big and busy EKS cluster, with nodes joining and leaving many times a day (spot instances failing or being replaced). We try to update each ASG separately with the ASG_NAMES setting. The problem is that eks-rolling-update always checks the whole cluster for node count, and it frequently fails because the count does not match the expected value.
It should only monitor the selected ASG(s) for expected instance count.
2021-02-10 16:26:57,425 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:18,198 INFO Getting k8s nodes...
2021-02-10 16:27:19,341 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:40,119 INFO Getting k8s nodes...
2021-02-10 16:27:41,470 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Waiting for k8s nodes to reach count 92...
...
2021-02-10 16:28:01,472 INFO Validation failed for cluster *****. Didn't reach expected node count 92.
2021-02-10 16:28:01,472 INFO Exiting since ASG healthcheck failed after 2 attempts
2021-02-10 16:28:01,472 ERROR ASG healthcheck failed
2021-02-10 16:28:01,472 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-10 16:28:01,472 ERROR AWS Auto Scaling Group processes will need resuming manually
I used the environment variable ASG_NAMES to select a specific Auto Scaling Group, but then the expected-node-count condition is never met: it counts nodes from all ASGs.
Log example:
2021-01-08 13:49:56,753 INFO Current asg instance count in cluster is: 6. K8s node count should match this number
2021-01-08 13:49:56,754 INFO Checking k8s expected nodes are online after asg scaled up...
2021-01-08 13:49:57,402 INFO Getting k8s nodes...
2021-01-08 13:49:58,343 INFO Current k8s node count is 21
2021-01-08 13:49:58,343 INFO Current k8s node count is 21
2021-01-08 13:49:58,343 INFO Waiting for k8s nodes to reach count 6...
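The fix both of these reports ask for could be sketched as counting only the nodes backed by instances in the selected ASG(s), instead of comparing against the whole cluster. The node dict shape (an `instance_id` field) is an assumption for illustration.

```python
# Sketch: when ASG_NAMES is set, validate health against the selected
# ASGs' instances only, not the whole cluster's node count.
def count_selected_nodes(nodes, selected_instance_ids):
    """Count k8s nodes whose backing EC2 instance is in the selected ASGs."""
    return sum(
        1 for node in nodes
        if node.get('instance_id') in selected_instance_ids
    )
```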
Currently, after setting the desired capacity of the ASG, the script simply waits CLUSTER_HEALTH_WAIT seconds once (without any retries) before checking whether all instances have come online. This works in the best case, but we observe a good amount of variance in how long AWS takes to bring instances online (in the example that led to this issue, it took 9 minutes).
I know I can increase CLUSTER_HEALTH_WAIT to 600s, but then the script always waits 10 minutes, which is a waste of time. So my request is to add a retry, so that we can have a worst-case timeout without increasing the rollout time in the best case.
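The requested retry could be sketched as a poll loop: check health, and sleep CLUSTER_HEALTH_WAIT between attempts up to a retry limit, so the worst-case timeout is `wait_seconds * retries` without penalizing the fast case. The retry-count setting name is a hypothetical companion to the existing one.

```python
import time

# Sketch: poll cluster health in intervals instead of a single fixed sleep.
def wait_for_scale(is_healthy, wait_seconds, retries, sleep=time.sleep):
    """Return True as soon as is_healthy() passes, False after all retries."""
    for attempt in range(1, retries + 1):
        if is_healthy():
            return True
        if attempt < retries:
            sleep(wait_seconds)
    return False
```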
The K8S_CONTEXT environment variable is used when setting up the Python Kubernetes API client, but the actual node draining is performed by shelling out to kubectl rather than using the API, and the --context flag (and the K8S_CONTEXT variable) are not passed when doing so. You can work around this by using the EXTRA_DRAIN_ARGS variable, but it isn't documented that you need to do so, and it probably shouldn't be necessary at all.
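The fix I'd expect looks roughly like this: forward K8S_CONTEXT into the kubectl command line. A sketch only; the tool's real command construction differs, and the function name is illustrative.

```python
import os

def build_drain_command(node_name, extra_drain_args=()):
    """Build the kubectl drain invocation, forwarding K8S_CONTEXT as
    --context so the shell-out talks to the same cluster as the
    Python API client."""
    cmd = ["kubectl", "drain", node_name, "--ignore-daemonsets"]
    context = os.environ.get("K8S_CONTEXT")
    if context:
        cmd += ["--context", context]
    cmd += list(extra_drain_args)
    return cmd  # callers would hand this to subprocess.call()
```

Until something like this lands, setting EXTRA_DRAIN_ARGS="--context my-context" achieves the same effect.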
While the script was running, it suddenly threw: An error occurred (RequestExpired) when calling the DescribeInstances operation: Request has expired.
Hi
I have been looking into this product and testing it, and I am hitting a stumbling block: the Kubernetes Python client cannot be configured, even though it is installed. Is this a known issue, or is there any way we can dig deeper into what the Kubernetes Python client's dependencies are?
[ root$] docker run -ti --rm -e AWS_DEFAULT_REGION -v "/root/.aws/config" -v "/root/.kube/us-gpd" eks-rolling-update:latest -c gpdeks1
2021-02-25 02:31:16,139 INFO Describing autoscaling groups...
2021-02-25 02:31:16,444 ERROR Could not configure kubernetes python client
2021-02-25 02:31:16,444 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-25 02:31:16,444 ERROR AWS Auto Scaling Group processes will need resuming manually
Thanks
Hi!
We've implemented eks-rolling-update script as a separate stage in our CI (Gitlab).
There are 2 clusters involved:
Once the script is executed inside a Gitlab runner, we receive the following permissions-related error:
$ eks_rolling_update.py --cluster_name ${TF_VAR_cluster_name}
2020-12-10 13:28:28,187 INFO Describing autoscaling groups...
2020-12-10 13:28:28,194 INFO Pausing k8s autoscaler...
2020-12-10 13:28:28,203 INFO Scaling of k8s autoscaler failed. Error code was Forbidden, {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"cluster-autoscaler\" is forbidden: User \"system:serviceaccount:eks:default\" cannot patch resource \"deployments\" in API group \"apps\" in the namespace \"kube-system\"","reason":"Forbidden","details":{"name":"cluster-autoscaler","group":"apps","kind":"deployments"},"code":403}. Exiting.
That user "system:serviceaccount:eks:default" belongs to the "gitlab-runners" cluster, not to "dev" (the eks namespace exists only in the "gitlab-runners" cluster). Moreover, if we get inside this Gitlab runner's container and scale the autoscaler manually, everything works fine, and the deployment in the "dev" cluster scales up and down.
That means the kubeconfig file and AWS credentials are configured properly.
Note that eks_rolling_update.py also works fine locally (with the same variables and creds used in CI).
Below is our eks-rolling-upgrade stage in Gitlab CI (aws cli, kubectl and eks-rolling-update are already preinstalled in the image):
upgrade:
  stage: rolling-upgrade
  variables:
    AWS_DEFAULT_REGION: "eu-west-1"
    K8S_AUTOSCALER_ENABLED: "true"
    GLOBAL_MAX_RETRY: 20
    K8S_AUTOSCALER_NAMESPACE: "kube-system"
    K8S_AUTOSCALER_DEPLOYMENT: "cluster-autoscaler"
    K8S_AUTOSCALER_REPLICAS: 2
    KUBECONFIG: "/root/.kube/config"
  script:
    - source variables
    - aws eks --region eu-west-1 update-kubeconfig --name ${TF_VAR_cluster_name}
    - eks_rolling_update.py --cluster_name ${TF_VAR_cluster_name}
  when: manual
  timeout: 4h
If any additional details are needed, please let me know.
Looking forward to your reply.
Thanks in advance!
Version of eks-rolling-update: most recent (10-Dec-2020)
version of Kubernetes: 1.18
Hi,
I've found that periodically the reliability of scale-up can be questionable in AWS. In particular, I've seen cases where the EC2 node never registers with the EKS cluster when doing a large amount of scaling (for instance 100+ nodes at once). This results in eks-rolling-update blocking indefinitely because the k8s node count never matches the EC2 count.
I modified eks-rolling-update to include a batch-count environment variable that instructs the tool to scale up X nodes at a time. This appears to have eliminated the flakiness I was experiencing.
I would like to send a PR for the enhancement, but thought I should ask whether the owners of this repo would be interested before I do. In my implementation, batching is optional by default.
Lemme know. Thanks.
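To make the proposal concrete, the batching itself is just splitting the scale-up into chunks. A sketch, with a hypothetical batch-count parameter (not an existing setting of the tool):

```python
def scale_up_batches(nodes_to_add, batch_count=0):
    """Split a scale-up of nodes_to_add instances into batches of at
    most batch_count; batch_count <= 0 means no batching (current
    behaviour). The tool would raise the ASG desired capacity once per
    batch and wait for the new nodes to register before the next batch."""
    if batch_count <= 0:
        return [nodes_to_add] if nodes_to_add else []
    full, rem = divmod(nodes_to_add, batch_count)
    return [batch_count] * full + ([rem] if rem else [])
```

For a 100-node scale-up with a batch count of 25, the tool would perform four waits of 25 nodes each rather than one wait for all 100.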
Hi,
Great tool and thanks for providing this !
Just an idea for a feature: a "force refresh" option, so we can recycle servers even if the launch template has not been updated.
Reason: We use this process across our other EC2 servers, to make sure they are always patched and refreshed.
Thanks in advance
eks-rolling-update/eksrollup/lib/aws.py
Lines 112 to 126 in 70306ef
Causes a failure to run when you have more running instances than desired:
00:04:19.030 2022-01-25 16:24:43,568 INFO Checking asg golf-dev-mgmt-worker-node-0-20191108085402700900000002 instance count...
00:04:19.030 2022-01-25 16:24:43,701 INFO Asg golf-dev-mgmt-worker-node-0-20191108085402700900000002 does not have enough running instances to proceed
00:04:19.030 2022-01-25 16:24:43,701 INFO Actual instances: 7 Desired instances: 6
This script seems to be useful and in general does the right thing. But I think one step is missing for the smoothest operation: all the outdated nodes should be cordoned before any draining starts. Otherwise pods might be migrated to other outdated nodes when a node is drained, which means they would be migrated more than once during the rolling update.
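Concretely, cordoning is just patching spec.unschedulable on each outdated node up front. A sketch with the API client injected (v1_api would be a kubernetes.client.CoreV1Api; the function name is illustrative):

```python
# The patch body that `kubectl cordon` applies to a node.
CORDON_PATCH = {"spec": {"unschedulable": True}}

def cordon_outdated_nodes(v1_api, node_names):
    """Mark every outdated node unschedulable before the first drain,
    so pods evicted from one node cannot be rescheduled onto another
    node that is about to be drained anyway."""
    for name in node_names:
        v1_api.patch_node(name, CORDON_PATCH)
```

Injecting the client keeps the helper trivially testable with a fake API object.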
From the time a node goes down until it is de-registered from the Load Balancer, it will still be receiving requests. Reducing the thresholds and retries before the LB considers the node unavailable could improve things, but at least a few seconds will still be needed, and it could also cause other issues, like de-registering nodes for the wrong reasons. Any suggestions?
Even after increasing the value of GLOBAL_HEALTH_RETRY and/or GLOBAL_HEALTH_WAIT, I get the error below:
Command: python eks_rolling_update.py --cluster_name YOUR_EKS_CLUSTER_NAME
2021-01-29 03:00:21,729 INFO Found 3 outdated instances
2021-01-29 03:00:22,349 INFO Getting k8s nodes...
2021-01-29 03:00:22,743 INFO Current k8s node count is 3
2021-01-29 03:00:22,743 INFO Setting the scale of ASG dev-gvh-worker19940502290119910100001011 based on 3 outdated instances.
2021-01-29 03:00:22,743 INFO Modifying asg dev-gvh-worker19940502290119910100001011 autoscaling to resume ...
2021-01-29 03:00:22,976 INFO Found previous desired capacity value tag set on asg from a previous run.
2021-01-29 03:00:22,976 INFO Maintaining previous capacity of 3 to not overscale.
2021-01-29 03:00:22,976 INFO Waiting for 90 seconds before validating cluster health...
2021-01-29 03:01:52,981 INFO Checking asg dev-gvh-worker19940502290119910100001011 instance count...
2021-01-29 03:01:53,258 INFO Asg dev-gvh-worker19940502290119910100001011 does not have enough running instances to proceed
2021-01-29 03:01:53,258 INFO Actual instances: 3 Desired instances: 4
2021-01-29 03:01:53,258 INFO Validation failed for asg dev-gvh-worker19940502290119910100001011. Not enough instances online.
2021-01-29 03:01:53,258 INFO Exiting since ASG healthcheck failed after 1 attempts
2021-01-29 03:01:53,258 ERROR ASG healthcheck failed
2021-01-29 03:01:53,258 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-01-29 03:01:53,258 ERROR AWS Auto Scaling Group processes will need resuming manually
First of all, thanks for this wonderful project.
Is it possible to add changelog/release notes for each version release?
ATM, it is hard to tell what has changed when I upgrade across multiple versions.
Is there any specific reason for not using the API to drain the node? I have code that I use reliably to drain nodes via the API. Is there any way I can contribute that?
Please let me know.
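For reference, an API-based drain boils down to listing the node's pods and creating Eviction objects for them, which honours PodDisruptionBudgets. A bare-bones sketch (core_v1 would be a kubernetes.client.CoreV1Api; a real drain also needs DaemonSet/mirror-pod filtering and retries on 429 responses from PDBs):

```python
def drain_node_via_api(core_v1, node_name):
    """Evict every pod on node_name through the Eviction API,
    instead of shelling out to kubectl drain."""
    pods = core_v1.list_pod_for_all_namespaces(
        field_selector="spec.nodeName=" + node_name).items
    for pod in pods:
        eviction = {
            "apiVersion": "policy/v1",
            "kind": "Eviction",
            "metadata": {"name": pod.metadata.name,
                         "namespace": pod.metadata.namespace},
        }
        core_v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction)
```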
Hi there, I'm new to this rolling upgrade tool, so my question may be silly. Why not just use the eksctl upgrade command? Doesn't it support a rolling upgrade, or would it cause downtime for users' applications? Thanks.
I'm sure this is something on my end, but I can't figure out where else to look.
We currently manage 4 clusters, all on EKS, all deployed using https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest. Each deployment has 2 node groups, uses a launch template, and runs a 50/50 on-demand/spot mix.
I get errors like:
2020-10-28 14:00:56,916 INFO Instance is failing to terminate. Cancelling out.
2020-10-28 14:00:56,916 ERROR ('Rolling update on ASG failed', 'eks-epx-qa-video-video20201005154444675300000010')
2020-10-28 14:00:56,916 ERROR *** Rolling update of ASG has failed. Exiting ***
2020-10-28 14:00:56,916 ERROR AWS Auto Scaling Group processes will need resuming manually
2020-10-28 14:00:56,916 ERROR Kubernetes Cluster Autoscaler will need resuming manually
or
2020-10-28 14:45:22,860 INFO All k8s nodes are healthy
2020-10-28 14:45:22,860 INFO Cluster validation passed. Proceeding with node draining and termination...
2020-10-28 14:45:22,860 INFO Searching for k8s node name by instance id...
2020-10-28 14:45:22,860 INFO Could not find a k8s node name for that instance id. Exiting
2020-10-28 14:45:22,860 ERROR Encountered an error when adding taint/cordoning node
2020-10-28 14:45:22,860 ERROR Could not find a k8s node name for that instance id. Exiting
I'm running the command like this:
K8S_AUTOSCALER_ENABLED=1 K8S_AUTOSCALER_NAMESPACE="kube-system" K8S_AUTOSCALER_DEPLOYMENT="autoscaler-aws-cluster-autoscaler-chart" python eks_rolling_update.py --cluster_name <cluster_name>
Any ideas on how to proceed troubleshooting?
2020-12-15 11:57:58,155 INFO Setting the scale of ASG xx-yy-zz based on 3 outdated instances.
2020-12-15 11:57:58,155 INFO Modifying asg xx-yy-zz autoscaling to resume ...
2020-12-15 11:57:58,425 INFO No previous capacity value tags set on ASG; setting tags.
2020-12-15 11:57:58,425 INFO Saving tag to asg key: eks-rolling-update:original_capacity, value : 3...
2020-12-15 11:57:58,687 INFO Saving tag to asg key: eks-rolling-update:desired_capacity, value : 6...
2020-12-15 11:57:58,960 INFO Saving tag to asg key: eks-rolling-update:original_max_capacity, value : 10...
2020-12-15 11:57:59,243 INFO Describing autoscaling groups...
2020-12-15 11:57:59,871 INFO Current asg instance count in cluster is: 6. K8s node count should match this number
2020-12-15 11:57:59,871 ERROR '>' not supported between instances of 'NoneType' and 'int'
2020-12-15 11:57:59,872 ERROR *** Rolling update of ASG has failed. Exiting ***
2020-12-15 11:57:59,872 ERROR AWS Auto Scaling Group processes will need resuming manually
2020-12-15 11:57:59,872 ERROR Kubernetes Cluster Autoscaler will need resuming manually
I ran into this a couple of times today while running eks_rolling_update.py -c <cluster>.
As a big fan of this tool, I'm curious: does the AWS Instance Refresh for EC2 Auto Scaling announcement offer anything new here?
In the code, resuming the cluster autoscaler is hardcoded to 2 replicas. In our environment we only run a single replica. Since running multiple replicas is untested in our environment, we manually reset it to replicas=1 after running the rolling update.
It would be nice to have the option to pass in a value for cluster autoscaler replicas. I imagine that there are other people running either a single instance or more than two replicas that would benefit from this.
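A minimal sketch of the requested knob, reading the replica count from the environment instead of the hardcoded 2 (K8S_AUTOSCALER_REPLICAS is a suggested name mirroring the tool's other K8S_AUTOSCALER_* settings, not a confirmed existing option):

```python
import os

def autoscaler_resume_replicas(default=2):
    """Replica count to restore the cluster-autoscaler deployment to
    after the rolling update, configurable via the environment."""
    return int(os.environ.get("K8S_AUTOSCALER_REPLICAS", default))
```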
We run JupyterHub on EKS, and in this setup users have notebook servers they log in to and use interactively, much like a shell server. Updating the EKS nodes these notebook pods are running on would be hugely disruptive. Having a way to completely ignore some ASGs/nodes would be very helpful, so we can skip updating the nodes running our users' notebook servers.
The script timed out due to an instance taking longer than usual to terminate.
The environment variable that needs to be set to increase the number of checks is GLOBAL_MAX_RETRY. But the documentation says this is "Number of attempts of a health check". However, the termination check is not a health check, so this documentation should be corrected to something like "Number of attempts of a health or termination check".
With RUN_MODE=1, all old nodes are cordoned at the same time, which makes the AWS ELB mark the old nodes out of service. If the new nodes take time to come into service, then no healthy instances are left for a while, which causes an outage.
We tried cordoning 1 node at a time and didn't see this issue. The downside is that a pod may bounce multiple times, because it may land on an old node that hasn't been cordoned yet; some people will be fine with one pod among multiple replicas bouncing multiple times.
Can we have a RUN_MODE 5 which is the same as RUN_MODE 1, except that it does "cordon 1 node --> drain 1 node --> delete 1 node" at a time, instead of "cordon all nodes --> drain 1 node --> delete 1 node"?
This is more of a placeholder that I intend to follow up with a PR.
A number of times, I've had trouble with workloads that have special placement requirements (for example, an EBS-backed PV) when other workloads have taken up the space, leaving me with a pod stuck in Pending.
I'm not sure if it's the run mode I use (2) or something else, but it might be good to have the option to pass in some sort of "overflow" value that adds a number of extra instances to each ASG's calculated instance count. This would hopefully allow HPAs to scale out if necessary and avoid pods getting stuck in Pending.
Afterwards, the cluster autoscaler would work out whether anything needed to be adjusted.
Hello,
Just wondering if anybody else experiences an issue we often face when using this tool.
We run the tool, which pauses the cluster autoscaler and then proceeds to cordon/drain ASGs one at a time.
We often find that pods become unschedulable due to insufficient memory. We always set memory request=limit (Java), but, as is to be expected, when pods are evicted they can spread over the new nodes in such a way that no single node has enough free space to run a large pod.
What, if anything, could we do to stop this? One option is to not disable the autoscaler, which the eks-rolling-update docs say is optional to disable, but what negative issues would occur if we did that?