hellofresh / eks-rolling-update
EKS Rolling Update is a utility for updating the launch configuration of worker nodes in an EKS cluster.
License: Apache License 2.0
The documentation in README.md states that ASG_DESIRED_STATE_TAG, ASG_ORIG_CAPACITY_TAG, and ASG_ORIG_MAX_CAPACITY_TAG can be specified from the environment, but in https://github.com/hellofresh/eks-rolling-update/blob/master/eksrollup/config.py these variables are hardcoded.
Is there a reason for this?
Even though the tags are temporary, I'd like to customize them.
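A minimal sketch of what the requested change could look like in config.py: read each tag name from the environment and fall back to the current defaults (the default values below are taken from this tool's own log output; the exact structure of config.py is an assumption).

```python
import os

# Hedged sketch, not the project's actual config.py: allow the ASG tag names
# to be overridden from the environment, defaulting to the hardcoded values
# that appear in eks-rolling-update's logs.
ASG_DESIRED_STATE_TAG = os.environ.get(
    'ASG_DESIRED_STATE_TAG', 'eks-rolling-update:desired_capacity')
ASG_ORIG_CAPACITY_TAG = os.environ.get(
    'ASG_ORIG_CAPACITY_TAG', 'eks-rolling-update:original_capacity')
ASG_ORIG_MAX_CAPACITY_TAG = os.environ.get(
    'ASG_ORIG_MAX_CAPACITY_TAG', 'eks-rolling-update:original_max_capacity')
```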
Setting the env var K8S_AUTOSCALER_ENABLED to False will not try to scale down the autoscaler (expected), but the tool still tries to resume it at the end of the process (not expected), failing with this error:
2019-09-10 17:49:57,012 INFO Describing autoscaling groups...
2019-09-10 17:49:57,468 INFO
2019-09-10 17:49:57,468 INFO **** Starting rolling update for autoscaling group eks_tmp_cluster ****
2019-09-10 17:49:57,468 INFO Instance id i-060706ffa2de48e75 : OK
2019-09-10 17:49:57,468 INFO Instance id i-0bf549dc2949f1d67 : OK
2019-09-10 17:49:57,468 INFO Found 0 outdated instances
2019-09-10 17:49:57,469 INFO
2019-09-10 17:49:57,469 INFO **** Starting rolling update for autoscaling group eks_tmp_cluster-green ****
2019-09-10 17:49:57,469 INFO Instance id i-06f0d8f693e0a3340 : OK
2019-09-10 17:49:57,469 INFO Instance id i-0e5b6c0c6bae84131 : OK
2019-09-10 17:49:57,469 INFO Found 0 outdated instances
2019-09-10 17:49:57,469 INFO All asgs processed
2019-09-10 17:49:57,984 INFO Resuming k8s autoscaler...
2019-09-10 17:49:57,984 INFO Missing the required parameter `name` when calling `patch_namespaced_deployment`
2019-09-10 17:49:57,984 INFO *** Rolling update of asg has failed. Exiting ***
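A minimal sketch of the guard this report implies should exist: the resume step at the end of the run should be skipped when the autoscaler was never paused. The function and setting names (`finish_run`, `modify_k8s_autoscaler`, the config dict) are illustrative, not the project's exact code.

```python
# Hypothetical sketch of the expected behaviour: only resume the k8s
# autoscaler at the end of the run if it was actually paused at the start.
def finish_run(app_config, modify_k8s_autoscaler):
    if app_config['K8S_AUTOSCALER_ENABLED']:
        modify_k8s_autoscaler('resume')
    # otherwise there is nothing to resume, since nothing was paused
```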
Hi, this is Srinivasa. I created an EKS cluster in AWS using eksctl. By default it creates the cluster with a public API server endpoint, but I need to change this to private. After switching it to private from the AWS console, I can no longer access the cluster from the machine where I installed kubectl and eksctl; I get an error along the lines of `tcp <ip>:443 i/o timeout`. That machine is in a private subnet, and all my worker nodes are in private subnets too, but I don't know why I am getting this error. Please help me troubleshoot; I will provide any additional information needed.
EKS version: 1.15
Thank you
Even after increasing the values of GLOBAL_HEALTH_RETRY and/or GLOBAL_HEALTH_WAIT, I get the error below:
2021-01-29 03:00:21,729 INFO Found 3 outdated instances
2021-01-29 03:00:22,349 INFO Getting k8s nodes...
2021-01-29 03:00:22,743 INFO Current k8s node count is 3
2021-01-29 03:00:22,743 INFO Setting the scale of ASG dev-gvh-worker19940502290119910100001011 based on 3 outdated instances.
2021-01-29 03:00:22,743 INFO Modifying asg dev-gvh-worker19940502290119910100001011 autoscaling to resume ...
2021-01-29 03:00:22,976 INFO Found previous desired capacity value tag set on asg from a previous run.
2021-01-29 03:00:22,976 INFO Maintaining previous capacity of 3 to not overscale.
2021-01-29 03:00:22,976 INFO Waiting for 90 seconds before validating cluster health...
2021-01-29 03:01:52,981 INFO Checking asg dev-gvh-worker19940502290119910100001011 instance count...
2021-01-29 03:01:53,258 INFO Asg dev-gvh-worker19940502290119910100001011 does not have enough running instances to proceed
2021-01-29 03:01:53,258 INFO Actual instances: 3 Desired instances: 4
2021-01-29 03:01:53,258 INFO Validation failed for asg dev-gvh-worker19940502290119910100001011. Not enough instances online.
2021-01-29 03:01:53,258 INFO Exiting since ASG healthcheck failed after 1 attempts
2021-01-29 03:01:53,258 ERROR ASG healthcheck failed
2021-01-29 03:01:53,258 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-01-29 03:01:53,258 ERROR AWS Auto Scaling Group processes will need resuming manually
As discussed in kubernetes/kubernetes#65013, cordoning the nodes while rolling the node pool removes all of them from the load balancer, which might cause downtime. Instead, we can taint the nodes to make them non-schedulable.
Update: this might not cause downtime as the scale up of nodes in the ASG is done before cordoning of the nodes.
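The taint-based alternative could be sketched as below: rather than setting `spec.unschedulable` (which is what cordoning does), apply a NoSchedule taint via a node patch. The taint key is a made-up example, and wiring this into the tool via the Kubernetes Python client's `patch_node` is an assumption, not the project's current behaviour.

```python
# Sketch of tainting instead of cordoning. The taint key below is
# hypothetical; any unique key with effect NoSchedule would do.
def taint_patch(existing_taints=None):
    """Build a node patch body that appends a NoSchedule taint."""
    taints = list(existing_taints or [])
    taints.append({
        'key': 'eks-rolling-update/draining',  # hypothetical key
        'value': 'true',
        'effect': 'NoSchedule',
    })
    return {'spec': {'taints': taints}}

# Applied with the Kubernetes Python client this would look roughly like:
#   kubernetes.client.CoreV1Api().patch_node(node_name,
#                                            taint_patch(node.spec.taints))
```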
Hi!
I want to run the eks-rolling-update tool with least-privilege permissions in my environment. I created an IAM role with the required permissions listed in the README.md file and found a typo in one of them: ec2:DescribeInstances should have an "s" at the end.
If I am not wrong, the "autoscaling:DescribeLaunchConfigurations" permission is also required, in addition to those listed in the README, for this utility to work with launch configurations.
Did anyone actually test this tool with the permissions listed in the README file?
Also, I'm planning to grant least-privilege permissions on the Kubernetes side; I have created this ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eks-rolling
rules:
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["delete", "get", "list", "patch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "replicasets"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  resourceNames: ["cluster-autoscaler"]
  verbs: ["update", "get"]
Could anyone please confirm if the permissions listed in this ClusterRole are sufficient for this utility to work properly?
Many thanks
K8S_AUTOSCALER_ENABLED and DRY_RUN are used as bool values in the code, while CLUSTER_HEALTH_WAIT, GLOBAL_MAX_RETRY, GLOBAL_HEALTH_WAIT, BETWEEN_NODES_WAIT, and RUN_MODE are used as int values. Since environment variables always arrive as strings, the code will need to cast them appropriately.
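A small sketch of the casting this issue asks for. Note that a naive `bool(os.environ['DRY_RUN'])` is wrong, since any non-empty string (including `"false"`) is truthy; the helper names below are illustrative, not the project's actual code.

```python
import os

# Hedged sketch of env-var casting: parse booleans from common truthy
# spellings and integers via int(), with explicit defaults.
def env_bool(name, default=False):
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ('1', 'true', 'yes', 'on')

def env_int(name, default=0):
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default
```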
Currently, if you need some sort of proxy to access your K8s master/API server, eks-rolling-update doesn't seem to work. The kubectl bits work fine because kubectl supports the proxy environment variables out of the box. With a bit of digging, this appears to be because the Kubernetes Python client library does not. There is a suggested implementation at kubernetes-client/python#333 (comment) which could possibly be ported to eks-rolling-update to add support.
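The workaround could be sketched as below: read the standard proxy variables ourselves and hand the result to the client configuration (the Kubernetes Python client exposes a `proxy` field on its `Configuration` object but does not populate it from the environment). The exact wiring into eks-rolling-update is an assumption based on the upstream discussion.

```python
import os

# Sketch: resolve a proxy URL from the conventional env vars, which the
# kubernetes Python client ignores on its own.
def proxy_from_env(scheme='https'):
    for name in (scheme.upper() + '_PROXY', scheme.lower() + '_proxy'):
        value = os.environ.get(name)
        if value:
            return value
    return None

# Usage (assumption, following kubernetes-client/python#333):
#   configuration = kubernetes.client.Configuration.get_default_copy()
#   configuration.proxy = proxy_from_env()
```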
I'm currently setting up EKS clusters with worker groups based on launch templates (using terraform-aws-eks).
Gave this tool a spin, and so far all attempts aborted while waiting for the first node to detach from the ASG.
Now of course it's possible to increase GLOBAL_MAX_RETRY, and that works, typically around the 16th attempt or so. But it raises some questions. See the fragment of aws autoscaling describe-auto-scaling-groups output below: wouldn't checking for LifecycleState=Detaching suffice? It could speed up the rolling update process by a significant amount.
{
  "InstanceId": "i-nnnnnnnnn",
  "AvailabilityZone": "eu-west-1c",
  "LifecycleState": "Detaching",
  "HealthStatus": "Unhealthy",
  "LaunchTemplate": {
    "LaunchTemplateId": "lt-nnnnnnnnnn",
    "LaunchTemplateName": "apps-ng-annnnnnnnn",
    "Version": "2"
  },
  "ProtectedFromScaleIn": false
}
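The suggested check could be sketched as a predicate over the parsed describe-auto-scaling-groups output: treat an instance as already on its way out as soon as its lifecycle state indicates detachment or termination, instead of retrying until it disappears from the group. The set of states below follows the documented ASG lifecycle; the function name is illustrative.

```python
# Sketch: short-circuit the detach wait by inspecting LifecycleState in the
# instance dicts returned by describe-auto-scaling-groups.
LEAVING_STATES = {'Detaching', 'Detached', 'Terminating', 'Terminated'}

def instance_is_leaving(instance):
    """True if the ASG has already begun removing this instance."""
    return instance.get('LifecycleState') in LEAVING_STATES
```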
This isn't a bug or an issue.
What is the meaning of the log line below, which sometimes appears during a script run? Do we need to find out which request is being throttled?
I0209 21:27:49.835447 59139 request.go:645] Throttling request took 2.429459004s, request: GET:https://xxxxxxxxxxxxxxx.gr7.us-west-2.eks.amazonaws.com/api/v1/namespaces/xxxxx-yyyy/pods/aaa-bbb-ccc-dddd-6b4f44c7f8-vtsr2
I have installed the latest Python 3 and tried to install this module with both pip and pip3.
trial-1
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update
ERROR: Could not find a version that satisfies the requirement eks-rolling-update (from versions: none)
ERROR: No matching distribution found for eks-rolling-update
trial-2
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install "eks-rolling-update==1.0.96"
ERROR: Could not find a version that satisfies the requirement eks-rolling-update==1.0.96 (from versions: none)
ERROR: No matching distribution found for eks-rolling-update==1.0.96
trial-3 - downloaded the tar.gz file and tried installing through pip3 install
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update-1.0.96.tar.gz
Processing ./eks-rolling-update-1.0.96.tar.gz
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-n36f_y60/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-n36f_y60/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-4fa6hv6b
cwd: /tmp/pip-req-build-n36f_y60/
Complete output (32 lines):
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-req-build-n36f_y60/setup.py", line 7, in
setuptools.setup()
File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 129, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.6/distutils/core.py", line 121, in setup
dist.parse_config_files()
File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 494, in parse_config_files
ignore_option_errors=ignore_option_errors)
trial-4
root@ip-10-XX-XX-XX:/home/ubuntu/eks-rolling-update# pip3 install eks-rolling-update --trusted-host files.pythonhosted.org --trusted-host pypi.org --trusted-host pypi.python.org
ERROR: Could not find a version that satisfies the requirement eks-rolling-update (from versions: none)
ERROR: No matching distribution found for eks-rolling-update
I am not able to find any other way to install this module. I have the latest TLS (1.3); if you could help me install this module, it would be a great help.
Since kubernetes released v12.0.0 of the Python client library, I have this error:
null_resource.eks-rolling-update (local-exec): 2020-10-16 20:17:02,122 INFO Describing autoscaling groups...
null_resource.eks-rolling-update (local-exec): 2020-10-16 20:17:02,544 INFO Pausing k8s autoscaler...
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 157, in _new_conn
null_resource.eks-rolling-update (local-exec): (self._dns_host, self.port), self.timeout, **extra_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
null_resource.eks-rolling-update (local-exec): raise err
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
null_resource.eks-rolling-update (local-exec): sock.connect(sa)
null_resource.eks-rolling-update (local-exec): ConnectionRefusedError: [Errno 111] Connection refused
null_resource.eks-rolling-update (local-exec): During handling of the above exception, another exception occurred:
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672, in urlopen
null_resource.eks-rolling-update (local-exec): chunked=chunked,
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
null_resource.eks-rolling-update (local-exec): conn.request(method, url, **httplib_request_kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1244, in request
null_resource.eks-rolling-update (local-exec): self._send_request(method, url, body, headers, encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1290, in _send_request
null_resource.eks-rolling-update (local-exec): self.endheaders(body, encode_chunked=encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1239, in endheaders
null_resource.eks-rolling-update (local-exec): self._send_output(message_body, encode_chunked=encode_chunked)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 1026, in _send_output
null_resource.eks-rolling-update (local-exec): self.send(msg)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/http/client.py", line 966, in send
null_resource.eks-rolling-update (local-exec): self.connect()
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 184, in connect
null_resource.eks-rolling-update (local-exec): conn = self._new_conn()
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 169, in _new_conn
null_resource.eks-rolling-update (local-exec): self, "Failed to establish a new connection: %s" % e
null_resource.eks-rolling-update (local-exec): urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc53687a3d0>: Failed to establish a new connection: [Errno 111] Connection refused
null_resource.eks-rolling-update (local-exec): During handling of the above exception, another exception occurred:
null_resource.eks-rolling-update (local-exec): Traceback (most recent call last):
null_resource.eks-rolling-update (local-exec): File "/usr/local/bin/eks_rolling_update.py", line 8, in <module>
null_resource.eks-rolling-update (local-exec): sys.exit(main())
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/eksrollup/cli.py", line 282, in main
null_resource.eks-rolling-update (local-exec): modify_k8s_autoscaler("pause")
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/eksrollup/lib/k8s.py", line 83, in modify_k8s_autoscaler
null_resource.eks-rolling-update (local-exec): body
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py", line 4511, in patch_namespaced_deployment
null_resource.eks-rolling-update (local-exec): return self.patch_namespaced_deployment_with_http_info(name, namespace, body, **kwargs) # noqa: E501
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/apps_v1_api.py", line 4636, in patch_namespaced_deployment_with_http_info
null_resource.eks-rolling-update (local-exec): collection_formats=collection_formats)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
null_resource.eks-rolling-update (local-exec): _preload_content, _request_timeout, _host)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
null_resource.eks-rolling-update (local-exec): _request_timeout=_request_timeout)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 413, in request
null_resource.eks-rolling-update (local-exec): body=body)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 300, in PATCH
null_resource.eks-rolling-update (local-exec): body=body)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 172, in request
null_resource.eks-rolling-update (local-exec): headers=headers)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 80, in request
null_resource.eks-rolling-update (local-exec): method, url, fields=fields, headers=headers, **urlopen_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 171, in request_encode_body
null_resource.eks-rolling-update (local-exec): return self.urlopen(method, url, **extra_kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 330, in urlopen
null_resource.eks-rolling-update (local-exec): response = conn.urlopen(method, u.request_uri, **kw)
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 762, in urlopen
null_resource.eks-rolling-update (local-exec): **response_kw
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 720, in urlopen
null_resource.eks-rolling-update (local-exec): method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
null_resource.eks-rolling-update (local-exec): File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 436, in increment
null_resource.eks-rolling-update (local-exec): raise MaxRetryError(_pool, url, error or ResponseError(cause))
null_resource.eks-rolling-update (local-exec): urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /apis/apps/v1/namespaces/kube-system/deployments/cluster-autoscaler (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc53687a3d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Reverting manually to v11.0.0 works as expected.
During our node rotation, AZRebalance rebalanced the nodes; at that time both the desired and running node counts were 6.
The ASG added an extra node to balance instances across AZs, regardless of the desired count.
As soon as the new node started, the ASG found 7 nodes running against a desired count of 6, so it killed an old node (because of the OldestLaunchConfiguration termination policy) to match the desired count.
After the last worker node rotation, eks-rolling-update changed the ASG from 6 to 4, thinking the activity was complete, which caused one node to terminate abruptly.
So for the next cluster upgrade we updated the script locally to include AZRebalance in the suspended processes. It would be good if that fix were included here as well.
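The local fix described above could be sketched as follows. AZRebalance is a real AWS scaling-process name; which other processes eks-rolling-update suspends today is an assumption here, and the boto3 call is shown as a comment.

```python
# Sketch: suspend AZRebalance alongside the scaling processes so the ASG
# cannot terminate old nodes mid-rotation to rebalance across AZs.
# Exactly which processes the tool suspends today is an assumption.
PROCESSES_TO_SUSPEND = ['Launch', 'Terminate', 'AZRebalance']

# With boto3 this would be applied as:
#   boto3.client('autoscaling').suspend_processes(
#       AutoScalingGroupName=asg_name,
#       ScalingProcesses=PROCESSES_TO_SUSPEND,
#   )
```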
I am getting this issue, any clue?
I already removed the tags from the autoscaling group after the first failed run.
(venv) [centos@ip-10-16-35-7 eks-rolling-update]$ ./roll.sh
2021-02-02 19:05:45,792 INFO Describing autoscaling groups...
2021-02-02 19:05:46,067 INFO Pausing k8s autoscaler...
2021-02-02 19:05:46,164 INFO K8s autoscaler modified to replicas: 0
2021-02-02 19:05:46,164 INFO *** Checking for nodes older than 7 days in autoscaling group ms-dev-apps-general_purpose_xlarge20200617203556813100000014 ***
2021-02-02 19:05:46,353 INFO Instance id i-051b2437c0caa8d09 : OK
2021-02-02 19:05:46,729 INFO Instance id i-07f5e4c9aa84b8579 : OK
2021-02-02 19:05:46,860 INFO Instance id i-0867d2b14b2ef7b95 : OK
2021-02-02 19:05:46,900 ERROR list index out of range
2021-02-02 19:05:46,900 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-02 19:05:46,900 ERROR AWS Auto Scaling Group processes will need resuming manually
2021-02-02 19:05:46,900 ERROR Kubernetes Cluster Autoscaler will need resuming manually
This is my config
export ASG_NAMES="ms-dev-apps-general_purpose_xlarge20200617203556813100000014"
export K8S_AUTOSCALER_ENABLED=True
export K8S_AUTOSCALER_NAMESPACE="kube-system"
export K8S_AUTOSCALER_DEPLOYMENT="cluster-autoscaler-autodetect-aws-cluster-autoscaler"
export K8S_AUTOSCALER_REPLICAS=1
export EXTRA_DRAIN_ARGS="--delete-local-data=true --disable-eviction=true --force=true --grace-period=10 --ignore-daemonsets=true"
export MAX_ALLOWABLE_NODE_AGE=7
export RUN_MODE=4
I am using the latest version
Allow the user to explicitly select the context to be used from the kubeconfig. Given the number of tools that can alter the current-context
of the kubeconfig file, it isn't always safe to rely on it.
Description
If draining a node fails part way through (due to, say, a Kube API error, or a Pod which can't be evicted), the tool ends up in an unrecoverable state that requires manually editing the ASG to remove the tags.
Details
This happens because the tool loses track of the desired capacity as it gradually drains nodes one by one, rather than keeping track of where it got to using the tags.
Take a scenario like:
ASG original_capacity: 3
Outdated nodes: 3
The tool scales the ASG up and starts draining; after the first node has been drained and terminated, instances online = 5. When you re-run the tool it sees:
desired_capacity = 6 (from the ASG tag)
current node count = 5 (from the cluster)
and then fails per the below:
2020-01-26 16:44:52,264 INFO Current k8s node count is 5
2020-01-26 16:44:53,444 INFO Setting the scale of ASG test-asg based on 2 outdated instances.
2020-01-26 16:44:53,444 INFO Modifying asg test-asg autoscaling to resume ...
2020-01-26 16:44:53,559 INFO Found previous desired capacity value tag set on asg from a previous run.
2020-01-26 16:44:53,559 INFO Maintaining previous capacity of 5 to not overscale.
2020-01-26 16:44:53,559 INFO Describing autoscaling groups...
2020-01-26 16:44:53,828 INFO Current asg instance count in cluster is: 5. K8s node count should match this number
2020-01-26 16:44:53,828 INFO Checking asg test-asg instance count...
2020-01-26 16:44:53,901 INFO Asg test-asg does not have enough running instances to proceed
2020-01-26 16:44:53,901 INFO Actual instances: 5 Desired instances: 6
2020-01-26 16:44:53,901 INFO Validation failed for asg test-asg. Not enough instances online.
2020-01-26 16:44:53,901 INFO Exiting since ASG healthcheck failed
2020-01-26 16:44:53,901 INFO ASG healthcheck failed
2020-01-26 16:44:53,901 INFO *** Rolling update of asg has failed. Exiting ***
Expected Behaviour
In the case above, when the tool re-runs it should really see desired_capacity = 5, because the first node was successfully drained and terminated and is no longer expected to be healthy.
Resolving this somehow would mean re-running the tool would be able to resume where it left off.
Possible Solution
I would suggest that immediately prior to terminating each instance, the desired_capacity tag value should be decremented to represent the intended cluster state. I think this would mean that when/if the tool is re-run, it would be able to recover from where it left off and drain/terminate the remaining nodes.
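The proposed fix could be sketched as a pure transformation over the ASG's tag list, applied just before each termination. The tag key matches the one seen in this tool's logs; the helper itself is hypothetical.

```python
# Sketch of the proposed fix: decrement the desired-capacity tag before
# terminating each instance so a re-run resumes from the true state.
DESIRED_TAG = 'eks-rolling-update:desired_capacity'

def decremented_desired_tag(tags):
    """Return a copy of the tag list with the desired capacity reduced by 1."""
    updated = []
    for tag in tags:
        if tag['Key'] == DESIRED_TAG:
            tag = dict(tag, Value=str(int(tag['Value']) - 1))
        updated.append(tag)
    return updated
```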
I was looking for where #117 was released and discovered that the latest build of master failed: https://github.com/hellofresh/eks-rolling-update/runs/2716443115?check_suite_focus=true
Sadly logs are no longer present :(
Hi there,
I've been using this tool for a while, and recently the dry-run behaviour changed. Unfortunately, I can't say which version introduced this :-(
I have RUN_MODE set to 1, but when I run the tool with the -p CLI flag, it runs with RUN_MODE=4.
Expected behaviour:
dry run pays attention to the RUN_MODE variable
Running with the following
K8S_AUTOSCALER_ENABLED=false python3 ./eks_rolling_update.py -c sandbox
At the end it still tries to resume the autoscaler:
2019-10-23 10:37:47,698 INFO All asgs processed
2019-10-23 10:37:47,755 INFO Resuming k8s autoscaler...
2019-10-23 10:37:47,755 INFO Missing the required parameter `name` when calling `patch_namespaced_deployment`
2019-10-23 10:37:47,755 INFO *** Rolling update of asg has failed. Exiting ***
We struggle to update clusters at times because changing circumstances can require our cluster to scale up the number of nodes. If cluster-autoscaler wasn't shut down, we could allow our cluster to scale up for increases in demand during the process.
I'm not sure of the best way to handle this, but it'd be fantastic if we could.
I think the primary issue with leaving the autoscaler on is that it will prefer to shut down nodes if nothing has scheduled there. This means that the nodes that get spun up before rotating nodes will be shut down prematurely. To combat this, we could annotate those nodes with "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"
. It might be hard to determine which nodes are "new", but disabling scale down on all nodes that match the new launch configuration should be a good heuristic. If new nodes join, they'll also need to be annotated, however. I believe it's also possible to disable scale-down entirely, but that would require modifying the autoscaler deployment so that's less attractive for that reason.
The next issue would be the desired count increasing while nodes are being rotated. This would mess with eks-rolling-update, as the original count will have diverged from where it was. eks-rolling-update could perhaps tolerate increases to that number and update the ASG tags to match. If the number went down unexpectedly, that would still be an issue causing the tool to abort.
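The annotation approach discussed above could be sketched as below. The annotation key is the real cluster-autoscaler one; wiring it into the tool via `patch_node` is an assumption about how this would be implemented.

```python
# Sketch: mark nodes matching the new launch configuration so
# cluster-autoscaler will not scale them down mid-rotation.
SCALE_DOWN_DISABLED = 'cluster-autoscaler.kubernetes.io/scale-down-disabled'

def scale_down_disabled_patch():
    """Build a node patch body carrying the scale-down-disabled annotation."""
    return {'metadata': {'annotations': {SCALE_DOWN_DISABLED: 'true'}}}

# e.g. kubernetes.client.CoreV1Api().patch_node(node_name,
#                                               scale_down_disabled_patch())
```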
I searched for related issues, but didn't see anything, so I apologize if this has already been noted.
It would be great to document what level of access eks-rolling-update needs to run successfully. This is useful when the script is run by automation with a specific IAM role.
Some of the nodes in an EKS cluster might not be controlled by an ASG, for instance when a few nodes are created via an ASG and others are managed by Spotinst.
eks-rolling-update should exclude those nodes from the rolling operation.
Ran into the error displayed below.
Command invoked:
GLOBAL_MAX_RETRY=30 python eks_rolling_update.py --cluster_name apps-1
Kind of odd, as a similar operation recently worked fine.
2019-12-12 06:54:58,337 INFO Found 2 outdated instances
2019-12-12 06:54:59,462 INFO Getting k8s nodes...
2019-12-12 06:55:00,790 INFO Current k8s node count is 2
2019-12-12 06:55:00,790 INFO Setting the scale of ASG apps-1-ng-a20191211120754245000000001 based on 2 outdated instances.
2019-12-12 06:55:00,791 INFO Modifying asg apps-1-ng-a20191211120754245000000001 autoscaling to resume ...
2019-12-12 06:55:01,076 INFO No previous capacity value tags set on ASG; setting tags.
2019-12-12 06:55:01,077 INFO Saving tag to asg key: eks-rolling-update:original_capacity, value : 2...
2019-12-12 06:55:01,374 INFO Saving tag to asg key: eks-rolling-update:desired_capacity, value : 4...
2019-12-12 06:55:01,601 INFO Saving tag to asg key: eks-rolling-update:original_max_capacity, value : 5...
2019-12-12 06:55:01,933 INFO Setting asg desired capacity from 2 to 4 and max size to 5...
2019-12-12 06:55:02,189 INFO Waiting for 90 seconds for ASG to scale before validating cluster health...
2019-12-12 06:56:32,191 INFO Describing autoscaling groups...
2019-12-12 06:56:33,206 INFO Current asg instance count in cluster is: 4. K8s node count should match this number
2019-12-12 06:56:33,207 INFO Checking asg apps-1-ng-a20191211120754245000000001 instance count...
2019-12-12 06:56:33,397 INFO Asg apps-1-ng-a20191211120754245000000001 scaled OK
2019-12-12 06:56:33,398 INFO Actual instances: 4 Desired instances: 4
2019-12-12 06:56:33,398 INFO '<' not supported between instances of 'int' and 'str'
2019-12-12 06:56:33,398 INFO *** Rolling update of asg has failed. Exiting ***
Edit: I use GLOBAL_MAX_RETRY because of #22. A quick scan through the codebase suggests that's the culprit, as the value is parsed as a str.
First of all, this project has helped me a lot in performing rolling upgrades of nodes on our Kubernetes cluster.
I have noticed that on Kubernetes Version 1.19 and above I get the following error message and the rolling upgrade process ends.
2021-04-22 15:35:16,761 INFO Checking k8s expected nodes are online after asg scaled up...
2021-04-22 15:35:16,787 ERROR 'NoneType' object is not iterable
I was able to get around this by upgrading the Kubernetes client library to the latest version on PyPI.
I am opening this issue to help others who might have been in a similar situation.
Currently, eks-rolling-update only supports launch configurations. Doing a plan (dry run) over a cluster created with a launch template instead (using eksctl) will throw this error:
2019-09-06 17:50:31,413 INFO Describing autoscaling groups...
2019-09-06 17:50:31,979 INFO *** Checking autoscaling group eksctl-nicolas-test-cluster-nodegroup-ng-1-NodeGroup-1476EK4LIRYXE ***
Traceback (most recent call last):
File "eks_rolling_update.py", line 152, in <module>
plan_asgs(filtered_asgs)
File "/app/lib/aws.py", line 248, in plan_asgs
asg_lc_name = asg['LaunchConfigurationName']
KeyError: 'LaunchConfigurationName'
Any plan on supporting the new and shiny launch templates? Thanks!
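A version-agnostic lookup could avoid the KeyError above by falling back to the LaunchTemplate key (or the MixedInstancesPolicy variant) when LaunchConfigurationName is absent. The field names below follow the AWS Auto Scaling API response shape; the function is illustrative, not the project's actual plan_asgs code.

```python
# Sketch: resolve either the launch configuration or the launch template
# identifier from a describe-auto-scaling-groups ASG dict.
def asg_launch_identifier(asg):
    if 'LaunchConfigurationName' in asg:
        return ('launch-configuration', asg['LaunchConfigurationName'])
    lt = asg.get('LaunchTemplate') or asg.get('MixedInstancesPolicy', {}) \
        .get('LaunchTemplate', {}).get('LaunchTemplateSpecification')
    if lt:
        return ('launch-template', lt['LaunchTemplateId'], lt.get('Version'))
    raise ValueError('ASG has neither a launch configuration nor a launch template')
```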
We have a big and busy EKS cluster, with nodes joining and leaving many times a day (spot instances failing or being replaced). We try to update each ASG separately with the ASG_NAMES setting. The problem is that eks-rolling-update always checks the whole cluster for node count, and it frequently fails because the count does not match the expected value.
It should only monitor the selected ASG(s) for expected instance count.
2021-02-10 16:26:57,425 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Current k8s node count is 94
2021-02-10 16:26:57,426 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:18,198 INFO Getting k8s nodes...
2021-02-10 16:27:19,341 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Current k8s node count is 94
2021-02-10 16:27:19,342 INFO Waiting for k8s nodes to reach count 92...
2021-02-10 16:27:40,119 INFO Getting k8s nodes...
2021-02-10 16:27:41,470 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Current k8s node count is 94
2021-02-10 16:27:41,471 INFO Waiting for k8s nodes to reach count 92...
...
2021-02-10 16:28:01,472 INFO Validation failed for cluster *****. Didn't reach expected node count 92.
2021-02-10 16:28:01,472 INFO Exiting since ASG healthcheck failed after 2 attempts
2021-02-10 16:28:01,472 ERROR ASG healthcheck failed
2021-02-10 16:28:01,472 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-10 16:28:01,472 ERROR AWS Auto Scaling Group processes will need resuming manually
I used the environment variable ASG_NAMES to select a specific Auto Scaling Group, but then the expected-node-count condition is never met: it counts nodes from all ASGs.
Log example:
2021-01-08 13:49:56,753 INFO Current asg instance count in cluster is: 6. K8s node count should match this number
2021-01-08 13:49:56,754 INFO Checking k8s expected nodes are online after asg scaled up...
2021-01-08 13:49:57,402 INFO Getting k8s nodes...
2021-01-08 13:49:58,343 INFO Current k8s node count is 21
2021-01-08 13:49:58,343 INFO Current k8s node count is 21
2021-01-08 13:49:58,343 INFO Waiting for k8s nodes to reach count 6...
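The fix both of these reports ask for could be sketched as counting only the nodes backed by instances in the selected ASG(s), instead of comparing against the whole cluster. The node dict shape (an `instance_id` field) is an assumption for illustration.

```python
# Sketch: when ASG_NAMES is set, validate health against the selected
# ASGs' instances only, not the whole cluster's node count.
def count_selected_nodes(nodes, selected_instance_ids):
    """Count k8s nodes whose backing EC2 instance is in the selected ASGs."""
    return sum(
        1 for node in nodes
        if node.get('instance_id') in selected_instance_ids
    )
```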
Currently, after setting the desired capacity of the ASG, the script simply waits CLUSTER_HEALTH_WAIT seconds once (without any retries) before checking whether all instances have come online. This works in the best case, but we observe a good amount of variance in how long AWS takes to bring instances online (in the example that led to this issue, it took 9 minutes).
I know I can increase CLUSTER_HEALTH_WAIT to 600s, but then the script always waits 10 minutes, which is a waste of time. So my request is to add a retry, so that we can have a worst-case timeout without increasing the rollout time in the best case.
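The requested retry could be sketched as a poll loop: check health, and sleep CLUSTER_HEALTH_WAIT between attempts up to a retry limit, so the worst-case timeout is `wait_seconds * retries` without penalizing the fast case. The retry-count setting name is a hypothetical companion to the existing one.

```python
import time

# Sketch: poll cluster health in intervals instead of a single fixed sleep.
def wait_for_scale(is_healthy, wait_seconds, retries, sleep=time.sleep):
    """Return True as soon as is_healthy() passes, False after all retries."""
    for attempt in range(1, retries + 1):
        if is_healthy():
            return True
        if attempt < retries:
            sleep(wait_seconds)
    return False
```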
The K8S_CONTEXT environment variable is used when setting up the Python Kubernetes API client, but the actual node draining is performed by shelling out to kubectl rather than using the API, and the --context flag (and the K8S_CONTEXT variable) are not passed when doing so. You can work around this by using the EXTRA_DRAIN_ARGS variable, but it isn't documented that you need to do so, and it probably shouldn't be necessary at all.
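The fix I'd expect looks roughly like this: forward K8S_CONTEXT into the kubectl command line. A sketch only; the tool's real command construction differs, and the function name is illustrative.

```python
import os

def build_drain_command(node_name, extra_drain_args=()):
    """Build the kubectl drain invocation, forwarding K8S_CONTEXT as
    --context so the shell-out talks to the same cluster as the
    Python API client."""
    cmd = ["kubectl", "drain", node_name, "--ignore-daemonsets"]
    context = os.environ.get("K8S_CONTEXT")
    if context:
        cmd += ["--context", context]
    cmd += list(extra_drain_args)
    return cmd  # callers would hand this to subprocess.call()
```

Until something like this lands, setting EXTRA_DRAIN_ARGS="--context my-context" achieves the same effect.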
While the script was running, it suddenly threw: An error occurred (RequestExpired) when calling the DescribeInstances operation: Request has expired.
Hi
I have been looking into this product and testing it, and I am hitting a stumbling block: the Kubernetes Python client cannot be configured, even though it is installed. Is this a known issue, or is there any way we can dig deeper into what the Kubernetes Python client's dependencies are?
[ root$] docker run -ti --rm -e AWS_DEFAULT_REGION -v "/root/.aws/config" -v "/root/.kube/us-gpd" eks-rolling-update:latest -c gpdeks1
2021-02-25 02:31:16,139 INFO Describing autoscaling groups...
2021-02-25 02:31:16,444 ERROR Could not configure kubernetes python client
2021-02-25 02:31:16,444 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-02-25 02:31:16,444 ERROR AWS Auto Scaling Group processes will need resuming manually
Thanks
Hi!
We've implemented eks-rolling-update script as a separate stage in our CI (Gitlab).
There are 2 clusters involved:
Once the script is executed inside a Gitlab runner, we receive the following permissions-related error:
$ eks_rolling_update.py --cluster_name ${TF_VAR_cluster_name}
2020-12-10 13:28:28,187 INFO Describing autoscaling groups...
2020-12-10 13:28:28,194 INFO Pausing k8s autoscaler...
2020-12-10 13:28:28,203 INFO Scaling of k8s autoscaler failed. Error code was Forbidden, {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.apps \"cluster-autoscaler\" is forbidden: User \"system:serviceaccount:eks:default\" cannot patch resource \"deployments\" in API group \"apps\" in the namespace \"kube-system\"","reason":"Forbidden","details":{"name":"cluster-autoscaler","group":"apps","kind":"deployments"},"code":403}. Exiting.
That user "system:serviceaccount:eks:default" belongs to the "gitlab-runners" cluster, not to "dev" (the eks namespace exists only in the "gitlab-runners" cluster). Moreover, if we get inside this Gitlab runner's container and scale the autoscaler manually, everything works fine, and the deployment in the "dev" cluster scales up and down.
That means the kubeconfig file and AWS credentials are configured properly.
Note that eks_rolling_update.py also works fine locally (with the same variables and creds used in CI).
Below is our eks-rolling-upgrade stage in Gitlab CI (aws cli, kubectl and eks-rolling-update are already preinstalled in the image):
upgrade:
  stage: rolling-upgrade
  variables:
    AWS_DEFAULT_REGION: "eu-west-1"
    K8S_AUTOSCALER_ENABLED: "true"
    GLOBAL_MAX_RETRY: 20
    K8S_AUTOSCALER_NAMESPACE: "kube-system"
    K8S_AUTOSCALER_DEPLOYMENT: "cluster-autoscaler"
    K8S_AUTOSCALER_REPLICAS: 2
    KUBECONFIG: "/root/.kube/config"
  script:
    - source variables
    - aws eks --region eu-west-1 update-kubeconfig --name ${TF_VAR_cluster_name}
    - eks_rolling_update.py --cluster_name ${TF_VAR_cluster_name}
  when: manual
  timeout: 4h
If any additional details are needed, please let me know.
Looking forward to your reply.
Thanks in advance!
Version of eks-rolling-update: most recent (10-Dec-2020)
version of Kubernetes: 1.18
Hi,
I've found that periodically the reliability of scale-up can be questionable in AWS. In particular, I've seen cases where the EC2 node never registers with the EKS cluster when doing a large amount of scaling (for instance 100+ nodes at once). This results in eks-rolling-update blocking indefinitely because the k8s node count never matches the EC2 count.
I modified eks-rolling-update to include a batch-count environment variable that instructs the tool to scale up X nodes at a time. This appears to have eliminated the flakiness I was experiencing.
I would like to send a PR for the enhancement, but thought I should ask whether the owners of this repo would be interested before I do. In my implementation, batching is optional by default.
Lemme know. Thanks.
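To make the proposal concrete, the batching itself is just splitting the scale-up into chunks. A sketch, with a hypothetical batch-count parameter (not an existing setting of the tool):

```python
def scale_up_batches(nodes_to_add, batch_count=0):
    """Split a scale-up of nodes_to_add instances into batches of at
    most batch_count; batch_count <= 0 means no batching (current
    behaviour). The tool would raise the ASG desired capacity once per
    batch and wait for the new nodes to register before the next batch."""
    if batch_count <= 0:
        return [nodes_to_add] if nodes_to_add else []
    full, rem = divmod(nodes_to_add, batch_count)
    return [batch_count] * full + ([rem] if rem else [])
```

For a 100-node scale-up with a batch count of 25, the tool would perform four waits of 25 nodes each rather than one wait for all 100.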
Hi,
Great tool and thanks for providing this !
Just an idea for a feature: a "force refresh" option, so we can recycle servers even if the launch template has not been updated.
Reason: We use this process across our other EC2 servers, to make sure they are always patched and refreshed.
Thanks in advance
eks-rolling-update/eksrollup/lib/aws.py
Lines 112 to 126 in 70306ef
Causes a failure to run when you have more running instances than desired:
00:04:19.030 2022-01-25 16:24:43,568 INFO Checking asg golf-dev-mgmt-worker-node-0-20191108085402700900000002 instance count...
00:04:19.030 2022-01-25 16:24:43,701 INFO Asg golf-dev-mgmt-worker-node-0-20191108085402700900000002 does not have enough running instances to proceed
00:04:19.030 2022-01-25 16:24:43,701 INFO Actual instances: 7 Desired instances: 6
This script seems to be useful and in general does the right thing. But I think one step is missing for the smoothest operation: all the outdated nodes should be cordoned before any draining starts. Otherwise pods might be migrated to other outdated nodes when a node is drained, which means they would be migrated more than once during the rolling update.
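Concretely, cordoning is just patching spec.unschedulable on each outdated node up front. A sketch with the API client injected (v1_api would be a kubernetes.client.CoreV1Api; the function name is illustrative):

```python
# The patch body that `kubectl cordon` applies to a node.
CORDON_PATCH = {"spec": {"unschedulable": True}}

def cordon_outdated_nodes(v1_api, node_names):
    """Mark every outdated node unschedulable before the first drain,
    so pods evicted from one node cannot be rescheduled onto another
    node that is about to be drained anyway."""
    for name in node_names:
        v1_api.patch_node(name, CORDON_PATCH)
```

Injecting the client keeps the helper trivially testable with a fake API object.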
From the time a node goes down until it is de-registered from the Load Balancer, it will still be receiving requests. Reducing the thresholds and retries before the LB considers the node unavailable could improve things, but at least a few seconds will still be needed, and it could also cause other issues, like de-registering nodes for the wrong reasons. Any suggestions?
Even after increasing the value of GLOBAL_HEALTH_RETRY and/or GLOBAL_HEALTH_WAIT, I get the error below:
Command: python eks_rolling_update.py --cluster_name YOUR_EKS_CLUSTER_NAME
2021-01-29 03:00:21,729 INFO Found 3 outdated instances
2021-01-29 03:00:22,349 INFO Getting k8s nodes...
2021-01-29 03:00:22,743 INFO Current k8s node count is 3
2021-01-29 03:00:22,743 INFO Setting the scale of ASG dev-gvh-worker19940502290119910100001011 based on 3 outdated instances.
2021-01-29 03:00:22,743 INFO Modifying asg dev-gvh-worker19940502290119910100001011 autoscaling to resume ...
2021-01-29 03:00:22,976 INFO Found previous desired capacity value tag set on asg from a previous run.
2021-01-29 03:00:22,976 INFO Maintaining previous capacity of 3 to not overscale.
2021-01-29 03:00:22,976 INFO Waiting for 90 seconds before validating cluster health...
2021-01-29 03:01:52,981 INFO Checking asg dev-gvh-worker19940502290119910100001011 instance count...
2021-01-29 03:01:53,258 INFO Asg dev-gvh-worker19940502290119910100001011 does not have enough running instances to proceed
2021-01-29 03:01:53,258 INFO Actual instances: 3 Desired instances: 4
2021-01-29 03:01:53,258 INFO Validation failed for asg dev-gvh-worker19940502290119910100001011. Not enough instances online.
2021-01-29 03:01:53,258 INFO Exiting since ASG healthcheck failed after 1 attempts
2021-01-29 03:01:53,258 ERROR ASG healthcheck failed
2021-01-29 03:01:53,258 ERROR *** Rolling update of ASG has failed. Exiting ***
2021-01-29 03:01:53,258 ERROR AWS Auto Scaling Group processes will need resuming manually
First of all, thanks for this wonderful project.
Is it possible to add changelog/release notes for each version release?
ATM, it is hard to tell what has changed when I upgrade across multiple versions.
Is there any specific reason for not using the API to drain the node? I have code that I use reliably to drain nodes via the API. Is there any way I can contribute that?
Please let me know.
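For reference, an API-based drain boils down to listing the node's pods and creating Eviction objects for them, which honours PodDisruptionBudgets. A bare-bones sketch (core_v1 would be a kubernetes.client.CoreV1Api; a real drain also needs DaemonSet/mirror-pod filtering and retries on 429 responses from PDBs):

```python
def drain_node_via_api(core_v1, node_name):
    """Evict every pod on node_name through the Eviction API,
    instead of shelling out to kubectl drain."""
    pods = core_v1.list_pod_for_all_namespaces(
        field_selector="spec.nodeName=" + node_name).items
    for pod in pods:
        eviction = {
            "apiVersion": "policy/v1",
            "kind": "Eviction",
            "metadata": {"name": pod.metadata.name,
                         "namespace": pod.metadata.namespace},
        }
        core_v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction)
```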
Hi there, I'm new to this rolling upgrade tool, so my question may be silly. Why not just use the eksctl upgrade command? Doesn't it support a rolling upgrade, or would it cause downtime for users' applications? Thanks.
I'm sure this is something on my end, but I can't figure out where else to look.
We currently manage 4 clusters, all on EKS, all deployed using https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest. Each deployment has 2 node groups, uses a launch template, and runs a 50/50 on-demand/spot mix.
I get errors like:
2020-10-28 14:00:56,916 INFO Instance is failing to terminate. Cancelling out.
2020-10-28 14:00:56,916 ERROR ('Rolling update on ASG failed', 'eks-epx-qa-video-video20201005154444675300000010')
2020-10-28 14:00:56,916 ERROR *** Rolling update of ASG has failed. Exiting ***
2020-10-28 14:00:56,916 ERROR AWS Auto Scaling Group processes will need resuming manually
2020-10-28 14:00:56,916 ERROR Kubernetes Cluster Autoscaler will need resuming manually
or
2020-10-28 14:45:22,860 INFO All k8s nodes are healthy
2020-10-28 14:45:22,860 INFO Cluster validation passed. Proceeding with node draining and termination...
2020-10-28 14:45:22,860 INFO Searching for k8s node name by instance id...
2020-10-28 14:45:22,860 INFO Could not find a k8s node name for that instance id. Exiting
2020-10-28 14:45:22,860 ERROR Encountered an error when adding taint/cordoning node
2020-10-28 14:45:22,860 ERROR Could not find a k8s node name for that instance id. Exiting
I'm running the command like this:
K8S_AUTOSCALER_ENABLED=1 K8S_AUTOSCALER_NAMESPACE="kube-system" K8S_AUTOSCALER_DEPLOYMENT="autoscaler-aws-cluster-autoscaler-chart" python eks_rolling_update.py --cluster_name <cluster_name>
Any ideas on how to proceed troubleshooting?
2020-12-15 11:57:58,155 INFO Setting the scale of ASG xx-yy-zz based on 3 outdated instances.
2020-12-15 11:57:58,155 INFO Modifying asg xx-yy-zz autoscaling to resume ...
2020-12-15 11:57:58,425 INFO No previous capacity value tags set on ASG; setting tags.
2020-12-15 11:57:58,425 INFO Saving tag to asg key: eks-rolling-update:original_capacity, value : 3...
2020-12-15 11:57:58,687 INFO Saving tag to asg key: eks-rolling-update:desired_capacity, value : 6...
2020-12-15 11:57:58,960 INFO Saving tag to asg key: eks-rolling-update:original_max_capacity, value : 10...
2020-12-15 11:57:59,243 INFO Describing autoscaling groups...
2020-12-15 11:57:59,871 INFO Current asg instance count in cluster is: 6. K8s node count should match this number
2020-12-15 11:57:59,871 ERROR '>' not supported between instances of 'NoneType' and 'int'
2020-12-15 11:57:59,872 ERROR *** Rolling update of ASG has failed. Exiting ***
2020-12-15 11:57:59,872 ERROR AWS Auto Scaling Group processes will need resuming manually
2020-12-15 11:57:59,872 ERROR Kubernetes Cluster Autoscaler will need resuming manually
I ran into this a couple of times today while running eks_rolling_update.py -c <cluster>.
As a big fan of this tool, I'm curious: does the AWS Instance Refresh for EC2 Auto Scaling announcement offer anything new here?
In the code, resuming the cluster autoscaler is hardcoded to 2 replicas. In our environment we only run a single replica. Since running multiple replicas is untested in our environment, we manually reset it to replicas=1 after running the rolling update.
It would be nice to have the option to pass in a value for cluster autoscaler replicas. I imagine that there are other people running either a single instance or more than two replicas that would benefit from this.
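A minimal sketch of the requested knob, reading the replica count from the environment instead of the hardcoded 2 (K8S_AUTOSCALER_REPLICAS is a suggested name mirroring the tool's other K8S_AUTOSCALER_* settings, not a confirmed existing option):

```python
import os

def autoscaler_resume_replicas(default=2):
    """Replica count to restore the cluster-autoscaler deployment to
    after the rolling update, configurable via the environment."""
    return int(os.environ.get("K8S_AUTOSCALER_REPLICAS", default))
```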
We run JupyterHub on EKS, and in this setup users have notebook servers they log in to and use interactively, much like a shell server. Updating the EKS nodes these notebook pods are running on would be hugely disruptive. Having a way to completely ignore some ASGs/nodes would be very helpful, so we can skip updating the nodes running our users' notebook servers.
The script timed out due to an instance taking longer than usual to terminate.
The environment variable that needs to be set to increase the number of checks is GLOBAL_MAX_RETRY. But the documentation says this is "Number of attempts of a health check". However, the termination check is not a health check, so this documentation should be corrected to something like "Number of attempts of a health or termination check".
With RUN_MODE=1, all old nodes are cordoned at the same time, which makes the AWS ELB mark the old nodes out of service. If the new nodes take time to come into service, then no healthy instances are left for a while, which causes an outage.
We tried cordoning 1 node at a time and didn't see this issue. The downside is that a pod may bounce multiple times, because it may land on an old node that hasn't been cordoned yet; some people will be fine with one pod among multiple replicas bouncing multiple times.
Can we have a RUN_MODE 5 which is the same as RUN_MODE 1, except that it does "cordon 1 node --> drain 1 node --> delete 1 node" at a time, instead of "cordon all nodes --> drain 1 node --> delete 1 node"?
This is more of a placeholder that I intend to follow up with a PR.
A number of times, I've had trouble with workloads that have special placement requirements (for example, an EBS-backed PV) when other workloads have taken up the space, leaving me with a pod stuck in Pending.
I'm not sure if it's the run mode I use (2) or something else, but it might be good to have the option to pass in some sort of "overflow" value that adds a number of extra instances to each ASG's calculated instance count. This would hopefully allow HPAs to scale out if necessary and avoid pods getting stuck in Pending.
Afterwards, the cluster autoscaler would work out whether anything needed to be adjusted.
Hello,
Just wondering if anybody else experiences an issue we often face when using this tool.
We run the tool, which pauses the cluster autoscaler and then proceeds to cordon/drain ASGs one at a time.
We often find that pods become unschedulable due to insufficient memory. We always set memory request=limit (Java), but, as is to be expected, when pods are evicted they can spread over the new nodes in such a way that no single node has enough free space to run a large pod.
What, if anything, could we do to stop this? One option is to not disable the autoscaler, which the eks-rolling-update docs say is optional to disable, but what negative issues would occur if we did that?