target / pod-reaper
Rule based pod killing kubernetes controller
License: MIT License
To avoid the warning messages below and future breaking changes, we need to start using the v1 RBAC API instead.
W1206 09:54:41.302957 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1206 09:54:41.502812 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W1206 09:54:42.119382 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1206 09:54:42.306488 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
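The migration should be mechanical. A minimal sketch of what the v1 equivalents could look like (object names are taken from the example manifest elsewhere in this document and are otherwise assumptions):

```yaml
# Sketch: the same RBAC objects declared against the v1 API.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reaper-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-reaper-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-reaper-cluster-role
subjects:
  - kind: ServiceAccount
    name: pod-reaper-service-account
    namespace: reaper
```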
Current thought would be to allow this to interact with prometheus and/or statsD. Looking for feedback on how people would like metrics to look from the reaper.
This is one that I have discussed with a few people in person. On one hand, it would allow for a safer learning curve, particularly for things that are clustered.
A couple of options:
1. Prevent more than `n` pods for any one replica set from being deleted at one time.
2. Skip reaping pods that belong to a replica set with fewer than `n` pods.

There is some fair discussion about whether or not this is a feature pod-reaper should have. I would like to avoid letting people hide problems with this option. For example, if you're running a single pod and we're using option 2, then pod-reaper would effectively be whitelisting that pod. In the first case, a small `n` value doesn't necessarily provide much value to large replica sets.
How to configure this tool for this use case
Hi,
First of all, thanks for open sourcing this interesting project.
I was playing with it and found out something odd. However, I am not completely sure if the issue is in your service, or in the cron library that you are using down the line.
I have set up a deployment using the following Schedule option:
- name: SCHEDULE
value: "0 20 * * *"
My expectation was that pod-reaper would check for pods to 🔥 at 20:00 every day. However, this is the result I am getting:
{"level":"info","msg":"loaded rule: chaos chance 0.999","time":"2020-03-26T16:04:16Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-kft22","reasons":["was flagged for chaos"],"time":"2020-03-26T16:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-wfqs2","reasons":["was flagged for chaos"],"time":"2020-03-26T16:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-mxcwg","reasons":["was flagged for chaos"],"time":"2020-03-26T17:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-vx4kb","reasons":["was flagged for chaos"],"time":"2020-03-26T17:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-2-5468d66b7b-cgq9c","reasons":["was flagged for chaos"],"time":"2020-03-26T18:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-7zlv9","reasons":["was flagged for chaos"],"time":"2020-03-26T18:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-rkcmh","reasons":["was flagged for chaos"],"time":"2020-03-26T19:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-cgq9c","reasons":["was flagged for chaos"],"time":"2020-03-26T19:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-2-asfasfsasf-fsfds","reasons":["was flagged for chaos"],"time":"2020-03-26T20:20:00Z"} โ
โ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-2mcpp","reasons":["was flagged for chaos"],"time":"2020-03-26T20:20:00Z"}
Is there anything that I am doing incorrectly?
Thanks
Hello!
Thank you for pod-reaper!
One feature we would find useful is the ability to decide which pods should be reaped when `max_pods` is defined. A few strategies that come to mind: `pod-deletion-cost`. The current behavior appears to be random, based on observations.
Requests that have been made for logging
It would be nice to be able to override default pod-reaper settings with annotations. For example, if `MAX_DURATION=1d` but a pod had the annotation `pod-reaper/maxduration: 12h`, then that pod would be reaped after 12 hours.
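A hypothetical example of what such an annotation could look like (the annotation key comes from the issue text above; pod-reaper does not support this today):

```yaml
# Sketch of the proposed per-pod override; NOT an existing feature.
apiVersion: v1
kind: Pod
metadata:
  name: short-lived-worker
  annotations:
    pod-reaper/maxduration: "12h"   # would override MAX_DURATION=1d for this pod
```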
Currently the pod-reaper terminates pods "nicely". Consider an option to hard kill with no SIGTERM... simulating a VERY hard kill.
@slushpupie
With recent changes to Docker's licensing, the automated builds are now a paid feature.
It doesn't look like I have permissions to upload new builds myself, and I was previously relying on the automated builds.
Do you have thoughts on this?
I'm pretty upset with Docker's license changes overall, but I'll have to find time to pick up podman or an alternative.
Is there a reasonable thing to do in the meantime?
I've started looking at pod-reaper after reaping 10,000 old pods in my cluster (don't ask...). This is probably the first of several feature requests, sorry if they're a bit spammy.
One feature that would make adoption easier and less risky is a dry run mode, where it does all the work but doesn't kill anything, and probably exits right away.
If a pod-reaper is being managed with a deployment, how can we implement health checks against it?
panic: pods "hello-cloud-deployment-4100001433-scb8x" not found
goroutine 1 [running]:
panic(0x1275080, 0xc420319880)
/usr/local/Cellar/go/1.7.5/libexec/src/runtime/panic.go:500 +0x1a1
main.reap(0x1bf08eb000, 0x989680, 0xc4204325c0, 0x2, 0x2, 0x13a7d7b, 0xa, 0x13a598e, 0x8, 0xc42001204a, ...)
/Users/z001kkm/code/go/src/pod-reaper/main.go:52 +0x319
main.main()
/Users/z001kkm/code/go/src/pod-reaper/main.go:102 +0x98
Thrown by this line:
err := clientSet.Core().Pods(pod.ObjectMeta.Namespace).Delete(pod.ObjectMeta.Name, nil)
This shouldn't be a panic. If the pod isn't found (it might have been deleted by something else, or some other event might have happened) or the delete otherwise fails, we should probably just log it and continue happily.
A minimum duration rule could be useful to prevent the pod-reaper from killing any pods that are just starting or have only been alive for a short duration.
Example use case: rolling deployments. Having a pod killed during a rolling deployment isn't necessarily bad, but it could cause undesirable effects in the case of automated canary analysis, where a pod being killed could prevent a move forward towards production through no fault of the pod.
`kubectl run` cannot be used when an environment variable has commas, due to the way the command parses command-line flags. More investigation needs to happen here, as it might have been fixed upstream.
I apologize in advance if this was specified in your documentation, but I could not find it in either Github or Docker Hub.
I was wondering if pod-reaper acts on all pods that match based on `REQUIRE_LABEL_KEY`/`EXCLUDE_LABEL_KEY` at the same time, or if it iteratively does one pod at a time.
This matters to me because I need to ensure that when pod-reaper kills off pods, we have zero downtime. So in a way, I am basically looking for a `RollingUpdate` + `maxUnavailable: 0` option for killing off pods.
I understand I can use `CHAOS_CHANCE` to try to ensure some pods stay alive. But a rolling strategy for killing off pods would be far more deterministic and predictable.
Please let me know if this is the default implementation, or if there is something I can set to make this happen.
Thank you.
Hi, thanks for your tool.
I'd like to apply pod-reaper to my k8s cluster, but in the example `deployment.yaml`, I don't see any policy granting pod-reaper permission to delete pods.
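For reference, a minimal ClusterRole sketch granting a pod-reaper service account list and delete on pods might look like the following (object names are placeholders, and the exact verbs pod-reaper needs should be confirmed against its code):

```yaml
# Sketch: minimal RBAC to let pod-reaper enumerate and delete pods.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reaper-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
```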
Logging level should be configurable from environment variables. It's possible that this is already done by the logrus library, in which case the only change needed would be documentation.
Hi, I am trying to run pod-reaper as a deployment but keep getting this panic at run time:
{"error":"no rules were loaded","level":"panic","msg":"error loading options","time":"2020-03-26T04:11:46Z"}
panic: (*logrus.Entry) (0x142fba0,0xc42034f810)
goroutine 1 [running]:
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.Entry.log(0xc42004e060, 0xc420211620, 0x0, 0x0, 0x0, 0x0, 0x0
  /go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:239 +0x350
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.(*Entry).Log(0xc42034f7a0, 0xc400000000, 0xc4205f9d30, 0x1, 0
  /go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:268 +0xc8
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.(*Entry).Panic(0xc42034f7a0, 0xc4205f9d30, 0x1, 0x1)
  /go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:306 +0x55
main.newReaper(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
  /go/src/github.com/target/pod-reaper/reaper/reaper.go:37 +0x2de
main.main()
  /go/src/github.com/target/pod-reaper/reaper/main.go:22 +0x50
Here is my manifest that includes the resources I am deploying.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: reaper
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reaper-service-account
  namespace: reaper
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: pod-reaper-cluster-role
rules:
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: pod-reaper-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-reaper-cluster-role
subjects:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-reaper
  namespace: reaper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-reaper
  template:
    metadata:
      labels:
        app: pod-reaper
        pod-reaper: disabled
    spec:
      serviceAccount: pod-reaper-service-account
      containers:
        - name: airflow-scheduler-terminator
          image: target/pod-reaper
          resources:
            limits:
              cpu: 30m
              memory: 30Mi
            requests:
              cpu: 20m
              memory: 20Mi
          env:
            - name: NAMESPACE
              value: dataloader-airflow-blue
            - name: SCHEDULE
              value: "@every 15m"
            - name: REQUIRE_LABEL_KEY
              value: component
            - name: REQUIRE_LABEL_VALUES
              value: scheduler
```
Thanks in advance for a great tool.
The MAX_DURATION option does not count from the pod status start time but instead from the pod start time.
I deployed a pod with an entry point that Evicts it after 10 minutes using the command below:
sleep 600; apt update; apt install curl -y; while true; do curl http://some.url --output some.file; done
The pod reaper is configured with a MAX_DURATION of 5 minutes, POD_STATUSES with Evicted, and running every 1 minute.
I was expecting to see pod-reaper reap the evicted pod at minute 15 of the pod's life, but instead the pod was reaped right away, at minute 11.
I took a look at the code: it is using the pod status start time, but it looks like it is getting the first status's start time and not the status configured in pod-reaper.
https://github.com/target/pod-reaper/blob/master/rules/duration.go#L33
We should consider updating the dependencies to use Go's newer built-in modules.
Aim for a set of statuses, and delete all pods matching any status in that set.
Is there a desire for a helm chart for this? Even just a folder in the main repo which can be referred to.
And... has anyone done that work already?
Firstly, thank you for creating this tool - I really like its simplicity. I've been experimenting with it on GCP, and whilst this is not really an issue with pod-reaper itself, the default Logrus structured logs do not get handled well by the fluentd/Stackdriver collectors, and all the log messages, regardless of severity, get logged as Errors.
I've put together a small PR to allow for different formatting of the logs, let me know what you think.
Need to get access from @jmccann to publish to https://hub.docker.com/r/target/.
`RUN_DURATION` is unsafe in the case that pod-reaper is killed. It should be better documented that you should NOT use this configuration option if you are controlling the pod-reaper via a self-healing process (such as a kubernetes deployment), since each time the reaper is restarted it will recalculate the run duration.
This was something that I was "vaguely aware of" when I was writing the feature, as I was imagining two disparate use cases:
This should really be documented clearly.
Hello, I'm configuring the pod-reaper helm chart with Renovate bot, and I noticed that the helm chart version is not the same as the pod-reaper version.
I can see two strategies for this:
1. Separate the Helm chart code into a new repo and maintain it there (that would be better, IMHO).
2. Make the Helm chart version match the pod-reaper version, so that I could create a new MR with a real version tag.
Working on #42 and running some local testing of #40 made me think more about my local development. I found a lot of quick success playing around with KinD (Kubernetes in Docker): https://github.com/kubernetes-sigs/kind
I know that I probably overdo local testing on pod-reaper, because I want to make sure that something capable of killing every pod in a cluster is functioning like I want. As part of that, I want to make sure that I, and anyone else, have a quick and easy way to try out changes locally without throwing potentially dangerous prototype versions out into non-local docker repositories.
There has been a request for a rule that could be used to control when pods are reaped relative to the time of the day/week.
Use case: I only want to clean stuff up after standard working hours.
Use case: I want to periodically kill pods, but only when most people are in the office.
Hi,
Pod-reaper is not deleting pods which are in the Evicted state.
Is this the expected behavior? If yes, then can we have a feature in place which deletes pods that are in the Evicted state?
Please let us know your inputs. @brianberzins @hblanks
Thanks.
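For reference, a sketch of an environment configuration that attempts to target evicted pods via the existing `POD_STATUSES` option (the schedule value is an assumption, and whether "Evicted" actually matches depends on how the status rule reads pod state):

```yaml
# Sketch only: env for a pod-reaper container attempting to reap evicted pods.
env:
  - name: SCHEDULE
    value: "@every 10m"     # assumed schedule
  - name: POD_STATUSES
    value: "Evicted"        # matches the pod status reason, per the issues here
```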
Kubectl shows the pod phase as the status, e.g. Running, Succeeded, Failed. The pod status rule checks the optional reason, e.g. Evicted. This is confusing, and also means you cannot create a `POD_STATUS=Running` rule. Can we change the pod status rule to use phase? Should the current reason rule be renamed or use a different env variable, e.g. `POD_STATUS_REASON`?
Relevant documentation:
// The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
// The conditions array, the reason and message fields, and the individual container status
// arrays contain more detail about the pod's status.
// There are five possible phase values:
//
// Pending: The pod has been accepted by the Kubernetes system, but one or more of the
// container images has not been created. This includes time before being scheduled as
// well as time spent downloading images over the network, which could take a while.
// Running: The pod has been bound to a node, and all of the containers have been created.
// At least one container is still running, or is in the process of starting or restarting.
// Succeeded: All containers in the pod have terminated in success, and will not be restarted.
// Failed: All containers in the pod have terminated, and at least one container has
// terminated in failure. The container either exited with non-zero status or was terminated
// by the system.
// Unknown: For some reason the state of the pod could not be obtained, typically due to an
// error in communicating with the host of the pod.
//
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-phase
// +optional
Phase PodPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PodPhase"`
// A brief CamelCase message indicating details about why the pod is in this state.
// e.g. 'Evicted'
// +optional
Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`
Related to #50
After we get end-to-end testing that executes against a cluster, what I've got set up right now for CI isn't going to be good enough. Specifically, it won't handle the end-to-end testing well.
Figure this would be a good time to look into github actions!
While testing pod-reaper in dry run mode, one issue we observed when numerous pods match the defined rules and `MAX_PODS` is set is that all the matching pods get marked as "pod would be reaped but pod-reaper is in dry-run mode".
Our expectation in this case would have been to see `MAX_PODS` pods marked as reapable, while the remainder would be marked as "pod would be reaped but maxPods is exceeded" (possibly also indicating "pod-reaper is in dry-run mode"). This would better reflect the non-dry-run behavior (i.e., reaping at most `MAX_PODS` pods) and would appear safer when dry run is turned off.
A simple approach to solve this issue would be to log/indicate that we're in dry run mode on start, and keep all subsequent log output just as if it was a live run, simply not executing the reaping process.
Should also consider json logging.