linki / chaoskube
chaoskube periodically kills random pods in your Kubernetes cluster.
License: MIT License
It'd be good to have an HTTP endpoint that could be used for Kubernetes readiness and liveness probes. It could return 200 OK. We could create the HTTP listener just before the infinite for loop in a goroutine.
I'm happy to do a PR if you think this is a good idea.
I am running the following OS on a POWER machine:
Linux icp1p1 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:55:56 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
After deploying chaoskube on the host, I get the following error:
standard_init_linux.go:185: exec user process caused "exec format error"
chaoskube runs well on x86 hosts, but not on POWER. Is there any workaround to run it on POWER, or is there an image available for POWER machines?
Thanks for this project. I'd started to write my own when I found this.
I'd like to filter pods by their age. Our use case is to not delete pods that are younger than a certain time.
Any interest in adding a min-age flag? I'd be happy to do the work and submit a PR if so.
Thanks!
As the README documents, chaoskube defaults to running in dry-run mode; that's fine, and the actual behavior aligns with that. After confirming the targets, I wanted to turn off dry-run mode but failed. The README only mentions that users can turn it off, so I'm wondering how to do it.
PS: I am using version v0.14.0, via the Helm chart template 3.1.2, which I believe is the latest one.
I found another issue #103 but seems not the same thing.
Our cluster includes ephemeral, randomly named namespaces which are used to run automated integration tests as part of a Jenkins pipeline. Rather than tagging numerous pods with labels to exclude them, we'd like to be able to exclude entire namespaces using labels.
This could be achieved with an additional option, --namespace-labels. Just wondering if this would be a useful addition? I'd be happy to raise a PR for this.
See #75
If chaoskube has trouble talking to Kubernetes (during runtime) it should be detectable either via readiness/healthiness probes or via metrics.
Currently chaoskube requires a global pod-reader role to find targets, even when narrowing down the search space with the --namespaces flag.
See: https://github.com/linki/chaoskube/blob/v0.9.0/chaoskube/chaoskube.go#L144
Deletion works fine as the API targets a specific namespace.
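If listing were done per namespace, a namespace-scoped Role along these lines should suffice (a sketch; the namespace name is a placeholder and this is not the project's shipped manifest):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaoskube
  namespace: my-namespace   # placeholder: the single namespace being targeted
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "delete"]
```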
Hi Martin,
I am currently working on a project where we are trying to improve the reliability of our software using chaos engineering (but, unfortunately, we have very little experience with it). Currently, our software runs on Azure/Kubernetes.
We found chaoskube to be a promising tool to help us, but we found that its behavior differs from what we expected. The description of chaoskube says that it kills pods, so I formed a hypothesis about what would happen when one of our pods is killed while handling a request (there should be an error response, and subsequent requests should be processed by the other pod). When I ran the experiment, the pods were killed but no error occurred.
Then one of my colleagues looked at the source code of chaoskube and found that the pod is not killed (i.e. force-killed instantly), but rather terminated (if I understood correctly, with this approach the pod finishes its current task and then "dies" peacefully).
Is this really how chaoskube works?
We are learning more about chaos every day, but there is a lot of knowledge that we need to gain.
Since my hypothesis was probably wrong, I would be really grateful for any advice about what other chaos experiments chaoskube is suitable for.
Thank You,
Ladislav
Chaoskube should be able to create more chaos during hours where people are around to notice, fix and learn from failures. It makes no sense for the artificial chaos to occur at night and get people paged when they are asleep.
Hi @linki,
Thoughts on also adding the termination event to the top most owner of a pod in addition to the pod itself?
We're attempting to add visibility for application owners when their pods get terminated – and new pods are starting. The terminated pods kind of disappear from the current view of the deployment, which adds a bit of work to find if any pod has been terminated through ChaosKube.
I know that it's possible to look at events in that namespace to find out what happened, but I believe that adding an event to the deployment/parent of the pod would greatly help in surfacing the actions of ChaosKube.
If that sounds like a useful feature, I'll happily submit a PR.
Opened a PR in the helm charts repos to add support for minimum age and use version 0.10.0.
I tested your simple but effective chaoskube and I think it's really useful for us.
Do you have any plans to exclude evicted pods from terminations?
e.g.
kubectl get pods -n {your_namespace} -w
cassandra-db-fc85698c7-hsw6w 1/1 Running 0 7m
cassandra-db-fc85698c7-l57ms 0/1 Evicted 0 4h
cassandra-db-fc85698c7-pg8ps 0/1 Evicted 0 4h
cassandra-db-fc85698c7-swpk9 0/1 Evicted 0 4h
cassandra-db-fc85698c7-tv4nv 0/1 Evicted 0 4h
cassandra-db-fc85698c7-zjvwz 0/1 Evicted 0 4h
In this case chaoskube tries to terminate pods which have already been evicted.
If this feature is already implemented, can anybody guide me on the configuration?
Observation: pods which are in an error state or not fully started become victims of chaoskube.
Some instructions on what policy to add to the service account would be nice... or is it supposed to work out of the box?
Hi Martin,
I am struggling with multiple labels. I need to select pods for termination using labels (one key with multiple values). If there is only one value, everything works fine. When there are multiple values, chaoskube runs (in dry-run mode) with no error output, but there is also no log about killing the pods. I tried the following syntax for the "labels" line of the yaml file:
No luck so far :-(
What is the correct syntax of "labels"? We want to create the "pool" of pods of multiple applications to be killed.
What would be the syntax if I ever need to select pods by multiple keys with multiple values?
Thank You,
Ladislav
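For reference, Kubernetes' standard label selector grammar supports set-based requirements, so if chaoskube passes the --labels flag through the standard selector parser, something like the following should match one key with multiple values (a sketch; the app/tier keys and values are placeholders for your own):

```
# one key, multiple values
--labels='app in (frontend,backend)'

# multiple keys, each with multiple values (requirements are ANDed)
--labels='app in (frontend,backend),tier in (web,worker)'
```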
$ helm install stable/chaoskube --set dryRun=false --set namespaces=test --set interval=10m --set timezone=America/New_York --set excludedWeekdays="Mon,Wed,Thu,Fri" --set excludedTimesOfDay="08:00-18:00"
Error: failed parsing --set data: key "Wed" has no value (cannot end with ,)
Cannot set multiple days in excludedWeekdays.
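Helm treats unescaped commas inside --set as key/value separators; escaping each comma with a backslash usually works (a sketch based on Helm's documented --set parsing, reusing the values from the command above):

```
helm install stable/chaoskube \
  --set dryRun=false \
  --set namespaces=test \
  --set interval=10m \
  --set timezone=America/New_York \
  --set excludedWeekdays="Mon\,Wed\,Thu\,Fri" \
  --set excludedTimesOfDay="08:00-18:00"
```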
Hi,
I'm looking for a way to notify my team every time the chaos bot performs actions.
As Slack is widely used, that would be my preference.
I want to start and implement that capability for chaoskube.
Any thoughts?
Go 1.8 is out and we should support it :)
I'd like the ability to exclude apps with only 1 replica; any ideas? Would require a change I presume...
kubectl logs -f chaoskube-7b68cccbcf-g67cx
time="2019-08-08T19:38:36Z" level=info msg="starting up" dryRun=true interval=5s version=v0.15.0
W0808 19:38:36.596835 6 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-7b68cccbcf-g67cx.nobody.log.WARNING.20190808-193836.6: read-only file system
That's a regression. I thought the switch to klog removed the need to rewrite the import. It seems that klog writes to disk like glog.
version: v0.15.0
with
securityContext:
capabilities:
drop:
- ALL
procMount: Default
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 65534
...would come in handy. I think chaosmonkey has a similar feature.
I'm building the image on quay.io, so I have to compile the binary in the image, which greatly increases its size.
Options:
The USER value in the Dockerfile must be numeric, because Kubernetes expects an ID to verify whether the USER is root or not.
Is it possible to negate an annotation?
For example
- --annotations=!myannotation=true
Hello,
First of all, thanks for your awesome work!
It makes sense to allow one to configure the grace period that K8s will give the deleted pod before violently killing it (using SIGKILL). According to the documentation, the default value (at least for delete commands issued via kubectl) is 30 seconds, which I think is more than enough time for applications that support graceful shutdown.
I've checked the code, and ATM the grace period parameter is not supplied when invoking the K8s API:
chaoskube/chaoskube/chaoskube.go
Line 174 in a4acf6f
I've currently applied the following patch in order to kill processes almost immediately:
diff --git a/chaoskube/chaoskube.go b/chaoskube/chaoskube.go
index 645b6e8..be7828b 100644
--- a/chaoskube/chaoskube.go
+++ b/chaoskube/chaoskube.go
@@ -171,7 +171,9 @@ func (c *Chaoskube) DeletePod(victim v1.Pod) error {
return nil
}
- return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, nil)
+ secs := int64(0)
+ deleteopts := &metav1.DeleteOptions{GracePeriodSeconds: &secs}
+ return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, deleteopts)
}
// filterByNamespaces filters a list of pods by a given namespace selector.
Please let me know if any more details are needed.
Cheers,
Ivan
Hi!
I am testing ChaosKube on a k8s cluster with a large number of pods. The current approach of terminating only one pod per run means that some pods will not be scheduled for termination, given how large the pool is.
Would you be interested in a PR that adds a new configuration option (defaults to 1, current behavior) to override the current behavior?
Something like --max-kill=10
, would attempt to terminate up to 10 pods.
Let me know if this feature makes sense for the project and I'll happily submit a PR.
add support for filtering pods by annotations
Similar to #78 and https://github.com/zalando-incubator/cluster-lifecycle-manager/blob/8042e37ad3fb482879112e8bc6d095c01ff2ef7c/pkg/updatestrategy/node_pool_manager.go#L486-L489 we should avoid trying to kill pods that aren't running.
from #2 (comment)
Another possibility is attributes.
@kfox1111 what do you have in mind?
something like this maybe:
all pods with some value for an attribute (e.g. serviceAccountName=default)
all pods containing at least one container with some value for an attribute (e.g. image=nginx)
I'm proposing a feature addition to chaoskube that would add the ability to suspend the chaos during nights, weekends, and holidays using the following command-line options. These are designed to be somewhat consistent with the current pattern of chaoskube options as well as the configuration options for Chaos Monkey. They should be self-explanatory:
--observe-off-times true # defaults to false
--location 'America/New_York' # or 'UTC'. Required if observe-off-times is true
--offdays 'Saturday, Sunday' # default
--workhours 'start=09:00, end=17:00' # default
--holidays '2017-12-25, 2018-01-01' # defaults to empty list
The options above imply that both --observe-off-times true and --location '...' must be present for the feature to take effect. There is purposefully no default location, so the user is forced to provide one; most SRE staff are probably not working in the GMT timezone, so defaulting to UTC would not really make sense here.
Note that this requires an IANA time zone as opposed to a three-letter timezone abbreviation such as 'EDT' or 'EST', which would have to change with daylight saving conventions. Daylight saving is accounted for automatically by using IANA time zones.
I intend to post a PR as soon as I have this implemented, but wanted to get some feedback in case I'm missing something.
Since #28 I am seeing the error:
log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-production-
4075332500-53jjx.unknownuser.log.WARNING.20170707-151720.1: no such file or directory
I see that the `config, err := rest.InClusterConfig()` was removed, this might just be the error.
The number of arguments for Chaoskube.New() has increased to a point where it becomes annoying to use.
We could switch to a struct to have order-independent arguments by Name. It would also allow us to leave out keys when we want the default value.
Currently one has to run go test ./... and go build main.go to test and build the binary.
To make it easier for people cloning/forking this repository we should add a simple Makefile to run these tasks.
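A minimal sketch of such a Makefile (target names are conventional choices, not prescribed by the project):

```makefile
.PHONY: build test

build:
	go build -o chaoskube main.go

test:
	go test ./...
```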
So, I'd really like to use chaoskube to force our deployment objects to have to exercise their connection tracking/safe shutdown code. Some assurances that not too many pods get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?
Lots of details in https://github.com/Yelp/dumb-init ... want a PR for that?
(adding an apk package)
RUN apk add --no-cache dumb-init
ENTRYPOINT ["dumb-init", "--", "/bin/chaoskube"]
I'm a metrics newbie. Time to play around on a simple project like this.
Hey, thanks for this useful tool. The documentation has me confused: it instructs installation via a Helm deployment (which is cool), but 90% of the documentation references a binary (chaoskube), and there isn't any documentation on how to get access to that binary. Am I supposed to clone this repository and put something in my PATH? Am I expected to create a local binary which runs this tool through Docker?
It would be nice to have a metric for which pod has been killed, with labels like namespace and pod (name of the pod). That way teams can have easy monitoring of their killed pods.
Update the chart to v0.4.0 and document that as an install option in the README.
We would like to automatically pull the latest stable release as soon as it is released instead of updating the tag manually. Is using the latest or master image tag safe for that? Or would it be possible for you to tag the latest stable release accordingly, e.g. with stable or release?
Hi,
Since the last update to the chart (version 0.8.1) it now supports only K8s > 1.9.0 (apps/v1 API), which is a blocker for us.
Is this change a must, or can you reconsider it?
I've successfully run the 0.10.0 image on my k8s 1.8 cluster without the Helm chart.
Currently, the probability of a pod being killed depends on the number of pods in the target group. This is bad if you want to run chaoskube as a cluster addon and opt in to being killed via annotations, because you cannot estimate how often that would happen.
Allow specifying or at least somehow keep track of what's going on so Pod terminations happen in a somewhat predictable way. For example, instead of terminating a single pod every 10 minutes, each pod may have a probability of X% of being killed per hour. This, hopefully, would make pod terminations independent of the number of pods running in total.
from #6 deployment limits
@kfox1111 So, I'd really like to use chaoskube to force our deployment objects to have to exercise their connection tracking/safe shutdown code. Some assurances that not too many pods get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?
@linki I looked into PodDisruptionBudgets yesterday and they are pretty much what you want.
Kubernetes defines voluntary evictions (e.g. due to draining, auto-downscaling, etc.) and involuntary pod evictions (node failures etc.).
With those budgets you define a label selector and a minimum number of pods matching this selector that should exist. If evicting would violate that, you cannot evict the pod; you can still delete it. kubectl drain uses evict under the hood in order to honor the disruption budgets. You can still fall under your minimum when an involuntary eviction happens while you are at the minimum value from your disruption budget.
I tested it yesterday with chaoskube and it works as expected. Unfortunately, the golang fake client that I use for writing tests doesn't quite show the same behaviour. It's usually very accurate.
The outcome should be that chaoskube can be run in a mode respecting the budgets, and without it for true chaos.
Now that chaoskube is listed on https://kubeapps.com we should get a pretty logo :)
static pods cannot be killed via the Kubernetes API: https://kubernetes.io/docs/tasks/administer-cluster/static-pod/
Let's ignore them. We could use these annotations to detect what is a static pod:
metadata:
annotations:
kubernetes.io/config.hash: 3ffad4b19c937d5bb9cbacadb2f463a1
kubernetes.io/config.mirror: 3ffad4b19c937d5bb9cbacadb2f463a1
kubernetes.io/config.seen: 2018-04-09T07:44:01.286945749Z
kubernetes.io/config.source: file
I installed the project and ran this command:
chaoskube --interval=1m --debug --deploy
I'm getting a CrashLoopBackOff with this error:
2017-08-22T21:47:40.214051821Z time="2017-08-22T21:47:40Z" level=info msg="Dry run enabled. I won't kill anything. Use --no-dry-run when you're ready."
2017-08-22T21:47:40.214124368Z time="2017-08-22T21:47:40Z" level=debug msg="Using current context from kubeconfig at /root/.kube/config."
2017-08-22T21:47:40.214130887Z time="2017-08-22T21:47:40Z" level=fatal msg="stat /root/.kube/config: no such file or directory"
Any insights?
The current version of client-go used is around 2.0, and most of its APIs are deprecated in the newer version.
Hi Guys,
I have created a kops cluster using the following commands (cluster-level RBAC not enabled yet). But chaoskube doesn't kill any pods. Please help me if anything is wrong.
+++++++++++
kops create cluster --cloud "gce" --name test.k8s.local --zones=us-east1-b --master-zones=us-east1-b --state gs://testbucket --master-size n1-standard-2 --node-size n1-standard-4 --node-count 1 --admin-access 104.xxx.xx.xxx/32
+++++++++++
Chaoskube link used: https://github.com/linki/chaoskube/tree/master/examples
Please see my chaoskube pod logs:
+++++
kubectl logs -f chaoskube-6d95c94b4d-nrqjn
time="2018-09-17T10:24:53Z" level=info msg="starting up" dryRun=true interval=2m0s version=v0.10.0
time="2018-09-17T10:24:53Z" level=info msg="connected to cluster" master="https://100.56.0.7:443" serverVersion=v1.10.3
time="2018-09-17T10:24:53Z" level=info msg="setting pod filter" annotations="chaos.alpha.kubernetes.io/enabled=true" labels= namespaces="!kube-system"
time="2018-09-17T10:24:53Z" level=info msg="setting quiet times" daysOfYear="[Apr 1 Dec24]" timesOfDay="[]" weekdays="[Saturday Sunday]"
time="2018-09-17T10:24:53Z" level=info msg="setting timezone" location=UTC name=UTC offset=0
+++++
Here is my edited yamls:
cat rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: chaoskube
rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: chaoskube
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: chaoskube
subjects:
cat chaoskube.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: chaoskube
labels:
app: chaoskube
spec:
strategy:
type: Recreate
selector:
matchLabels:
app: chaoskube
template:
metadata:
labels:
app: chaoskube
spec:
serviceAccountName: chaoskube
containers:
- name: chaoskube
image: quay.io/linki/chaoskube:v0.10.0
args:
# kill a pod every 2 minutes
- --interval=02m
# only target pods in the test environment
#- --labels=environment=test
# only consider pods with this annotation
- --annotations=chaos.alpha.kubernetes.io/enabled=true
# exclude all pods in the kube-system namespace
- --namespaces=!kube-system
# don't kill anything on weekends
- --excluded-weekdays=Sat,Sun
# don't kill anything during the night or at lunchtime
#- --excluded-times-of-day=22:00-08:00,11:00-13:00
# don't kill anything as a joke or on christmas eve
- --excluded-days-of-year=Apr1,Dec24
# let's make sure we all agree on what the above times mean
- --timezone=UTC
# exclude all pods that haven't been running for at least one minute
- --minimum-age=1m
# terminate pods for real: this disables dry-run mode which is on by default
# - --no-dry-run
apiVersion: v1
kind: ServiceAccount
metadata:
name: chaoskube
labels:
app: chaoskube
Hi,
I have deployed the yamls in the examples on my K8s cluster. It's successfully installed, but it doesn't kill any pod or anything like that. Are there any changes in the installation? Can you help me please?
Regards
Subin
Hi,
I have a deployment where I'm using the operator role for my kubernetes namespace, so I have full access, but only within my own namespace. chaoskube becomes ready but fails to operate.
pods is forbidden: User "system:serviceaccount:poirot-test:operator" cannot list pods at the cluster scope: unauthorized access system:serviceaccount:xxxxxxxx:operator/[system:serviceaccounts system:serviceaccounts:xxxxxxxx system:authenticated]