linki / chaoskube
chaoskube periodically kills random pods in your Kubernetes cluster.
License: MIT License
It'd be good to have an HTTP endpoint that could be used for Kubernetes readiness and liveness probes. It could return 200 OK. We could create the HTTP listener just before the infinite for loop in a goroutine.
I'm happy to do a PR if you think this is a good idea.
I am running the following OS on a POWER machine:
Linux icp1p1 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:55:56 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
After deploying chaoskube on the host, I get the following error:
standard_init_linux.go:185: exec user process caused "exec format error"
chaoskube runs well on x86 hosts, but not on POWER. Is there any workaround to run it on POWER, or is there an image available for POWER machines?
Thanks for this project. I'd started to write my own when I found this.
I'd like to filter pods by their age. Our use case is to not delete pods that are younger than a certain time.
Any interest in adding a min-age flag? I'd be happy to do the work and submit a PR if so.
Thanks!
As the README documents, chaoskube defaults to running in dry-run mode; that's fine, and the actual behavior aligns with that. After confirming the targets, I wanted to turn off dry-run mode but failed. The README only mentions that users can turn it off, so I'm wondering how to do it.
PS: I am using version v0.14.0, via the Helm chart template 3.1.2, which I believe is the latest one.
I found another issue #103 but seems not the same thing.
Our cluster includes ephemeral, randomly named namespaces which are used to run automated integration tests as part of a Jenkins pipeline. Rather than tagging numerous pods with labels to exclude them, we'd like to be able to exclude entire namespaces using labels.
This could be achieved with an additional option, --namespace-labels. Just wondering if this would be a useful addition? I'd be happy to raise a PR for this.
See #75
If chaoskube has trouble talking to Kubernetes (during runtime) it should be detectable either via readiness/healthiness probes or via metrics.
Currently chaoskube requires a global pod-reader role to find targets, even when narrowing down the search space with the --namespaces flag.
See: https://github.com/linki/chaoskube/blob/v0.9.0/chaoskube/chaoskube.go#L144
Deletion works fine as the API targets a specific namespace.
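If listing were done per namespace, a namespace-scoped Role along these lines should suffice (a sketch; the namespace name is a placeholder and this is not the project's shipped manifest):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaoskube
  namespace: my-namespace   # placeholder: the single namespace being targeted
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "delete"]
```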
Hi Martin,
I am currently working on a project where we are trying to improve the reliability of our software using chaos engineering (but, unfortunately, we have very little experience with it). Currently, our software runs on Azure/Kubernetes.
We found chaoskube to be a promising tool to help us, but we found that its behavior differs from what we expected. The description of chaoskube says that it kills pods, so I formed a hypothesis about what would happen when one of our pods is killed while handling a request (there should be an error response, and subsequent requests should be processed by the other pod). When I ran the experiment, the pods were killed but no error occurred.
Then one of my colleagues looked at the source code of chaoskube and found that the pod is not killed (i.e. force-killed instantly), but rather terminated (if I understood correctly, with this approach the pod finishes its current task and then "dies" peacefully).
Is this really how chaoskube works?
We are learning more about chaos every day, but there is a lot of knowledge that we need to gain.
Since my hypothesis was probably wrong, I would be really grateful for any advice about what other chaos experiments chaoskube is suitable for.
Thank You,
Ladislav
Chaoskube should be able to create more chaos during hours where people are around to notice, fix and learn from failures. It makes no sense for the artificial chaos to occur at night and get people paged when they are asleep.
Hi @linki,
Thoughts on also adding the termination event to the top most owner of a pod in addition to the pod itself?
We're attempting to add visibility for application owners when their pods get terminated – and new pods are starting. The terminated pods kind of disappear from the current view of the deployment, which adds a bit of work to find if any pod has been terminated through ChaosKube.
I know that it's possible to look at events in that namespace to find out what happened, but I believe that adding an event to the deployment/parent of the pod would greatly help in surfacing the actions of ChaosKube.
If that sounds like a useful feature, I'll happily submit a PR.
Opened a PR in the helm charts repos to add support for minimum age and use version 0.10.0.
I tested your simple but effective chaoskube and I think it's really useful for us.
Do you have any plans to exclude evicted pods from terminations?
e.g.
kubectl get pods -n {your_namespace} -w
cassandra-db-fc85698c7-hsw6w 1/1 Running 0 7m
cassandra-db-fc85698c7-l57ms 0/1 Evicted 0 4h
cassandra-db-fc85698c7-pg8ps 0/1 Evicted 0 4h
cassandra-db-fc85698c7-swpk9 0/1 Evicted 0 4h
cassandra-db-fc85698c7-tv4nv 0/1 Evicted 0 4h
cassandra-db-fc85698c7-zjvwz 0/1 Evicted 0 4h
In this case chaoskube tries to terminate pods which have already been evicted.
If this feature is already implemented, can anybody guide me on the configuration?
Observation: pods which are in an error state or not fully started become victims of chaoskube.
Some instructions on what policy to add to the service account would be nice... or is it supposed to work out of the box?
Hi Martin,
I am struggling with multiple labels. I need to select pods for termination using labels (one key with multiple values). If there is only one value, everything works fine. When there are multiple values, chaoskube runs (in dry-run mode) with no error output, but there is also no log about killing the pods. I tried the following syntax for the "labels" line of the yaml file:
No luck so far :-(
What is the correct syntax of "labels"? We want to create the "pool" of pods of multiple applications to be killed.
What would be the syntax if I ever need to select pods by multiple keys with multiple values?
Thank You,
Ladislav
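For reference, Kubernetes' standard label selector grammar supports set-based requirements, so if chaoskube passes the --labels flag through the standard selector parser, something like the following should match one key with multiple values (a sketch; the app/tier keys and values are placeholders for your own):

```
# one key, multiple values
--labels='app in (frontend,backend)'

# multiple keys, each with multiple values (requirements are ANDed)
--labels='app in (frontend,backend),tier in (web,worker)'
```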
$ helm install stable/chaoskube --set dryRun=false --set namespaces=test --set interval=10m --set timezone=America/New_York --set excludedWeekdays="Mon,Wed,Thu,Fri" --set excludedTimesOfDay="08:00-18:00"
Error: failed parsing --set data: key "Wed" has no value (cannot end with ,)
Cannot set multiple days in excludedWeekdays.
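Helm treats unescaped commas inside --set as key/value separators; escaping each comma with a backslash usually works (a sketch based on Helm's documented --set parsing, reusing the values from the command above):

```
helm install stable/chaoskube \
  --set dryRun=false \
  --set namespaces=test \
  --set interval=10m \
  --set timezone=America/New_York \
  --set excludedWeekdays="Mon\,Wed\,Thu\,Fri" \
  --set excludedTimesOfDay="08:00-18:00"
```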
Hi,
I'm looking for a way to notify my team every time the chaos bot performs actions.
As Slack is widely used, that would be my preference.
I want to start and implement that capability for chaoskube.
Any thoughts?
Go 1.8 is out and we should support it :)
I'd like the ability to exclude apps with only 1 replica; any ideas? Would require a change I presume...
kubectl logs -f chaoskube-7b68cccbcf-g67cx
time="2019-08-08T19:38:36Z" level=info msg="starting up" dryRun=true interval=5s version=v0.15.0
W0808 19:38:36.596835 6 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-7b68cccbcf-g67cx.nobody.log.WARNING.20190808-193836.6: read-only file system
That's a regression. I thought the switch to klog removed the need to rewrite the import. It seems that klog writes to disk like glog.
version: v0.15.0
with
securityContext:
capabilities:
drop:
- ALL
procMount: Default
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 65534
...would come in handy. I think chaosmonkey has a similar feature.
I'm building the image on quay.io, so I have to compile the binary in the image, which greatly increases its size.
Options:
The USER value in the Dockerfile must be numeric, because Kubernetes expects an ID to verify whether the USER is root or not.
Is it possible to negate an annotation?
For example
- --annotations=!myannotation=true
Hello,
First of all, thanks for your awesome work!
It makes sense to allow one to configure the grace period that K8s will give the deleted pod before violently killing it (using SIGKILL). According to the documentation, the default value (at least for delete commands issued via kubectl) is 30 seconds, which I think is more than enough time for applications that support graceful shutdown.
I've checked the code, and ATM the grace period parameter is not supplied when invoking the K8s API:
chaoskube/chaoskube/chaoskube.go
Line 174 in a4acf6f
I've currently applied the following patch in order to kill processes almost immediately:
diff --git a/chaoskube/chaoskube.go b/chaoskube/chaoskube.go
index 645b6e8..be7828b 100644
--- a/chaoskube/chaoskube.go
+++ b/chaoskube/chaoskube.go
@@ -171,7 +171,9 @@ func (c *Chaoskube) DeletePod(victim v1.Pod) error {
return nil
}
- return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, nil)
+ secs := int64(0)
+ deleteopts := &metav1.DeleteOptions{GracePeriodSeconds: &secs}
+ return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, deleteopts)
}
// filterByNamespaces filters a list of pods by a given namespace selector.
Please let me know if any more details are needed.
Cheers,
Ivan
Hi!
I am testing ChaosKube on a k8s cluster with a large number of pods. The current approach of terminating only one pod per run means that some pods will not be scheduled for termination, given how large the pool is.
Would you be interested in a PR that adds a new configuration option (defaults to 1, current behavior) to override the current behavior?
Something like --max-kill=10
, would attempt to terminate up to 10 pods.
Let me know if this feature makes sense for the project and I'll happily submit a PR.
add support for filtering pods by annotations
Similar to #78 and https://github.com/zalando-incubator/cluster-lifecycle-manager/blob/8042e37ad3fb482879112e8bc6d095c01ff2ef7c/pkg/updatestrategy/node_pool_manager.go#L486-L489 we should avoid trying to kill pods that aren't running.
from #2 (comment)
Another possibility is attributes.
@kfox1111 what do you have in mind?
something like this maybe:
all pods with some value for an attribute (e.g. serviceAccountName=default)
all pods containing at least one container with some value for an attribute (e.g. image=nginx)
I'm proposing a feature addition to chaoskube that would add the ability to suspend the chaos during nights, weekends, and holidays using the following command-line options. These are designed to be somewhat consistent with the current pattern of chaoskube options as well as the configuration options for Chaos Monkey. They should be self-explanatory:
--observe-off-times true # defaults to false
--location 'America/New_York' # or 'UTC'. Required if observe-off-times is true
--offdays 'Saturday, Sunday' # default
--workhours 'start=09:00, end=17:00' # default
--holidays '2017-12-25, 2018-01-01' # defaults to empty list
The options above imply that both --observe-off-times true and --location '...' must be present for the feature to take effect. There is purposefully no default location, so the user is forced to provide one; most SRE staff are probably not working in the GMT timezone, so defaulting to UTC would not really make sense here.
Note that this requires an IANA time zone as opposed to a three-letter timezone abbreviation such as 'EDT' or 'EST', which would have to change with daylight saving conventions. Daylight saving is accounted for automatically by using IANA time zones.
I intend to post a PR as soon as I have this implemented, but wanted to get some feedback in case I'm missing something.
Since #28 I am seeing the error:
log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-production-
4075332500-53jjx.unknownuser.log.WARNING.20170707-151720.1: no such file or directory
I see that the `config, err := rest.InClusterConfig()` was removed, this might just be the error.
The number of arguments for Chaoskube.New() has increased to a point where it becomes annoying to use.
We could switch to a struct to have order-independent arguments by Name. It would also allow us to leave out keys when we want the default value.
Currently one has to run go test ./... and go build main.go to test and build the binary.
To make it easier for people cloning/forking this repository we should add a simple Makefile to run these tasks.
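A minimal sketch of such a Makefile (target names are conventional choices, not prescribed by the project):

```makefile
.PHONY: build test

build:
	go build -o chaoskube main.go

test:
	go test ./...
```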
So, I'd really like to use chaoskube to force our deployment objects to have to exercise their connection tracking/safe shutdown code. Some assurances that not too many pods get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?
Lots of details in https://github.com/Yelp/dumb-init ... want a PR for that?
(adding an apk package)
RUN apk add --no-cache dumb-init
ENTRYPOINT ["dumb-init", "--", "/bin/chaoskube"]
I'm a metrics newbie. Time to play around on a simple project like this.
Hey, thanks for this useful tool. The documentation has me confused: it instructs installation via a Helm deployment (which is cool), but 90% of the documentation references a binary (chaoskube), and there isn't any documentation on how to get access to that binary. Am I supposed to clone this repository and put something in my PATH? Am I expected to create a local binary which runs this tool through Docker?
It would be nice to have a metric for which pod has been killed, with labels like namespace and pod (name of the pod). That way teams can have easy monitoring of their killed pods.
Update the chart to v0.4.0 and document that as an install option in the README.
We would like to automatically pull the latest stable release as soon as it is released instead of updating the tag manually. Is using the latest or master image tag safe for that? Or would it be possible for you to tag the latest stable release accordingly, e.g. with stable or release?
Hi,
Since the last update to the chart (version 0.8.1) it now supports only K8s > 1.9.0 (apps/v1 API), which is a blocker for us.
Is this change a must, or can you reconsider it?
I've successfully run the 0.10.0 image on my k8s 1.8 cluster without the Helm chart.
Currently, the probability of a pod being killed depends on the number of pods in the target group. This is bad if you want to run chaoskube as a cluster addon and opt in to being killed via annotations, because you cannot estimate how often that would happen.
Allow specifying or at least somehow keep track of what's going on so Pod terminations happen in a somewhat predictable way. For example, instead of terminating a single pod every 10 minutes, each pod may have a probability of X% of being killed per hour. This, hopefully, would make pod terminations independent of the number of pods running in total.
from #6 deployment limits
@kfox1111 So, I'd really like to use chaoskube to force our deployment objects to have to exercise their connection tracking/safe shutdown code. Some assurances that not too many pods get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?
@linki I looked into PodDisruptionBudgets yesterday and they are pretty much what you want.
Kubernetes defines voluntary evictions (e.g. due to draining, auto-downscaling, etc.) and involuntary pod evictions (node failures etc.).
With those budgets you define a label selector and a minimum number of pods matching this selector that should exist. If evicting would violate that, you cannot evict the pod; you can still delete it. kubectl drain uses evict under the hood in order to honor the disruption budgets. You can still fall under your minimum when an involuntary eviction happens while you are at the minimum value from your disruption budget.
I tested it yesterday with chaoskube and it works as expected. Unfortunately, the golang fake client that I use for writing tests doesn't quite show the same behaviour. It's usually very accurate.
The outcome should be that chaoskube can be run in a mode respecting the budgets, and without it for true chaos.
Now that chaoskube is listed on https://kubeapps.com we should get a pretty logo :)
static pods cannot be killed via the Kubernetes API: https://kubernetes.io/docs/tasks/administer-cluster/static-pod/
Let's ignore them. We could use these annotations to detect what is a static pod:
metadata:
annotations:
kubernetes.io/config.hash: 3ffad4b19c937d5bb9cbacadb2f463a1
kubernetes.io/config.mirror: 3ffad4b19c937d5bb9cbacadb2f463a1
kubernetes.io/config.seen: 2018-04-09T07:44:01.286945749Z
kubernetes.io/config.source: file
I installed the project and ran this command:
chaoskube --interval=1m --debug --deploy
I'm getting a CrashLoopBackOff with this error:
2017-08-22T21:47:40.214051821Z time="2017-08-22T21:47:40Z" level=info msg="Dry run enabled. I won't kill anything. Use --no-dry-run when you're ready."
2017-08-22T21:47:40.214124368Z time="2017-08-22T21:47:40Z" level=debug msg="Using current context from kubeconfig at /root/.kube/config."
2017-08-22T21:47:40.214130887Z time="2017-08-22T21:47:40Z" level=fatal msg="stat /root/.kube/config: no such file or directory"
Any insights?
The current version of client-go used is around 2.0, and most of its APIs are deprecated in the newer version.
Hi Guys,
I have created a kops cluster using the following commands (cluster-level RBAC not enabled yet). But chaoskube doesn't kill any pods. Please help me if anything is wrong.
+++++++++++
kops create cluster --cloud "gce" --name test.k8s.local --zones=us-east1-b --master-zones=us-east1-b --state gs://testbucket --master-size n1-standard-2 --node-size n1-standard-4 --node-count 1 --admin-access 104.xxx.xx.xxx/32
+++++++++++
Chaoskube link used: https://github.com/linki/chaoskube/tree/master/examples
Please see my chaoskube pod logs:
+++++
kubectl logs -f chaoskube-6d95c94b4d-nrqjn
time="2018-09-17T10:24:53Z" level=info msg="starting up" dryRun=true interval=2m0s version=v0.10.0
time="2018-09-17T10:24:53Z" level=info msg="connected to cluster" master="https://100.56.0.7:443" serverVersion=v1.10.3
time="2018-09-17T10:24:53Z" level=info msg="setting pod filter" annotations="chaos.alpha.kubernetes.io/enabled=true" labels= namespaces="!kube-system"
time="2018-09-17T10:24:53Z" level=info msg="setting quiet times" daysOfYear="[Apr 1 Dec24]" timesOfDay="[]" weekdays="[Saturday Sunday]"
time="2018-09-17T10:24:53Z" level=info msg="setting timezone" location=UTC name=UTC offset=0
+++++
Here is my edited yamls:
cat rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: chaoskube
rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: chaoskube
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: chaoskube
subjects:
cat chaoskube.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: chaoskube
labels:
app: chaoskube
spec:
strategy:
type: Recreate
selector:
matchLabels:
app: chaoskube
template:
metadata:
labels:
app: chaoskube
spec:
serviceAccountName: chaoskube
containers:
- name: chaoskube
image: quay.io/linki/chaoskube:v0.10.0
args:
# kill a pod every 2 minutes
- --interval=02m
# only target pods in the test environment
#- --labels=environment=test
# only consider pods with this annotation
- --annotations=chaos.alpha.kubernetes.io/enabled=true
# exclude all pods in the kube-system namespace
- --namespaces=!kube-system
# don't kill anything on weekends
- --excluded-weekdays=Sat,Sun
# don't kill anything during the night or at lunchtime
#- --excluded-times-of-day=22:00-08:00,11:00-13:00
# don't kill anything as a joke or on christmas eve
- --excluded-days-of-year=Apr1,Dec24
# let's make sure we all agree on what the above times mean
- --timezone=UTC
# exclude all pods that haven't been running for at least one minute
- --minimum-age=1m
# terminate pods for real: this disables dry-run mode which is on by default
# - --no-dry-run
apiVersion: v1
kind: ServiceAccount
metadata:
name: chaoskube
labels:
app: chaoskube
Hi,
I have deployed the yamls in the examples on my K8s cluster. It's successfully installed, but it doesn't kill any pod or anything like that. Are there any changes in the installation? Can you help me please?
Regards
Subin
Hi,
I have a deployment where I'm using the operator role for my kubernetes namespace, so I have full access, but only within my own namespace. chaoskube becomes ready but fails to operate.
pods is forbidden: User "system:serviceaccount:poirot-test:operator" cannot list pods at the cluster scope: unauthorized access system:serviceaccount:xxxxxxxx:operator/[system:serviceaccounts system:serviceaccounts:xxxxxxxx system:authenticated]