
chaoskube's Introduction

chaoskube


chaoskube periodically kills random pods in your Kubernetes cluster.


Why

Test how your system behaves under arbitrary pod failures.

Example

Running it will kill a pod in any namespace every 10 minutes by default.

$ chaoskube
INFO[0000] starting up              dryRun=true interval=10m0s version=v0.21.0
INFO[0000] connecting to cluster    master="https://kube.you.me" serverVersion=v1.10.5+coreos.0
INFO[0000] setting pod filter       annotations= labels= minimumAge=0s namespaces=
INFO[0000] setting quiet times      daysOfYear="[]" timesOfDay="[]" weekdays="[]"
INFO[0000] setting timezone         location=UTC name=UTC offset=0
INFO[0001] terminating pod          name=kube-dns-v20-6ikos namespace=kube-system
INFO[0601] terminating pod          name=nginx-701339712-u4fr3 namespace=chaoskube
INFO[1201] terminating pod          name=kube-proxy-gke-earthcoin-pool-3-5ee87f80-n72s namespace=kube-system
INFO[1802] terminating pod          name=nginx-701339712-bfh2y namespace=chaoskube
INFO[2402] terminating pod          name=heapster-v1.2.0-1107848163-bhtcw namespace=kube-system
INFO[3003] terminating pod          name=l7-default-backend-v1.0-o2hc9 namespace=kube-system
INFO[3603] terminating pod          name=heapster-v1.2.0-1107848163-jlfcd namespace=kube-system
INFO[4203] terminating pod          name=nginx-701339712-bfh2y namespace=chaoskube
INFO[4804] terminating pod          name=nginx-701339712-51nt8 namespace=chaoskube
...

chaoskube allows you to filter target pods by namespaces, labels, annotations and age, as well as exclude certain weekdays, times of day and days of the year from chaos.

How

Helm

You can install chaoskube with Helm. Follow Helm's Quickstart Guide and then install the chaoskube chart.

$ helm repo add chaoskube https://linki.github.io/chaoskube/
$ helm install chaoskube chaoskube/chaoskube --atomic --namespace=chaoskube --create-namespace

Refer to chaoskube on kubeapps.com to learn how to configure it and to find other useful Helm charts.

Raw manifest

Refer to the example manifest. Be sure to give chaoskube appropriate permissions using the provided ClusterRole.
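For orientation, here is a minimal sketch of what those permissions boil down to: chaoskube only needs to list and delete pods. Names are illustrative; refer to the repository's example manifests for the authoritative version.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaoskube
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "delete"]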

Configuration

By default chaoskube will be friendly and not kill anything. Once you have validated your target cluster, you may disable dry-run mode by passing the flag --no-dry-run. You can also specify a more aggressive interval and other supported flags for your deployment.

If you're running in a Kubernetes cluster and want to target the same cluster then this is all you need to do.

If you want to target a different cluster or want to run it locally, specify your cluster via the --master flag or provide a valid kubeconfig via the --kubeconfig flag. By default, it uses your standard kubeconfig path in your home directory, which means whatever the current context there is will be targeted.

If you want to increase or decrease the amount of chaos, change the interval between killings with the --interval flag. Alternatively, you can increase the number of replicas of your chaoskube deployment.

Remember that chaoskube by default kills any pod in all your namespaces, including system pods and itself.

chaoskube provides a simple HTTP endpoint that can be used to check that it is running. This can be used for Kubernetes liveness and readiness probes. By default, this listens on port 8080. To disable, pass --metrics-address="" to chaoskube.
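For example, a liveness probe can point at that endpoint. The snippet below is a minimal sketch; it assumes the health check is served at /healthz on the default port 8080, so adjust the path and port to whatever your deployment actually exposes.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  timeoutSeconds: 5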

Filtering targets

However, you can limit the search space of chaoskube by providing label, annotation, and namespace selectors, pod name include/exclude patterns, as well as a minimum age setting.

$ chaoskube --labels 'app=mate,chaos,stage!=production'
...
INFO[0000] setting pod filter       labels="app=mate,chaos,stage!=production"

This selects all pods that have the label app set to mate, the label chaos set to anything and the label stage not set to production or unset.

You can filter target pods by namespace selector as well.

$ chaoskube --namespaces 'default,testing,staging'
...
INFO[0000] setting pod filter       namespaces="default,staging,testing"

This will filter for pods in the three namespaces default, staging and testing.

Namespaces can additionally be filtered by a namespace label selector.

$ chaoskube --namespace-labels='!integration'
...
INFO[0000] setting pod filter       namespaceLabels="!integration"

This will exclude all pods from namespaces with the label integration.

You can filter target pods by OwnerReference's kind selector.

$ chaoskube --kinds '!DaemonSet,!StatefulSet'
...
INFO[0000] setting pod filter       kinds="!DaemonSet,!StatefulSet"

This will exclude any DaemonSet and StatefulSet pods.

$ chaoskube --kinds 'DaemonSet'
...
INFO[0000] setting pod filter       kinds="DaemonSet"

This will only include any DaemonSet pods.

Please note: any include filter will automatically exclude all the pods with no OwnerReference defined.

You can filter pods by name:

$ chaoskube --included-pod-names 'foo|bar' --excluded-pod-names 'prod'
...
INFO[0000] setting pod filter       excludedPodNames=prod includedPodNames="foo|bar"

This will cause only pods whose name contains 'foo' or 'bar' and does not contain 'prod' to be targeted.

You can also exclude namespaces and mix and match with the label and annotation selectors.

$ chaoskube \
    --labels 'app=mate,chaos,stage!=production' \
    --annotations '!scheduler.alpha.kubernetes.io/critical-pod' \
    --namespaces '!kube-system,!production'
...
INFO[0000] setting pod filter       annotations="!scheduler.alpha.kubernetes.io/critical-pod" labels="app=mate,chaos,stage!=production" namespaces="!kube-system,!production"

This further limits the search space of the above label selector by also excluding any pods in the kube-system and production namespaces, as well as ignoring all pods that are marked as critical.

The annotation selector can also be used to run chaoskube as a cluster addon and allow pods to opt-in to being terminated as you see fit. For example, you could run chaoskube like this:

$ chaoskube --annotations 'chaos.alpha.kubernetes.io/enabled=true' --debug
...
INFO[0000] setting pod filter       annotations="chaos.alpha.kubernetes.io/enabled=true"
DEBU[0000] found candidates         count=0
DEBU[0000] no victim found

Unless you already use that annotation somewhere, this will initially ignore all of your pods (you can see the number of candidates in debug mode). You could then selectively opt-in individual deployments to chaos mode by annotating their pods with chaos.alpha.kubernetes.io/enabled=true.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        chaos.alpha.kubernetes.io/enabled: "true"
    spec:
      ...

You can exclude pods that have recently started by using the --minimum-age flag.

$ chaoskube --minimum-age 6h
...
INFO[0000] setting pod filter       minimumAge=6h0m0s

Limit the Chaos

You can limit the times when chaos is introduced by weekdays, time periods of a day, days of the year, or all of them together.

Add a comma-separated list of abbreviated weekdays via the --excluded-weekdays option, a comma-separated list of time periods via the --excluded-times-of-day option and/or a comma-separated list of days of the year via the --excluded-days-of-year option, and specify a --timezone by which to interpret them.

$ chaoskube \
    --excluded-weekdays=Sat,Sun \
    --excluded-times-of-day=22:00-08:00,11:00-13:00 \
    --excluded-days-of-year=Apr1,Dec24 \
    --timezone=Europe/Berlin
...
INFO[0000] setting quiet times      daysOfYear="[Apr 1 Dec24]" timesOfDay="[22:00-08:00 11:00-13:00]" weekdays="[Saturday Sunday]"
INFO[0000] setting timezone         location=Europe/Berlin name=CET offset=1

Use UTC, Local or pick a timezone name from the (IANA) tz database. If you're testing chaoskube from your local machine, then Local makes the most sense. Once you deploy chaoskube to your cluster, you should deploy it with a specific timezone, e.g. where most of your team members live, so that both your team and chaoskube have a common understanding of when a particular weekday begins and ends. If your team is spread across multiple time zones, it's probably best to pick UTC, which is also the default. Picking the wrong timezone shifts the meaning of a particular weekday by a couple of hours between you and the server.

Flags

| Option | Environment | Description | Default |
|--------|-------------|-------------|---------|
| --interval | CHAOSKUBE_INTERVAL | interval between pod terminations | 10m |
| --labels | CHAOSKUBE_LABELS | label selector to filter pods by | (matches everything) |
| --annotations | CHAOSKUBE_ANNOTATIONS | annotation selector to filter pods by | (matches everything) |
| --kinds | CHAOSKUBE_KINDS | owner's kind selector to filter pods by | (all kinds) |
| --namespaces | CHAOSKUBE_NAMESPACES | namespace selector to filter pods by | (all namespaces) |
| --namespace-labels | CHAOSKUBE_NAMESPACE_LABELS | label selector to filter namespaces and their pods by | (all namespaces) |
| --included-pod-names | CHAOSKUBE_INCLUDED_POD_NAMES | regular expression pattern for pod names to include | (all included) |
| --excluded-pod-names | CHAOSKUBE_EXCLUDED_POD_NAMES | regular expression pattern for pod names to exclude | (none excluded) |
| --excluded-weekdays | CHAOSKUBE_EXCLUDED_WEEKDAYS | weekdays when chaos is to be suspended, e.g. "Sat,Sun" | (no weekday excluded) |
| --excluded-times-of-day | CHAOSKUBE_EXCLUDED_TIMES_OF_DAY | times of day when chaos is to be suspended, e.g. "22:00-08:00" | (no times of day excluded) |
| --excluded-days-of-year | CHAOSKUBE_EXCLUDED_DAYS_OF_YEAR | days of a year when chaos is to be suspended, e.g. "Apr1,Dec24" | (no days of year excluded) |
| --timezone | CHAOSKUBE_TIMEZONE | timezone from tz database, e.g. "America/New_York", "UTC" or "Local" | UTC |
| --max-runtime | CHAOSKUBE_MAX_RUNTIME | maximum runtime before chaoskube exits | -1s (infinite) |
| --max-kill | CHAOSKUBE_MAX_KILL | maximum number of pods to be terminated per interval | 1 |
| --minimum-age | CHAOSKUBE_MINIMUM_AGE | minimum age to filter pods by | 0s (matches every pod) |
| --dry-run | CHAOSKUBE_DRY_RUN | don't kill pods, only log what would have been done | true |
| --log-format | CHAOSKUBE_LOG_FORMAT | format of the log messages; options are "text" and "json" | text |
| --log-caller | CHAOSKUBE_LOG_CALLER | include the calling function name and location in the log messages | false |
| --slack-webhook | CHAOSKUBE_SLACK_WEBHOOK | address of the Slack webhook for notifications | disabled |
| --client-namespace-scope | CHAOSKUBE_CLIENT_NAMESPACE_SCOPE | scope Kubernetes API calls to the given namespace | (all namespaces) |

Related work

There are several other projects that allow you to create some chaos in your Kubernetes cluster.

  • kube-monkey is a sophisticated pod-based chaos monkey for Kubernetes. Each morning it compiles a schedule of pod terminations that should happen throughout the day. It allows you to specify a mean time between failures on a per-pod basis, a feature that chaoskube lacks. It can also be made aware of groups of pods forming an application so that it can treat them specially, e.g. kill all pods of an application at once. kube-monkey allows filtering targets globally via configuration options as well as letting pods opt in to chaos via annotations; individual apps can opt in in their own unique way, e.g. app-a can request that one of its pods be killed each weekday, while a more courageous app-b can request that 50% of its pods be killed. It understands a configuration file similar to the one used by Netflix's Chaos Monkey.
  • PowerfulSeal is indeed a powerful tool to trouble your Kubernetes setup. Besides killing pods it can also take out your Cloud VMs or kill your Docker daemon. It has a vast number of configuration options to define what can be killed and when. It also has an interactive mode that allows you to kill pods easily.
  • fabric8's chaos monkey: A chaos monkey that comes bundled as an app with fabric8's Kubernetes platform. It can be deployed via a UI and reports any actions taken as a chat message and/or desktop notification. It can be configured with an interval and a pod name pattern that possible targets must match.
  • k8aos: An interactive tool that can issue a series of random pod deletions across an entire Kubernetes cluster or scoped to a namespace.
  • pod-reaper kills pods based on an interval and a configurable chaos chance. It allows you to specify possible target pods via a label selector and namespace. It can shut itself down after a while and therefore might work well with Kubernetes Job objects. It can also be configured to kill every pod that has been running for longer than a configurable duration.
  • kubernetes-pod-chaos-monkey: A very simple random pod killer using kubectl written in a couple lines of bash. Given a namespace and an interval it kills a random pod in that namespace at each interval. Pretty much like chaoskube worked in the beginning.
  • kubeinvaders: a gamified chaos engineering tool for Kubernetes. It is like Space Invaders, but the aliens are pods or worker nodes.

Acknowledgements

This project wouldn't be where it is without the ideas and help of several awesome contributors.

Contributing

Feel free to create issues or submit pull requests.


chaoskube's Issues

Lifecycle with release tags

We would like to automatically pull the latest stable release as soon as it is released instead of updating the tag manually. Is it safe to use the latest or master image tag for that? Or would it be possible for you to tag the latest stable release accordingly, e.g. with stable or release?

Docs reference a binary, but instruct helm installation

Hey, thanks for this useful tool. The documentation has me confused: it instructs installation via a Helm deployment (which is cool), but 90% of the documentation references a binary (chaoskube) and there isn't any documentation on how to get access to that binary. Am I supposed to clone this repository and put something in my PATH? Am I expected to create a local binary which runs this tool through Docker?

Add some interesting custom metrics

I'm a metrics newbie. Time to play around on a simple project like this.

  • time to filter pods (I guess it's quite expensive the way it's implemented now)
  • killed pods per minute, per namespace etc.
  • errors / successes
  • % blocked by pod disruption budgets

New chart compatible only with k8s > 1.9.0

Hi,

Since the last update to the chart (version 0.8.1) it now supports only K8s > 1.9.0 (apps/v1 API), which is a blocker for us.
Is this change a must, or could you reconsider it?

I've successfully run the 0.10.0 image on my k8s 1.8 cluster without the Helm chart.

reduce size of "official" image

I'm building the image on quay.io so I have to compile the binary in the image which greatly increases its size.

Options:

  • build somewhere else where compilation of the binary and docker building can be split (drone, google container builder)
  • remove everything and have the image squashed by quay.io?

Filter namespaces by labels

Our cluster includes ephemeral, randomly named namespaces which are used to run automated integration tests as part of a Jenkins pipeline. Rather than tagging numerous pods with labels to exclude them, we'd like to be able to exclude entire namespaces using labels.

This could be achieved with an additional option, --namespace-labels. Just wondering if this would be a useful addition? I'd be happy to raise a PR for this.

Add a makefile to run tests and build more easily

Currently one has to run go test ./... and go build main.go to test and build the binary.

To make it easier for people cloning/forking this repository we should add a simple Makefile to run these tasks.

Observe non-working times feature

I'm proposing a feature addition to chaoskube that would add the ability to suspend the chaos during nights, weekends and holidays using the following command-line options. These are designed to be somewhat consistent with the current pattern of chaoskube options as well as the configuration options for Chaos Monkey. They should be self-explanatory:

--observe-off-times true # defaults to false
--location 'America/New_York' # , or 'UTC'. Req'd if observe-off-times is true
--offdays 'Saturday, Sunday'         # default
--workhours 'start=09:00, end=17:00' # default
--holidays '2017-12-25, 2018-01-01'  # defaults to empty list

The options above imply that both --observe-off-times true and --location '...' must be present for the feature to take effect. There is purposefully no default location so the user is forced to provide this, since most SRE staff is probably not working in the GMT timezone, so defaulting to UTC would not really make sense in this case.

Note that this requires an IANA Time Zone as opposed to a three-letter timezone abbreviation such as 'EDT' or 'EST', which would have to change with Daylight Saving conventions. Daylight Saving is automatically accounted for by using the IANA Time Zones.

I intend to post a PR as soon as I have this implemented, but wanted to get some feedback in case I'm missing something.

ChaosKube doesn't work in my K8s cluster (GCP)

Hi,

I have deployed the YAMLs in the examples on my K8s cluster. It installed successfully, but it doesn't kill any pods or anything like that. Has anything changed in the installation? Can you help me please?

Regards
Subin

Filter pods based on age

Thanks for this project. I'd started to write my own when I found this.

I'd like to filter pods by their age. Our use case is to not delete pods that are younger than a certain time.

Any interest in adding a min-age flag? I'd be happy to do the work and submit a PR if so.

Thanks!

Add killed pod metric

It would be nice to have a metric for which pod has been killed, with labels like namespace and pod (name of the pod).

That way teams can have easy monitoring of their killed pods.

Adding slack notifications

Hi,

I'm looking for a way to notify my team every time the chaos bot performs actions.
As Slack is widely used, that would be my preference.

I want to start and implement that capability for chaoskube.

Any thoughts?

respect pod disruption budgets

from #6 deployment limits

@kfox1111 So, I'd really like to use chaoskube to force our deployment objects to exercise their connection tracking/safe shutdown code. Some assurance that too many pods don't get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?

@linki I looked into PodDisruptionBudgets yesterday and they are pretty much what you want.

Kubernetes defines voluntary evictions (e.g. due to draining, auto-downscaling, etc.) and involuntary pod evictions (node failures etc.).

With those budgets you define a label selector and a minimum number of pods matching this selector that should exist. If evicting a pod would violate that, you cannot evict it; you can still delete it, though. kubectl drain uses evict under the hood in order to honor the disruption budgets. You can still fall under your minimum when an involuntary eviction happens while you are at the minimum value of your disruption budget.

I tested it yesterday with chaoskube and it works as expected. Unfortunately, the golang fake client that I use for writing tests doesn't quite show the same behaviour. It's usually very accurate.

The outcome should be that chaoskube can be run in a mode that respects the budgets and in one that doesn't, for true chaos.
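For illustration, a PodDisruptionBudget of the kind discussed above might look like this (a sketch with hypothetical names and numbers, not taken from chaoskube itself):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app          # hypothetical name
spec:
  minAvailable: 2       # keep at least two pods matching the selector at all times
  selector:
    matchLabels:
      app: my-app       # hypothetical label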

how to turn off the dry-run mode

As the README documents, chaoskube runs in dry-run mode by default; that's okay, and the actual behavior aligns with that. After I confirmed the targets and wanted to turn off dry-run mode, I failed. The README only mentions that users can turn it off; I'm wondering how to do it.


PS: I am using version v0.14.0, via Helm chart 3.1.2, which I believe is the latest one.

I found another issue #103 but seems not the same thing.

Selection of pods to be terminated by using multiple labels is not working for me

Hi Martin,

I am struggling with multiple labels. I need to select pods for termination using labels (one key with multiple values). If there is only one value, everything works fine. When there are multiple values, chaoskube runs (in dry-run mode), there is no error output, but there is also no log about killing the pods. I tried the following syntax for the "labels" line of the YAML file:

  • --labels=app=xxx, yyy
  • --labels=app=xxx,yyy
  • --labels=app=xxx,app=yyy

No luck so far :-(

What is the correct syntax for "labels"? We want to create a "pool" of pods from multiple applications to be killed.

What would be the syntax if I ever need to select pods by multiple keys with multiple values?

Thank You,

Ladislav

Allow configuration of grace period when deleting pods

Hello,

First of all, thanks for your awesome work!

It makes sense to allow one to configure the grace period that K8s will give to the deleted pod before violently killing it (using SIGKILL). According to the documentation, the default value (at least for delete commands issued via kubectl) is 30 seconds, which I think is more than enough time for applications that support graceful shutdown.

I've checked the code, and ATM the grace period parameter is not supplied when invoking the K8s API:

return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, nil)

I've currently applied the following patch in order to kill processes almost immediately:

diff --git a/chaoskube/chaoskube.go b/chaoskube/chaoskube.go
index 645b6e8..be7828b 100644
--- a/chaoskube/chaoskube.go
+++ b/chaoskube/chaoskube.go
@@ -171,7 +171,9 @@ func (c *Chaoskube) DeletePod(victim v1.Pod) error {
 		return nil
 	}
 
-	return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, nil)
+	secs := int64(0)
+	deleteopts := &metav1.DeleteOptions{GracePeriodSeconds: &secs}
+	return c.Client.Core().Pods(victim.Namespace).Delete(victim.Name, deleteopts)
 }
 
 // filterByNamespaces filters a list of pods by a given namespace selector.

Please let me know if any more details are needed.
Cheers,
Ivan

[power] chaoskube (v0.6.1) image is not available for power.

I am running the following OS on a POWER machine.

Linux icp1p1 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:55:56 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

After deploying chaoskube on the host, I get the following error:
standard_init_linux.go:185: exec user process caused "exec format error"
chaoskube runs well on x86 hosts, but not on POWER. Is there any workaround to run it on POWER, or is there an image available for POWER machines?

Consider changing constructor parameters to a struct

The number of arguments for Chaoskube.New() has increased to a point where it becomes annoying to use.

We could switch to a struct to get order-independent arguments by name. It would also allow us to leave out keys when we want the default value.

chaoskube fails when starting with read-only filesystem

kubectl logs -f chaoskube-7b68cccbcf-g67cx
time="2019-08-08T19:38:36Z" level=info msg="starting up" dryRun=true interval=5s version=v0.15.0
W0808 19:38:36.596835       6 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-7b68cccbcf-g67cx.nobody.log.WARNING.20190808-193836.6: read-only file system

That's a regression. I thought the switch to klog removed the need to rewrite the import. It seems klog writes to disk like glog.

version: v0.15.0 with

        securityContext:
          capabilities:
            drop:
            - ALL
          procMount: Default
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534

Cannot set multiple days in excluded Weekdays

$ helm install stable/chaoskube --set dryRun=false --set namespaces=test --set interval=10m --set timezone=America/New_York --set excludedWeekdays="Mon,Wed,Thu,Fri" --set excludedTimesOfDay="08:00-18:00"

Error: failed parsing --set data: key "Wed" has no value (cannot end with ,)
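A possible workaround (a sketch, not verified against this chart): Helm parses commas in --set values as separators between keys, so either escape them (e.g. excludedWeekdays="Mon\,Wed\,Thu\,Fri") or pass the values via a file with -f, using the value names from the command above:

# values.yaml (value names copied from the failing command above)
dryRun: false
namespaces: test
interval: 10m
timezone: America/New_York
excludedWeekdays: "Mon,Wed,Thu,Fri"
excludedTimesOfDay: "08:00-18:00"

Then install with: helm install stable/chaoskube -f values.yaml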


Exclude evicted pods from termination

I tested your simple but effective chaoskube and I think it's really useful for us.

Do you have any plans to exclude evicted pods from terminations?

e.g.
kubectl get pods -n {your_namespace} -w

cassandra-db-fc85698c7-hsw6w                      1/1       Running            0          7m
cassandra-db-fc85698c7-l57ms                      0/1       Evicted            0          4h
cassandra-db-fc85698c7-pg8ps                      0/1       Evicted            0          4h
cassandra-db-fc85698c7-swpk9                      0/1       Evicted            0          4h
cassandra-db-fc85698c7-tv4nv                      0/1       Evicted            0          4h
cassandra-db-fc85698c7-zjvwz                      0/1       Evicted            0          4h

In this case chaoskube tries to terminate pods which have already been evicted.

If this feature is already implemented, can anybody guide me through the configuration?

Observation: pods which are in an error state or not fully started become victims of chaoskube.

ChaosKube doesn't work

Hi Guys,

I have created a Kops cluster using the following commands (cluster-level RBAC not enabled yet), but chaoskube doesn't kill any pods. Please help me if anything is wrong.

+++++++++++
kops create cluster --cloud "gce" --name test.k8s.local --zones=us-east1-b --master-zones=us-east1-b --state gs://testbucket --master-size n1-standard-2 --node-size n1-standard-4 --node-count 1 --admin-access 104.xxx.xx.xxx/32
+++++++++++

Chaoskube link used: https://github.com/linki/chaoskube/tree/master/examples

Please see my chaoskube pod logs:
+++++
kubectl logs -f chaoskube-6d95c94b4d-nrqjn

time="2018-09-17T10:24:53Z" level=info msg="starting up" dryRun=true interval=2m0s version=v0.10.0
time="2018-09-17T10:24:53Z" level=info msg="connected to cluster" master="https://100.56.0.7:443" serverVersion=v1.10.3
time="2018-09-17T10:24:53Z" level=info msg="setting pod filter" annotations="chaos.alpha.kubernetes.io/enabled=true" labels= namespaces="!kube-system"
time="2018-09-17T10:24:53Z" level=info msg="setting quiet times" daysOfYear="[Apr 1 Dec24]" timesOfDay="[]" weekdays="[Saturday Sunday]"
time="2018-09-17T10:24:53Z" level=info msg="setting timezone" location=UTC name=UTC offset=0
+++++

Here are my edited YAMLs:


cat rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaoskube
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaoskube
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaoskube
subjects:
- kind: ServiceAccount
  name: chaoskube
  namespace: default

cat chaoskube.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaoskube
  labels:
    app: chaoskube
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: chaoskube
  template:
    metadata:
      labels:
        app: chaoskube
    spec:
      serviceAccountName: chaoskube
      containers:
      - name: chaoskube
        image: quay.io/linki/chaoskube:v0.10.0
        args:
        # kill a pod every 10 minutes
        - --interval=02m
        # only target pods in the test environment
        #- --labels=environment=test
        # only consider pods with this annotation
        - --annotations=chaos.alpha.kubernetes.io/enabled=true
        # exclude all pods in the kube-system namespace
        - --namespaces=!kube-system
        # don't kill anything on weekends
        - --excluded-weekdays=Sat,Sun
        # don't kill anything during the night or at lunchtime
        #- --excluded-times-of-day=22:00-08:00,11:00-13:00
        # don't kill anything as a joke or on christmas eve
        - --excluded-days-of-year=Apr1,Dec24
        # let's make sure we all agree on what the above times mean
        - --timezone=UTC
        # exclude all pods that haven't been running for at least one hour
        - --minimum-age=1m
        # terminate pods for real: this disables dry-run mode which is on by default
        # - --no-dry-run
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaoskube
  labels:
    app: chaoskube


Does chaoskube really kill the pods?

Hi Martin,

I am currently working on a project where we are trying to improve the reliability of our software using chaos engineering (but, unfortunately, have very little experience with it). Currently, our software runs on Azure/Kubernetes.

We found chaoskube to be a promising tool to help us, but we found out that its behavior is different than expected. The description of chaoskube says that it kills pods, so I created a hypothesis about what would happen when one of our pods is handling a request at the moment it is killed (there should be an error response and subsequent requests should be processed by the other pod). When I ran the experiment, the pods were killed but no error occurred.

Then one of my colleagues looked at the source code of chaoskube and found out that the pod is not killed (i.e. force-killed instantly) but rather terminated (if I got it correctly, with this approach the pod finishes dealing with its current task and then "dies" peacefully).

Is this really how chaoskube works?

We are learning more about chaos every day, but there is a lot of knowledge that we need to gain.

Since my hypothesis was probably wrong, I would be really grateful for any advice about what other chaos experiments chaoskube is suitable for.

Thank You,

Ladislav

Decouple individual pod termination frequency from cluster size

Currently, the probability of a pod being killed depends on the number of pods in the target group. This is bad if you want to run chaoskube as a cluster addon and let pods opt in to being killed via annotations, because you cannot estimate how often that would happen.

Proposal

Allow specifying, or at least somehow keeping track of, what's going on so that pod terminations happen in a somewhat predictable way. For example, instead of terminating a single pod every 10 minutes, each pod may have a probability of X% of being killed per hour. This, hopefully, would make pod terminations independent of the number of pods running in total.

Terminating multiple pods per run

Hi!

I am testing ChaosKube on k8s cluster with a large number of pods. The current approach of terminating only one pod per run means that some pods will not be scheduled for termination, given how large the pool is.

Would you be interested in a PR that adds a new configuration option (defaults to 1, current behavior) to override the current behavior?

Something like --max-kill=10, would attempt to terminate up to 10 pods.

Let me know if this feature makes sense for the project and I'll happily submit a PR.

Add /healthz HTTP endpoint that can be used as a livenessProbe

It'd be good to have an HTTP endpoint that could be used for Kubernetes readiness and liveness probes. It could return 200 OK. We could create the HTTP listener just before the infinite for loop in a goroutine.

I'm happy to do a PR if you think this is a good idea.

Allow running with operator role

Hi,

I have a deployment where I'm using the operator role for my kubernetes namespace, so I have full access, but only within my own namespace. chaoskube becomes ready but fails to operate.

pods is forbidden: User \\\"system:serviceaccount:poirot-test:operator\\\" cannot list pods at the cluster scope: unauthorized access system:serviceaccount:xxxxxxxx:operator/[system:serviceaccounts system:serviceaccounts:xxxxxxxx system:authenticated]

filter target pods by attributes

from #2 (comment)

Another possibility is attributes.

@kfox1111 what do you have in mind?
something like this maybe:
all pods with some value for an attribute (e.g. serviceAccountName=default)
all pods containing at least one container with some value for an attribute (e.g. image=nginx )

Add k8s event to the resource owning the pod(s) being terminated.

Hi @linki,

Thoughts on also adding the termination event to the topmost owner of a pod in addition to the pod itself?

We're attempting to add visibility for application owners when their pods get terminated and new pods are starting. The terminated pods kind of disappear from the current view of the deployment, which adds a bit of work to find out whether any pod has been terminated by ChaosKube.

I know that it's possible to look at events in that namespace to find out what happened, but I believe that adding an event to the deployment/parent of the pod would greatly help in surfacing the actions of ChaosKube.

If that sounds like a useful feature, I'll happily submit a PR.

Using the inClusterConfig. This might not work.

Since #28 I am seeing the error:

log: exiting because of error: log: cannot create log: open /tmp/chaoskube.chaoskube-production-
4075332500-53jjx.unknownuser.log.WARNING.20170707-151720.1: no such file or directory

I see that the `config, err := rest.InClusterConfig()` call was removed; this might be the cause of the error.

"/root/.kube/config: no such file or directory" fatal error

I installed the project and ran this command:

chaoskube --interval=1m --debug --deploy

I'm getting a CrashLoopBackOff with this error:

2017-08-22T21:47:40.214051821Z time="2017-08-22T21:47:40Z" level=info msg="Dry run enabled. I won't kill anything. Use --no-dry-run when you're ready." 
2017-08-22T21:47:40.214124368Z time="2017-08-22T21:47:40Z" level=debug msg="Using current context from kubeconfig at /root/.kube/config." 
2017-08-22T21:47:40.214130887Z time="2017-08-22T21:47:40Z" level=fatal msg="stat /root/.kube/config: no such file or directory" 

Any insights?

deployment limits

So, I'd really like to use chaoskube to force our deployment objects to exercise their connection tracking/safe shutdown code. Some assurance that too many pods don't get killed would be good though. Would it be possible to add support for looking at the .spec.strategy.rollingUpdate.maxUnavailable field and the .spec.replicas field to ensure not too many are out at a time?
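For reference, the fields mentioned above live on the Deployment spec; below is a minimal sketch with purely illustrative names and values.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  replicas: 4                 # .spec.replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1       # .spec.strategy.rollingUpdate.maxUnavailable
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx          # placeholder image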

Create more chaos during business hours

Chaoskube should be able to create more chaos during hours where people are around to notice, fix and learn from failures. It makes no sense for the artificial chaos to occur at night and get people paged when they are asleep.
