
cerberus's Introduction

Cerberus

Guardian of Kubernetes and OpenShift Clusters

Cerberus logo

Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures/health, and exposes a go or no-go signal that other workload generators or applications in the cluster can consume and act on accordingly.

Workflow

Cerberus workflow

Installation

Instructions on how to set up, configure and run Cerberus can be found at Installation.

What Kubernetes/OpenShift components can Cerberus monitor?

The following are the components of Kubernetes/OpenShift that Cerberus can monitor today; more will be added soon.

Component Description Working
Nodes Watches all the nodes including masters, workers as well as nodes created using custom MachineSets ✔️
Namespaces Watches all the pods including containers running inside the pods in the namespaces specified in the config ✔️
Cluster Operators Watches all Cluster Operators ✔️
Masters Schedulability Watches and warns if master nodes are marked as schedulable ✔️
Routes Watches specified routes ✔️
CSRs Warns if any CSRs are not approved ✔️
Critical Alerts Warns the user on observing abnormal behavior which might affect the health of the cluster ✔️
Bring your own checks Users can bring their own checks and Cerberus runs and includes them in the reporting as well as the go/no-go signal ✔️

All the components that Cerberus can monitor are explained here

How does Cerberus report cluster health?

Cerberus exposes the cluster health and failures through a go/no-go signal, report and metrics API.

Go or no-go signal

When Cerberus is configured to run in daemon mode, it continuously monitors the specified components, runs a lightweight HTTP server at http://0.0.0.0:8080 and publishes the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
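Below is a minimal sketch (not part of Cerberus itself) of how a consuming tool might poll the go/no-go signal; the URL assumes the default daemon-mode address mentioned above.

# poll_cerberus.py - hypothetical consumer of the Cerberus go/no-go signal
import requests

CERBERUS_URL = "http://0.0.0.0:8080/"

def cluster_is_healthy():
    """Return True only when Cerberus publishes 'True' for the monitored components."""
    try:
        response = requests.get(CERBERUS_URL, timeout=5)
        return response.text.strip() == "True"
    except requests.RequestException:
        # Treat an unreachable Cerberus instance as a no-go to stay on the safe side.
        return False

if cluster_is_healthy():
    print("go: continue the workload")
else:
    print("no-go: pause or abort the workload")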

Report

The report is generated in the run directory and contains information about each check/monitored component status per iteration, with timestamps. It also displays information about the components in case of failure. Refer to the report for an example.

You can use the "-o <file_path_name>" option to change the location of the generated report.

Metrics API

Cerberus exposes the metrics, including the failures observed during the run, through an API. Tools consuming Cerberus can query the API to get a JSON blob of the observed failures and act accordingly. For example, we can query for etcd failures within a start and end time and use the result to determine pass/fail for test cases, or report whether the cluster was healthy or unhealthy for that duration. A query sketch in Python follows the list below.

  • The failures in the past 1 hour can be retrieved in JSON format by visiting http://0.0.0.0:8080/history.
  • The failures in a specific time window can be retrieved in JSON format by visiting http://0.0.0.0:8080/history?loopback=.
  • The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in JSON format by visiting the http://0.0.0.0:8080/analyze URL. The appropriate filters have to be applied to scrape the failures.
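A short sketch of querying the metrics API with Python; the query parameter name (loopback) and response shape are taken from the URLs above and may differ between Cerberus versions.

import requests

BASE = "http://0.0.0.0:8080"

# Failures observed in the past hour.
recent = requests.get(BASE + "/history", timeout=10).json()

# Failures observed in a custom window, here assumed to be the past 30 minutes.
window = requests.get(BASE + "/history", params={"loopback": 30}, timeout=10).json()

print(recent)
print(window)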

Slack integration

Cerberus supports reporting failures in Slack. Refer to slack integration for information on how to set it up.

Node Problem Detector

Cerberus can also consume node-problem-detector to detect various failures in Kubernetes/OpenShift nodes. More information on setting it up can be found at node-problem-detector.

Bring your own checks

Users can add additional checks to monitor components that are not covered by Cerberus and have them count towards the go/no-go signal. This is done by placing the relative paths of the files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If an additional check should be considered when determining the go/no-go signal, its main function can return a boolean. Returning a dict of the form {'status':status, 'message':message} sends the signal to Cerberus along with a message to be displayed in the Slack notification. Returning a value is optional. Refer to example_check for an example custom check file.
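A sketch of what such a custom check file might look like, following the contract described above (all logic in main(), optionally returning a boolean or a {'status': ..., 'message': ...} dict). The file name and the etcd pod count threshold are purely illustrative.

# custom_checks/check_etcd_pods.py - hypothetical custom check
import subprocess

def main():
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "openshift-etcd", "--no-headers"],
        capture_output=True, text=True,
    )
    running = [line for line in result.stdout.splitlines() if "Running" in line]
    status = len(running) >= 3
    message = "custom check: %d etcd pods running" % len(running)
    # Returning a dict lets Cerberus show the message in the Slack notification.
    return {"status": status, "message": message}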

Alerts

Monitoring metrics and alerting on abnormal behavior is critical, as they are indicators of cluster health. Information on supported alerts can be found at alerts.

Use cases

There can be any number of use cases; here are some of them:

  • We run tools that push the limits of Kubernetes/OpenShift to look at performance and scalability. There are a number of instances where system components or nodes start to degrade, which invalidates the results while the workload generator continues to push the cluster until it is unrecoverable.

  • When running chaos experiments on a Kubernetes/OpenShift cluster, they can potentially break components unrelated to the targeted ones, failures the chaos experiment itself won't detect. The go/no-go signal can be used here to decide whether the cluster recovered from the failure injection, as well as whether to continue with the next chaos scenario.

Tools consuming Cerberus

  • Benchmark Operator: The intent of this Operator is to deploy common workloads to establish a performance baseline of a Kubernetes cluster on your provider. Benchmark Operator consumes Cerberus to determine if the cluster was healthy during the benchmark run. More information can be found at cerberus-integration.

  • Kraken: Tool to inject deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient. Kraken consumes Cerberus to determine if the cluster is healthy as a whole, in addition to the targeted component, during chaos testing. More information can be found at cerberus-integration.

Blogs and other useful resources

Contributions

We are always looking for enhancements and fixes to make it better; any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.

More information on how to Contribute

Community

Key Members (slack_usernames): paige, rook, mffiedler, mohit, dry923, rsevilla, ravi

Credits

Thanks to Mary Shakshober (https://github.com/maryshak1996) for designing the logo.

cerberus's People

Contributors

aakarshg, amitsagtani97, chaitanyaenr, dry923, harshil-codes, harshith-umesh, jordigilh, jtaleric, kedark3, lalan7, learnitall, mffiedler, mohit-sheth, paigerube14, ratsuf, rsevilla87, sandrobonazzola, smalleni, yashashreesuresh


cerberus's Issues

Cerberus scalability issues

A simple run of Cerberus on a 10 node cluster vs a 220 node cluster showed that we need to improve the way we run the checks for Cerberus to scale well on clusters with hundreds or thousands of nodes. The time taken to run the checks has been increasing as we add more checks, as @mffiedler mentioned in #53. Here are the observed timings:

10 nodes:
2020-05-22 23:00:17,391 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_nodes in iteration 1: 0.1057283878326416 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3759939670562744 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_namespaces in iteration 1: 1.1533699035644531 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run entire_iteration in iteration 1: 4.368650436401367 seconds
2020-05-22 23:00:17,392 [INFO] --------------------------------------------------------------------------

220 nodes:
2020-05-22 23:13:00,130 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:13:00,130 [INFO] Time taken to run watch_nodes in iteration 2: 19.62144660949707 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_cluster_operators in iteration 2: 1.0622196197509766 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_namespaces in iteration 2: 68.95069146156311 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run entire_iteration in iteration 2: 161.81616592407227 seconds
2020-05-22 23:13:00,131 [INFO] --------------------------------------------------------------------------

@portante suggested areas for improvement. Multiprocessing will reduce the timing by using available cores to run checks in parallel (#60), but we also need to look at optimizing the code to reduce the number of API calls and the loops that iterate over objects to get the status, wherever possible. This issue is to track the observations and discuss ways to make Cerberus scale well on a large and dense cluster.

NOTE: The 10 nodes and 220 nodes clusters were different.

Thoughts?

Test to check the functionality

We need tests in CI to make sure a new commit does not break the functionality of Cerberus. We can run the tool with iterations set to 3 and sleep time between iterations set to 30 seconds, instead of daemon mode, to see if it works as expected.

Slack Integration "Exception: name 'thread_ts' is not defined"

I'm integrating Cerberus with my Slack workspace and, after working through things, thought I had it working. However, today Cerberus started returning a no-go signal based on an issue and I was expecting to see it pop up in Slack, but didn't. Looking at the output for Cerberus I see this error message appearing regularly:

Feb 11 09:30:49 lab-server bash[368944]: 2021-02-11 09:30:49,399 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
**Feb 11 09:30:49 lab-server bash[368944]: 2021-02-11 09:30:49,399 [INFO] Exception: name 'thread_ts' is not defined**
Feb 11 09:30:49 lab-server bash[369025]: 2021-02-11 09:30:49,756 [INFO] Iteration 205: Node status: True
Feb 11 09:30:50 lab-server bash[369028]: 2021-02-11 09:30:50,071 [INFO] Iteration 205: Cluster Operator status: True
Feb 11 09:30:50 lab-server bash[369027]: 2021-02-11 09:30:50,219 [INFO] Iteration 205: openshift-machine-api: True
Feb 11 09:30:50 lab-server bash[369032]: 2021-02-11 09:30:50,224 [INFO] Iteration 205: openshift-apiserver: True
Feb 11 09:30:50 lab-server bash[369033]: 2021-02-11 09:30:50,225 [INFO] Iteration 205: openshift-kube-controller-manager: True
Feb 11 09:30:50 lab-server bash[369029]: 2021-02-11 09:30:50,240 [INFO] Iteration 205: openshift-kube-apiserver: True
Feb 11 09:30:50 lab-server bash[369024]: 2021-02-11 09:30:50,242 [INFO] Iteration 205: openshift-sdn: True
Feb 11 09:30:50 lab-server bash[369026]: 2021-02-11 09:30:50,253 [INFO] Iteration 205: openshift-etcd: True
Feb 11 09:30:50 lab-server bash[369023]: 2021-02-11 09:30:50,253 [INFO] Iteration 205: openshift-kube-scheduler: True
Feb 11 09:30:50 lab-server bash[369031]: 2021-02-11 09:30:50,254 [INFO] Iteration 205: openshift-ingress: False
Feb 11 09:30:50 lab-server bash[369034]: 2021-02-11 09:30:50,295 [INFO] Iteration 205: openshift-monitoring: True
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] HTTP requests served: 1
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] Iteration 205: Failed pods and components
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] openshift-ingress: ['router-default-7645688499-8j5c4']
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO]

I do not have any watcher or team alias defined in my config:

   slack_integration: True
    watcher_slack_ID:                                        
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:

Track Kube/OpenShift object restarts

@mffiedler suggested an enhancement for Cerberus to monitor pod restarts. Cerberus will miss a failure if a pod restarts and gets back to a running state during the wait time between iterations. It needs to check whether there was a restart in each iteration and take it into consideration when setting the go/no-go signal.

One way to implement this would be to keep track of the restart count at the start and end of each iteration and compare them to know if there was a pod restart during the wait time.
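A sketch of that idea using the kubernetes Python client (an assumption; the project may implement it differently): record the restart counts at the start of an iteration and compare them at the end.

from kubernetes import client, config

def restart_counts(namespace):
    """Map 'pod/container' to its restart count for every container in the namespace."""
    v1 = client.CoreV1Api()
    counts = {}
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            counts[pod.metadata.name + "/" + cs.name] = cs.restart_count
    return counts

config.load_kube_config()
before = restart_counts("openshift-etcd")
# ... sleep_time elapses between iterations ...
after = restart_counts("openshift-etcd")
restarted = [name for name, count in after.items() if count > before.get(name, 0)]
if restarted:
    print("containers restarted between iterations:", restarted)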

Cerberus errors out when unable to find namespace

When launching the container and it is unable to find the requested namespaces (openshift-sdn and openshift-ovn-kubernetes), Cerberus errors out. Should this be an error, or should we just report a failed state but continue to start up Cerberus?

2020-04-27 20:23:35,693 [ERROR] Could not find openshift-sdn and openshift-ovn-kubernetes namespaces,         please specify the correct networking namespace in config file
Traceback (most recent call last):
  File "start_cerberus.py", line 223, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 50, in main
    watch_namespaces = [w.replace('openshift-sdn', sdn_namespace) for w in watch_namespaces]
  File "start_cerberus.py", line 50, in <listcomp>
    watch_namespaces = [w.replace('openshift-sdn', sdn_namespace) for w in watch_namespaces]
TypeError: replace() argument 2 must be str, not None

CI for running linters on PR's

We need to enable Travis CI to run linters on each PR to make sure it follows best practices and doesn't break the tool because of things like wrong indentation, etc.

Ability to add/enable collections of optional monitors

This might be stretching the original intent of Cerberus, but I see a trend. As we add additional checks, the patterns we could follow are a) make the new check a default and always run it, b) give the new check an option in the config, or c) introduce the idea of collections of optional checks, or maybe just one optional collection of "verbose health checks" for simplicity.

There are a lot of detailed things that could be monitored on a cluster; whether Cerberus should monitor them is open for discussion (issue #42). As new checks are added, the monitor loop time grows at least linearly with the number of monitored namespaces, and faster when pod checks are included (PR #52).

For discussion: should we identify a core set of critical checks and enable some mechanism for optional/verbose checks without adding a config flag for every one of them?

/cc: @paigerube14 @chaitanyaenr @yashashreesuresh

Monitor kube scheduler

The tool needs to monitor the kube-scheduler, as it's one of the key components essential for cluster functionality.

[RFE] New API for capturing number of failures

Today we have a single signal, go/no-go; however, this signal could miss operators/pods flapping. We should provide an additional signal which accepts a time range and reports the number of failures seen within that window.

/history?lookback=60

lookback should be in minutes and should allow the user to determine how far back in time they want to know whether failures happened.

The return of this should be JSON:

history : {
  duration: int(),
  failures : {
    count: int(),
    issues : [ list of problems that occurred in the duration ]
  }
} 

The history api will provide the user/tool with more data to determine if they should continue on, or stop.

Thoughts?

Generate a report

The tool currently prints the events/logs to stdout; it needs to generate a report with timestamps that we can use to look at the events later.

Monitor apiserver availability

While executing chaos (kraken) scenarios, measure the amount of time the apiserver is unavailable. The goal is zero downtime, but that is not the current reality. While the apiserver is unavailable, cerberus should return a no-go signal and ideally maintain a metric with the amount of time it is unavailable.

This can be tested with calls to https://:<api_port>/healthz. For OpenShift this might look like https://api.mffiedler-511.perf-testing.devcluster.openshift.com:6443/healthz

Cerberus cop enhancement similar to OCP build cop

@jtaleric came up with an idea to add support for tagging specific people on Slack in case of cluster failures, instead of everyone in the channel. The Cerberus cop is responsible for taking a look and fixing or filing a BZ for the failures on the cluster. We can provide a config with a list of people/Slack handles, along with specific people assigned as cops for each day of the week, for Cerberus to read and only ping the active cop for the day.

Add python linter

@aakarshg suggested adding a linter; it is good practice to use linters to improve code quality. We need to enable it for every pull request using a CI like Travis.

Warn in case of pending csr's

Cerberus needs to check for any pending CSRs and warn the user, as nodes need their CSRs approved to be part of the cluster (this might be useful for monitoring the cluster shutdown scenario in Kraken). It shouldn't be considered for pass/fail, since a scale-up will have pending CSRs for a short duration and that is expected.

Incorrect results while checking master NoSchedule taint

When a master node has several taints associated with it, runcommand.invoke("kubectl describe nodes/" + node_name + ' | grep Taints') in get_taint_from_describe only greps the first taint, leading to incorrect results. We need to loop through all the taints to check whether the master node has the node-role.kubernetes.io/master:NoSchedule taint.
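A sketch of checking every taint instead of only the first line grep returns, using the kubernetes Python client rather than kubectl describe (an illustrative alternative, not the project's fix):

from kubernetes import client, config

def master_has_noschedule_taint(node_name):
    config.load_kube_config()
    node = client.CoreV1Api().read_node(node_name)
    for taint in node.spec.taints or []:
        if taint.key == "node-role.kubernetes.io/master" and taint.effect == "NoSchedule":
            return True
    return False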

Global debugging messages

How do you enable debug-level logging in the config files for Cerberus? It would be great if this were possible and documented.

Investigate Kubemark

A cluster with hundreds or thousands of nodes is needed to test how Cerberus runs at scale. In discussion with @smalleni about project Sahasra, which simulates compute nodes to stress the OpenStack control plane, we were wondering if doing the same would help us with testing Cerberus. We had an issue open about it to help with testing the control plane/cluster maximums without having to set up a large scale cluster, but Kubemark, being specific to Kubernetes, might not be the right fit.

We need to investigate Kubemark to see if it will satisfy our need for testing Cerberus at scale by simulating worker nodes to stress the control plane instead of using real hardware: https://github.com/kubernetes/kubernetes/tree/a8128804abbc311328b146c007df2cac09ba7fdf/test/kubemark/pre-existing.

NOTE: This might be an overkill for solving a simple problem :-)

Monitor application/ingress route availability (OpenShift specific monitor)

While executing chaos (kraken) scenarios, measure the amount of time the ingress routes to applications are unavailable. The goal is zero downtime, but that is not the current reality.

While the application routes are unavailable, Cerberus should return a no-go signal and ideally maintain a metric with the amount of time they are unavailable.

This should be an optional config item which:

  1. sets up a quickstart application (e.g. oc new-app -t nodejs-mongodb-example)
  2. retrieves the route for the new application
  3. monitors the availability of the route (HTTP 200 vs other); a polling sketch follows this list
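A rough polling sketch for step 3 (the route URL is a placeholder for whatever route step 2 returns):

import time
import requests

ROUTE = "http://nodejs-mongodb-example.apps.example.com"  # hypothetical route from step 2
INTERVAL = 5.0
downtime = 0.0

for _ in range(60):  # poll for roughly five minutes
    start = time.monotonic()
    try:
        ok = requests.get(ROUTE, timeout=3).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        downtime += INTERVAL
    time.sleep(max(0.0, INTERVAL - (time.monotonic() - start)))

print("route was unavailable for roughly %.0f seconds" % downtime)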

Pip package cerberus

This will allow users/tools to simply use pip to install and set up Cerberus, simplifying the process.

Enable support to run Cerberus tool on Power (ppc64le)

Task: Enable support to run Cerberus tool on Power (ppc64le)

The scenario is to run Cerberus tests on a host having Power (ppc64le) architecture. This would be beneficial to test clusters in environments having only Power VMs.

The Dockerfile that is currently available only runs on the Intel architecture.
To solve this, I have created a separate Dockerfile for ppc64le so that the tool can be run without any dependency problems.

Collect logs, events and metrics of the failed component

The tool needs to collect logs, metrics and events relevant to the components when there's a failure. This can be achieved by running the following command using the invoke module added in #20: $ oc adm inspect <failed_component_namespace>. Eventually, the collected data can be exposed by the http server along with the go/no-go signal.
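A sketch of how the collection step could be invoked via subprocess (the destination directory is an assumption):

import subprocess

def inspect_namespace(namespace, dest="/tmp/cerberus-inspect"):
    # Gathers logs, events and metrics for the failed component's namespace.
    subprocess.run(
        ["oc", "adm", "inspect", "ns/" + namespace, "--dest-dir", dest],
        check=False,
    )

inspect_namespace("openshift-ingress")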

The http server port is hardcoded

The HTTP server is used to publish the Cerberus status, and the port it runs on is hardcoded to 8086 right now; we need to parameterize it to be able to run the server on a different port when needed.
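A sketch of reading the port from the config file instead of hardcoding it; the cerberus.port key mirrors the sample config shown further down this page.

import http.server
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

port = cfg.get("cerberus", {}).get("port", 8080)
server = http.server.HTTPServer(("0.0.0.0", port), http.server.SimpleHTTPRequestHandler)
print("publishing status on port", port)
# server.serve_forever()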

Monitor cluster operators

Cerberus needs to monitor cluster operators, in addition to nodes and other components, to check if they are degraded. Cluster operators have 3 conditions (Available, Progressing, Degraded); it only needs to check for the Degraded condition, as Progressing doesn't count towards a failure.

Monitor clusteroperator status

In addition to monitoring nodes and pod health, clusteroperator status is another key indicator of cluster health.

Suggestion for a Cerberus enhancement: add a config option to monitor clusteroperator status and watch for degraded operators. See: oc get clusteroperators -o yaml
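A sketch of the suggested check, parsing oc get clusteroperators -o yaml and flagging any operator whose Degraded condition is True:

import subprocess
import yaml

output = subprocess.run(
    ["oc", "get", "clusteroperators", "-o", "yaml"],
    capture_output=True, text=True, check=True,
).stdout

degraded = []
for co in yaml.safe_load(output).get("items", []):
    for condition in co.get("status", {}).get("conditions", []):
        if condition.get("type") == "Degraded" and condition.get("status") == "True":
            degraded.append(co["metadata"]["name"])

print("degraded cluster operators:", degraded or "none")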

SLACK integration

How do we add SLACK_API_TOKEN and SLACK_CHANNEL in the config.yaml file? Any example syntax would be helpful.
Thanks.

Alert on high latencies

#64 will enable us to get the time taken by the various Cerberus checks, which are basically calls to the Kube/OpenShift API server. We need to establish limits on the latency of each check, after which Cerberus will start logging the observed high latencies.

Requests tend to take more time on a large scale cluster (250 - 2000 nodes), but we need to establish the thresholds after which Cerberus should start alerting. This will help the user tune the cluster, in addition to explaining why Cerberus takes more time to finish the checks in each iteration.
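A sketch of one way to wrap a check with a latency warning; the 5-second threshold is an illustrative value, not an established limit.

import logging
import time

def timed_check(check, *args, threshold_seconds=5.0):
    start = time.monotonic()
    result = check(*args)
    elapsed = time.monotonic() - start
    if elapsed > threshold_seconds:
        logging.warning("%s took %.2fs, above the %.1fs threshold",
                        check.__name__, elapsed, threshold_seconds)
    return result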

Erroneous collecting "clusterversion" In case of openshift distribution

Good day!
I've tried to deploy Cerberus with distribution config "openshift" and faced the following issue:

2021-05-05 14:24:45,635 [INFO] Starting ceberus
2021-05-05 14:24:45,680 [INFO] Initializing client to talk to the Kubernetes cluster
2021-05-05 14:24:46,831 [INFO] Fetching cluster info
error: the server doesn't have a resource type "clusterversion"
2021-05-05 14:24:49,757 [ERROR] Failed to run kubectl get clusterversion
               _                         
  ___ ___ _ __| |__   ___ _ __ _   _ ___ 
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/
                                         

Traceback (most recent call last):
  File "start_cerberus.py", line 468, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 106, in main
    cluster_version = runcommand.invoke("kubectl get clusterversion")
  File "/root/cerberus/cerberus/invoke/command.py", line 12, in invoke
    return output
UnboundLocalError: local variable 'output' referenced before assignment

Could you please clarify for what purpose start_cerberus.py tries to get some resource with the name clusterversion using kubectl?
https://github.com/cloud-bulldozer/cerberus/blob/eb449aae83f9b331d76c4413d0b6f8ab020e0ee7/start_cerberus.py#L107-L109

Unfortunately, I cannot find any details about the command kubectl get clusterversion and the resource clusterversion.

As a temporary solution, I just switched off this if statement.

openshift gets no-go and an exception when status of csr is blank

Problem Description
I ran Cerberus against OpenShift 4.8 and got a no-go and an exception when the status of a CSR is blank.

Expected
I've not figured out why the status of my CSR is blank; I'll try to reproduce it and add what I find to this issue. However, I read the document https://docs.openshift.com/container-platform/3.11/install_config/redeploying_certificates.html#cert-expiry-approving-csrs, and it seems to me that in some cases the status of the certificates could be blank.

Approve all pending CSRs:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
I think cerberus should be able to handle this situation, avoid the exception and provide some logs.

Details
Exception I got from cerberus output

2021-05-07 17:58:06,665 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-05-07 17:58:06,665 [INFO] Exception: 'conditions'

Trackback

Traceback (most recent call last):
  File "start_cerberus.py", line 308, in main
    if "Approved" not in csr['status']['conditions'][0]['type']:

Lines 304 to 308

csrs = kubecli.get_csrs()
pending_csr = []
for csr in csrs['items']:
    # find csr status
    if "Approved" not in csr['status']['conditions'][0]['type']:

get_csrs() in client.py

def get_csrs():
    csr_string = runcommand.invoke("oc get csr -o yaml")
    csr_yaml = yaml.load(csr_string, Loader=yaml.FullLoader)
    return csr_yaml
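A defensive variant of the loop above (a sketch, not the project's actual fix) that treats a CSR with an empty status as pending instead of raising:

pending_csr = []
for csr in csrs.get("items", []):
    conditions = csr.get("status", {}).get("conditions") or []
    approved = any("Approved" in c.get("type", "") for c in conditions)
    if not approved:
        pending_csr.append(csr["metadata"]["name"])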

I manually ran oc get csr -o yaml against my cluster and got the output; below is a part of it. The status is blank.

% oc get csr -o yaml
apiVersion: v1
items:
- apiVersion: certificates.k8s.io/v1
  kind: CertificateSigningRequest
  metadata:
    creationTimestamp: "2021-05-07T09:00:25Z"
    generateName: csr-
    managedFields:
    - apiVersion: certificates.k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
      manager: aws-pod-identity-webhook
      operation: Update
      time: "2021-05-07T09:00:25Z"
    name: csr-4hlx4
    resourceVersion: "84948"
    uid: 5ab98b50-a5a0-4e8d-9d69-9411e3b46156
  spec:
    groups:
    - system:serviceaccounts
    - system:serviceaccounts:openshift-cloud-credential-operator
    - system:authenticated
    request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQi96Q0NBYVlDQVFBd1J6RkZNRU1HQTFVRUF4TThjRzlrTFdsa1pXNTBhWFI1TFhkbFltaHZiMnN1YjNCbApibk5vYVdaMExXTnNiM1ZrTFdOeVpXUmxiblJwWVd3dGIzQmxjbUYwYjNJdWMzWmpNRmt3RXdZSEtvWkl6ajBDCkFRWUlLb1pJemowREFRY0RRZ0FFRUtoOFpGcHZQSEF2MVNnalo3MkZIZk9lZDFaQWUzNWFJTGRSSStqNlU0NGoKRi8xazNsVFdIRUh6M3k4L2xaVDYzbFJPU1ZhbkVueko3L3I5VHFGbmU2Q0IvRENCK1FZSktvWklodmNOQVFrTwpNWUhyTUlIb01JSGxCZ05WSFJFRWdkMHdnZHFDRkhCdlpDMXBaR1Z1ZEdsMGVTMTNaV0pvYjI5cmdqaHdiMlF0CmFXUmxiblJwZEhrdGQyVmlhRzl2YXk1dmNHVnVjMmhwWm5RdFkyeHZkV1F0WTNKbFpHVnVkR2xoYkMxdmNHVnkKWVhSdmNvSThjRzlrTFdsa1pXNTBhWFI1TFhkbFltaHZiMnN1YjNCbGJuTm9hV1owTFdOc2IzVmtMV055WldSbApiblJwWVd3dGIzQmxjbUYwYjNJdWMzWmpna3B3YjJRdGFXUmxiblJwZEhrdGQyVmlhRzl2YXk1dmNHVnVjMmhwClpuUXRZMnh2ZFdRdFkzSmxaR1Z1ZEdsaGJDMXZjR1Z5WVhSdmNpNXpkbU11WTJ4MWMzUmxjaTVzYjJOaGJEQUsKQmdncWhrak9QUVFEQWdOSEFEQkVBaUJSbFZYUG1rNXVUT1BkRG5QK25RN0FPRDJQWTNVbm5jOFpuakk4TWRNdwo4UUlnTTRLMVQ1QnZ6UjJ2Q3M4ck42RCtDZE9jUnhLaytIclV3cFdISEZUT2V6ND0KLS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg==
    signerName: kubernetes.io/legacy-unknown
    uid: 4528a0fc-93cc-426e-a2f8-f618c1097ce6
    usages:
    - digital signature
    - key encipherment
    - server auth
    username: system:serviceaccount:openshift-cloud-credential-operator:pod-identity-webhook
  status: {}

The whole JSON output of oc get csr -o json is attached here: [csr.txt](https://github.com/cloud-bulldozer/cerberus/files/6442386/csr.txt)

Slack integration

Add support to send a message to a specified slack channel when the status of the monitored components is false. This way the user can jump in to inspect the cluster.

Post slack messages in threads

@jtaleric suggested an enhancement to post Cerberus messages using threads, i.e. one thread per monitored cluster, to help with readability, as it can get messy when using the same Slack channel for multiple clusters monitored by Cerberus.

Watching master_schedulable_status should be optional

Cerberus executes kubecli.process_master_taint to watch the schedulable status of master.
https://github.com/cloud-bulldozer/cerberus/blob/master/start_cerberus.py#L212
https://github.com/cloud-bulldozer/cerberus/blob/master/cerberus/kubernetes/client.py#L317

In certain hosted environments, the master nodes' information is not available, so the call to list nodes with the label "node-role.kubernetes.io/master" cannot be expected to succeed in all environments.
Call to list nodes with the above label: https://github.com/cloud-bulldozer/cerberus/blob/master/start_cerberus.py#L140
Thus we encountered an exception in https://github.com/cloud-bulldozer/cerberus/blob/master/cerberus/kubernetes/client.py#L302 while Cerberus was trying to check the master nodes' schedulable status.

2021-03-08 10:16:09,925 [INFO] Cerberus is not monitoring nodes, so setting the status to True and assuming that the nodes are ready
2021-03-08 10:16:10,160 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-03-08 10:16:10,160 [INFO] Exception: 'name'

2021-03-08 10:16:10,361 [INFO] Cerberus is not monitoring nodes, so setting the status to True and assuming that the nodes are ready
2021-03-08 10:16:10,398 [INFO] HTTP requests served: 0

2021-03-08 10:16:10,399 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-03-08 10:16:10,399 [INFO] Exception: local variable 'custom_checks_imports' referenced before assignment

Using the container deployment type, the container exits due to an error in "start_cerberus.py"

I'm using Cerberus for OpenShift and tried deploying it as a separate Docker container outside of the cluster:

Error:
File "start_cerberus.py", line 109, in main
cluster_version = runcommand.invoke("kubectl get clusterversion")

==============complete log ==============

$docker logs -f cerberus
2021-05-08 09:08:38,893 [INFO] Starting ceberus
2021-05-08 09:08:38,904 [INFO] Initializing client to talk to the Kubernetes cluster
2021-05-08 09:08:38,986 [INFO] Fetching cluster info
error: the server doesn't have a resource type "clusterversion"
               _
  ___ ___ _ __| |__   ___ _ __ _   _ ___
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/

2021-05-08 09:08:39,187 [ERROR] Failed to run kubectl get clusterversion
Traceback (most recent call last):
  File "start_cerberus.py", line 480, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 109, in main
    cluster_version = runcommand.invoke("kubectl get clusterversion")
  File "/root/cerberus/cerberus/invoke/command.py", line 12, in invoke
    return output
UnboundLocalError: local variable 'output' referenced before assignment

Slack integration

I set up a containerized Cerberus as follows:

And when slack integration is True I receive this error:

  [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
 [INFO] Exception: local variable 'custom_checks_fail_messages' referenced before assignment

docker run --rm --name=cerberus -p8080:8080 -e SLACK_API_TOKEN='XXXXX' -e SLACK_CHANNEL='XXXX' -v /XXX/2/auth/kubeconfig:/root/.kube/config -v /XXXX/cerberus/config.yaml:/root/cerberus/config/config.yaml quay.io/openshift-scale/cerberus

output of the config.yaml

cerberus:
    distribution: openshift                              # Distribution can be kubernetes or openshift
    kubeconfig_path: ~/.kube/config                      # Path to kubeconfig
    port: 8080                                           # http server port where cerberus status is published
    watch_nodes: True                                    # Set to True for the cerberus to monitor the cluster nodes
    watch_cluster_operators: True                        # Set to True for cerberus to monitor cluster operators
    watch_url_routes:                                    # Route url's you want to monitor, this is a double array with the url and optional authorization parameter
    watch_namespaces:                                    # List of namespaces to be monitored
        -    openshift-etcd
        -    openshift-apiserver
        -    openshift-kube-apiserver
        -    openshift-monitoring
        -    openshift-kube-controller-manager
        -    openshift-machine-api
        -    openshift-kube-scheduler
        -    openshift-ingress
        -    openshift-sdn                               # When enabled, it will check for the cluster sdn and monitor that namespace
    cerberus_publish_status: True                        # When enabled, cerberus starts a light weight http server and publishes the status
    inspect_components: True                            # Enable it only when OpenShift client is supported to run
                                                         # When enabled, cerberus collects logs, events and metrics of failed components

    prometheus_url:                                      # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
    prometheus_bearer_token:                             # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
                                                         # This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.

    slack_integration: True                            # When enabled, cerberus reports the failed iterations in the slack channel
                                                         # The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
                                                         # When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
    watcher_slack_ID:                                        # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:                                    # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned

    custom_checks:                                       # Relative paths of files containing additional user-defined checks

tunings:
    iterations: 5                                        # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
    sleep_time: 60                                       # Sleep duration between each iteration
    kube_api_request_chunk_size: 250                     # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
    daemon_mode: True                                    # Iterations are set to infinity which means that the cerberus will monitor the resources forever
    cores_usage_percentage: 0.5                          # Set the fraction of cores to be used for multiprocessing

database:
    database_path: /tmp/cerberus.db                      # Path where cerberus database needs to be stored
    reuse_database: False                                # When enabled, the database is reused to store the failures

Monitor the status of containers in a pod

Cerberus needs to check the status of the containers in the pods in addition to the pod status to determine go/no-go. This way, the report will have the details about the particular container failing in a pod in addition to the pod name.

Issue with custom-checks

I would like to contribute to the feature of adding custom checks, so I explored this a bit.

While testing with my custom-check file '/root/cerberus/custom_checks/custom_check_res_usage.py',
Cerberus shows this error message in the report:

2021-02-10 01:36:12,589 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-02-10 01:36:12,590 [INFO] Exception: Empty module name

Issues:

  1. Why this exception: Empty module name?
  2. It says that the signal is set to 'False'; however, the failure details are not being posted to the Slack channel that I have enabled.

Can we discuss this? @yashashreesuresh @chaitanyaenr
Please let me know how to take this discussion forward, via email or Slack?
Thanks.

Namespace monitoring overly aggressive in reporting failures

Currently the namespace check causes Cerberus to return a False status on any pod failing, even if that pod is part of a multi-pod deployment. For example, one of my kube-apiserver pods got briefly recycled for some reason, triggering the false flag even though the other two pods were up and, from a user perspective, the cluster was operational.

Are there any options to tune this behavior with regards to the namespace checking to tolerate some failures without reporting a False signal?
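One possible tuning (a sketch under the assumption that the kubernetes Python client is available) is to judge a namespace by its deployments' ready-replica counts rather than by individual pod status, so a single recycled pod in a multi-replica deployment does not flip the signal:

from kubernetes import client, config

def namespace_is_healthy(namespace):
    config.load_kube_config()
    apps = client.AppsV1Api()
    for deploy in apps.list_namespaced_deployment(namespace).items:
        desired = deploy.spec.replicas or 0
        ready = deploy.status.ready_replicas or 0
        if ready < desired:
            return False
    return True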

Test case in CI to report when a PR is adding more time to the checks

It is important to keep a check on the time taken by each watch/check in Cerberus, especially if it increases because of additional API calls. We need a test case in CI that displays the increase in time (the delta) for a PR when compared to the gold values. The gold values are highly dependent on the cluster size, but that should be okay as the CI cluster is always constant. This will help with Cerberus scalability (#67).
