
cerberus's Introduction

Cerberus

Guardian of Kubernetes and OpenShift Clusters

Cerberus logo

Cerberus watches Kubernetes/OpenShift clusters for dead nodes and system component failures/health, and exposes a go or no-go signal that other workload generators or applications in the cluster can consume and act on accordingly.

Workflow

Cerberus workflow

Installation

Instructions on how to set up, configure and run Cerberus can be found at Installation.

What Kubernetes/OpenShift components can Cerberus monitor?

The following are the components of Kubernetes/OpenShift that Cerberus can monitor today; more will be added soon.

Component Description Working
Nodes Watches all the nodes including masters, workers as well as nodes created using custom MachineSets ✔️
Namespaces Watches all the pods including containers running inside the pods in the namespaces specified in the config ✔️
Cluster Operators Watches all Cluster Operators ✔️
Masters Schedulability Watches and warns if master nodes are marked as schedulable ✔️
Routes Watches specified routes ✔️
CSRs Warns if any CSRs are not approved ✔️
Critical Alerts Warns the user on observing abnormal behavior which might affect the health of the cluster ✔️
Bring your own checks Users can bring their own checks and Cerberus runs and includes them in the reporting as well as the go/no-go signal ✔️

All the components that Cerberus can monitor are explained here

How does Cerberus report cluster health?

Cerberus exposes the cluster health and failures through a go/no-go signal, report and metrics API.

Go or no-go signal

When Cerberus is configured to run in daemon mode, it continuously monitors the specified components, runs a lightweight HTTP server at http://0.0.0.0:8080 and publishes the signal, i.e. True or False, depending on the component status. Tools can consume the signal and act accordingly.
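Below is a minimal sketch (not part of Cerberus itself) of how a consuming tool might poll the go/no-go signal; the URL assumes the default daemon-mode address mentioned above.

# poll_cerberus.py - hypothetical consumer of the Cerberus go/no-go signal
import requests

CERBERUS_URL = "http://0.0.0.0:8080/"

def cluster_is_healthy():
    """Return True only when Cerberus publishes 'True' for the monitored components."""
    try:
        response = requests.get(CERBERUS_URL, timeout=5)
        return response.text.strip() == "True"
    except requests.RequestException:
        # Treat an unreachable Cerberus instance as a no-go to stay on the safe side.
        return False

if cluster_is_healthy():
    print("go: continue the workload")
else:
    print("no-go: pause or abort the workload")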

Report

The report is generated in the run directory and contains information about each check/monitored component status per iteration, with timestamps. It also displays information about the components in case of failure. Refer to the report for an example.

You can use the "-o <file_path_name>" option to change the location of the generated report.

Metrics API

Cerberus exposes the metrics, including the failures observed during the run, through an API. Tools consuming Cerberus can query the API to get a JSON blob of the observed failures and act accordingly. For example, we can query for etcd failures within a start and end time and use the result to determine pass/fail for test cases, or report whether the cluster was healthy or unhealthy for that duration. A query sketch in Python follows the list below.

  • The failures in the past 1 hour can be retrieved in JSON format by visiting http://0.0.0.0:8080/history.
  • The failures in a specific time window can be retrieved in JSON format by visiting http://0.0.0.0:8080/history?loopback=.
  • The failures between two timestamps, the failures of specific issue types and the failures related to specific components can be retrieved in JSON format by visiting the http://0.0.0.0:8080/analyze URL. The appropriate filters have to be applied to scrape the failures.
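A short sketch of querying the metrics API with Python; the query parameter name (loopback) and response shape are taken from the URLs above and may differ between Cerberus versions.

import requests

BASE = "http://0.0.0.0:8080"

# Failures observed in the past hour.
recent = requests.get(BASE + "/history", timeout=10).json()

# Failures observed in a custom window, here assumed to be the past 30 minutes.
window = requests.get(BASE + "/history", params={"loopback": 30}, timeout=10).json()

print(recent)
print(window)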

Slack integration

Cerberus supports reporting failures in Slack. Refer to slack integration for information on how to set it up.

Node Problem Detector

Cerberus can also consume node-problem-detector to detect various failures in Kubernetes/OpenShift nodes. More information on setting it up can be found at node-problem-detector.

Bring your own checks

Users can add additional checks to monitor components that are not covered by Cerberus and have them count towards the go/no-go signal. This is done by placing the relative paths of the files containing the additional checks under custom_checks in the config file. All the checks should be placed within the main function of the file. If an additional check should be considered when determining the go/no-go signal, its main function can return a boolean. Returning a dict of the form {'status':status, 'message':message} sends the signal to Cerberus along with a message to be displayed in the Slack notification. Returning a value is optional. Refer to example_check for an example custom check file.
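A sketch of what such a custom check file might look like, following the contract described above (all logic in main(), optionally returning a boolean or a {'status': ..., 'message': ...} dict). The file name and the etcd pod count threshold are purely illustrative.

# custom_checks/check_etcd_pods.py - hypothetical custom check
import subprocess

def main():
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "openshift-etcd", "--no-headers"],
        capture_output=True, text=True,
    )
    running = [line for line in result.stdout.splitlines() if "Running" in line]
    status = len(running) >= 3
    message = "custom check: %d etcd pods running" % len(running)
    # Returning a dict lets Cerberus show the message in the Slack notification.
    return {"status": status, "message": message}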

Alerts

Monitoring metrics and alerting on abnormal behavior is critical, as they are indicators of cluster health. Information on supported alerts can be found at alerts.

Use cases

There can be any number of use cases; here are some of them:

  • We run tools that push the limits of Kubernetes/OpenShift to look at performance and scalability. There are a number of instances where system components or nodes start to degrade, which invalidates the results while the workload generator continues to push the cluster until it is unrecoverable.

  • When running chaos experiments on a Kubernetes/OpenShift cluster, they can potentially break components unrelated to the targeted ones, failures the chaos experiment itself won't detect. The go/no-go signal can be used here to decide whether the cluster recovered from the failure injection, as well as whether to continue with the next chaos scenario.

Tools consuming Cerberus

  • Benchmark Operator: The intent of this Operator is to deploy common workloads to establish a performance baseline of a Kubernetes cluster on your provider. Benchmark Operator consumes Cerberus to determine if the cluster was healthy during the benchmark run. More information can be found at cerberus-integration.

  • Kraken: Tool to inject deliberate failures into Kubernetes/OpenShift clusters to check whether they are resilient. Kraken consumes Cerberus to determine if the cluster is healthy as a whole, in addition to the targeted component, during chaos testing. More information can be found at cerberus-integration.

Blogs and other useful resources

Contributions

We are always looking for enhancements and fixes to make it better; any contributions are most welcome. Feel free to report or work on the issues filed on GitHub.

More information on how to Contribute

Community

Key Members (slack_usernames): paige, rook, mffiedler, mohit, dry923, rsevilla, ravi

Credits

Thanks to Mary Shakshober (https://github.com/maryshak1996) for designing the logo.

cerberus's People

Contributors

aakarshg, amitsagtani97, chaitanyaenr, dry923, harshil-codes, harshith-umesh, jordigilh, jtaleric, kedark3, lalan7, learnitall, mffiedler, mohit-sheth, paigerube14, ratsuf, rsevilla87, sandrobonazzola, smalleni, yashashreesuresh


cerberus's Issues

Cerberus scalability issues

A simple run of Cerberus on a 10 node cluster vs a 220 node cluster showed that we need to improve the way we run the checks for Cerberus to scale well on clusters with hundreds or thousands of nodes. The time taken to run the checks has been increasing as we add more checks, as @mffiedler mentioned in #53. Here are the observed timings:

10 nodes:
2020-05-22 23:00:17,391 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_nodes in iteration 1: 0.1057283878326416 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_cluster_operators in iteration 1: 0.3759939670562744 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run watch_namespaces in iteration 1: 1.1533699035644531 seconds
2020-05-22 23:00:17,392 [INFO] Time taken to run entire_iteration in iteration 1: 4.368650436401367 seconds
2020-05-22 23:00:17,392 [INFO] --------------------------------------------------------------------------

220 nodes:
2020-05-22 23:13:00,130 [INFO] -------------------------- Iteration Stats -------------------------------
2020-05-22 23:13:00,130 [INFO] Time taken to run watch_nodes in iteration 2: 19.62144660949707 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_cluster_operators in iteration 2: 1.0622196197509766 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run watch_namespaces in iteration 2: 68.95069146156311 seconds
2020-05-22 23:13:00,131 [INFO] Time taken to run entire_iteration in iteration 2: 161.81616592407227 seconds
2020-05-22 23:13:00,131 [INFO] --------------------------------------------------------------------------

@portante suggested areas for improvement. Multiprocessing will reduce the timing by using available cores to run checks in parallel (#60), but we also need to look at optimizing the code to reduce the number of API calls and the loops that iterate over objects to get the status, wherever possible. This issue is to track the observations and discuss ways to make Cerberus scale well on a large and dense cluster.

NOTE: The 10 nodes and 220 nodes clusters were different.

Thoughts?

Test to check the functionality

We need tests in CI to make sure a new commit does not break the functionality of Cerberus. We can run the tool with iterations set to 3 and sleep time between iterations set to 30 seconds, instead of daemon mode, to see if it works as expected.

Slack Integration "Exception: name 'thread_ts' is not defined"

I'm integrating Cerberus with my Slack workspace and, after working through things, thought I had it working. However, today Cerberus started returning a no-go signal based on an issue and I was expecting to see it pop up in Slack, but didn't. Looking at the output for Cerberus I see this error message appearing regularly:

Feb 11 09:30:49 lab-server bash[368944]: 2021-02-11 09:30:49,399 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
**Feb 11 09:30:49 lab-server bash[368944]: 2021-02-11 09:30:49,399 [INFO] Exception: name 'thread_ts' is not defined**
Feb 11 09:30:49 lab-server bash[369025]: 2021-02-11 09:30:49,756 [INFO] Iteration 205: Node status: True
Feb 11 09:30:50 lab-server bash[369028]: 2021-02-11 09:30:50,071 [INFO] Iteration 205: Cluster Operator status: True
Feb 11 09:30:50 lab-server bash[369027]: 2021-02-11 09:30:50,219 [INFO] Iteration 205: openshift-machine-api: True
Feb 11 09:30:50 lab-server bash[369032]: 2021-02-11 09:30:50,224 [INFO] Iteration 205: openshift-apiserver: True
Feb 11 09:30:50 lab-server bash[369033]: 2021-02-11 09:30:50,225 [INFO] Iteration 205: openshift-kube-controller-manager: True
Feb 11 09:30:50 lab-server bash[369029]: 2021-02-11 09:30:50,240 [INFO] Iteration 205: openshift-kube-apiserver: True
Feb 11 09:30:50 lab-server bash[369024]: 2021-02-11 09:30:50,242 [INFO] Iteration 205: openshift-sdn: True
Feb 11 09:30:50 lab-server bash[369026]: 2021-02-11 09:30:50,253 [INFO] Iteration 205: openshift-etcd: True
Feb 11 09:30:50 lab-server bash[369023]: 2021-02-11 09:30:50,253 [INFO] Iteration 205: openshift-kube-scheduler: True
Feb 11 09:30:50 lab-server bash[369031]: 2021-02-11 09:30:50,254 [INFO] Iteration 205: openshift-ingress: False
Feb 11 09:30:50 lab-server bash[369034]: 2021-02-11 09:30:50,295 [INFO] Iteration 205: openshift-monitoring: True
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] HTTP requests served: 1
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] Iteration 205: Failed pods and components
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO] openshift-ingress: ['router-default-7645688499-8j5c4']
Feb 11 09:30:50 lab-server bash[368944]: 2021-02-11 09:30:50,296 [INFO]

I do not have any watcher or team alias defined in my config:

   slack_integration: True
    watcher_slack_ID:                                        
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:

Track Kube/OpenShift object restarts

@mffiedler suggested an enhancement for Cerberus to monitor pod restarts. Cerberus will miss a failure if a pod restarts and gets back to a running state during the wait time between iterations. It needs to check whether there was a restart in each iteration and take it into consideration when setting the go/no-go signal.

One way to implement this would be to keep track of the restart count at the start and end of each iteration and compare them to know if there was a pod restart during the wait time.
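A sketch of that idea using the kubernetes Python client (an assumption; the project may implement it differently): record the restart counts at the start of an iteration and compare them at the end.

from kubernetes import client, config

def restart_counts(namespace):
    """Map 'pod/container' to its restart count for every container in the namespace."""
    v1 = client.CoreV1Api()
    counts = {}
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            counts[pod.metadata.name + "/" + cs.name] = cs.restart_count
    return counts

config.load_kube_config()
before = restart_counts("openshift-etcd")
# ... sleep_time elapses between iterations ...
after = restart_counts("openshift-etcd")
restarted = [name for name, count in after.items() if count > before.get(name, 0)]
if restarted:
    print("containers restarted between iterations:", restarted)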

Cerberus errors out when unable to find namespace

When launching the container and it is unable to find the requested namespaces (openshift-sdn and openshift-ovn-kubernetes), Cerberus errors out. Should this be an error, or should we just report a failed state but continue to start up Cerberus?

2020-04-27 20:23:35,693 [ERROR] Could not find openshift-sdn and openshift-ovn-kubernetes namespaces,         please specify the correct networking namespace in config file
Traceback (most recent call last):
  File "start_cerberus.py", line 223, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 50, in main
    watch_namespaces = [w.replace('openshift-sdn', sdn_namespace) for w in watch_namespaces]
  File "start_cerberus.py", line 50, in <listcomp>
    watch_namespaces = [w.replace('openshift-sdn', sdn_namespace) for w in watch_namespaces]
TypeError: replace() argument 2 must be str, not None

CI for running linters on PR's

We need to enable Travis CI to run linters on each PR to make sure it follows best practices and doesn't break the tool because of things like wrong indentation, etc.

Ability to add/enable collections of optional monitors

This might be stretching the original intent of Cerberus, but I see a trend. As we add additional checks, the patterns we could follow are a) make the new check a default and always run it, b) give the new check an option in the config, or c) introduce the idea of collections of optional checks, or maybe just one optional collection of "verbose health checks" for simplicity.

There are a lot of detailed things that could be monitored on a cluster; whether Cerberus should monitor them is open for discussion (issue #42). As new checks are added, the monitor loop time grows at least linearly with the number of monitored namespaces, and faster when pod checks are included (PR #52).

For discussion: should we identify a core set of critical checks and enable some mechanism for optional/verbose checks without adding a config flag for every one of them?

/cc: @paigerube14 @chaitanyaenr @yashashreesuresh

Monitor kube scheduler

The tool needs to monitor the kube-scheduler, as it's one of the key components essential for cluster functionality.

[RFE] New API for capturing number of failures

Today we have a single signal, go/no-go; however, this signal could miss operators/pods flapping. We should provide an additional signal which accepts a time range and reports the number of failures seen within that window.

/history?lookback=60

lookback should be in minutes and should allow the user to determine how far back in time they want to know whether failures happened.

The return of this should be JSON:

history : {
  duration: int(),
  failures : {
    count: int(),
    issues : [ list of problems that occurred in the duration ]
  }
} 

The history api will provide the user/tool with more data to determine if they should continue on, or stop.

Thoughts?

Generate a report

The tool currently prints the events/logs to stdout; it needs to generate a report with timestamps that we can use to look at the events later.

Monitor apiserver availability

While executing chaos (kraken) scenarios, measure the amount of time the apiserver is unavailable. The goal is zero downtime, but that is not the current reality. While the apiserver is unavailable, cerberus should return a no-go signal and ideally maintain a metric with the amount of time it is unavailable.

This can be tested with calls to https://:<api_port>/healthz. For OpenShift this might look like https://api.mffiedler-511.perf-testing.devcluster.openshift.com:6443/healthz

Cerberus cop enhancement similar to OCP build cop

@jtaleric came up with an idea to add support for tagging specific people on Slack in case of cluster failures, instead of everyone in the channel. The Cerberus cop is responsible for taking a look and fixing or filing a BZ for the failures on the cluster. We can provide a config with a list of people/Slack handles, along with specific people assigned as cops for each day of the week, for Cerberus to read and only ping the active cop for the day.

Add python linter

@aakarshg suggested adding a linter; it is good practice to use linters to improve code quality. We need to enable it for every pull request using a CI like Travis.

Warn in case of pending csr's

Cerberus needs to check for any pending CSRs and warn the user, as nodes need their CSRs approved to be part of the cluster (this might be useful for monitoring the cluster shutdown scenario in Kraken). It shouldn't be considered for pass/fail, since a scale-up will have pending CSRs for a short duration and that is expected.

Incorrect results while checking master NoSchedule taint

When a master node has several taints associated with it, runcommand.invoke("kubectl describe nodes/" + node_name + ' | grep Taints') in get_taint_from_describe only greps the first taint, leading to incorrect results. We need to loop through all the taints to check whether the master node has the node-role.kubernetes.io/master:NoSchedule taint.
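A sketch of checking every taint instead of only the first line grep returns, using the kubernetes Python client rather than kubectl describe (an illustrative alternative, not the project's fix):

from kubernetes import client, config

def master_has_noschedule_taint(node_name):
    config.load_kube_config()
    node = client.CoreV1Api().read_node(node_name)
    for taint in node.spec.taints or []:
        if taint.key == "node-role.kubernetes.io/master" and taint.effect == "NoSchedule":
            return True
    return False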

Global debugging messages

How do you enable debug-level logging in the config files for Cerberus? It would be great if this were possible and documented.

Investigate Kubemark

A cluster with hundreds or thousands of nodes is needed to test how Cerberus runs at scale. In discussion with @smalleni about project Sahasra, which simulates compute nodes to stress the OpenStack control plane, we were wondering if doing the same would help us with testing Cerberus. We had an issue open about it to help with testing the control plane/cluster maximums without having to set up a large scale cluster, but Kubemark, being specific to Kubernetes, might not be the right fit.

We need to investigate Kubemark to see if it will satisfy our need for testing Cerberus at scale by simulating worker nodes to stress the control plane instead of using real hardware: https://github.com/kubernetes/kubernetes/tree/a8128804abbc311328b146c007df2cac09ba7fdf/test/kubemark/pre-existing.

NOTE: This might be an overkill for solving a simple problem :-)

Monitor application/ingress route availability (OpenShift specific monitor)

While executing chaos (kraken) scenarios, measure the amount of time the ingress routes to applications are unavailable. The goal is zero downtime, but that is not the current reality.

While the application routes are unavailable, Cerberus should return a no-go signal and ideally maintain a metric with the amount of time they are unavailable.

This should be an optional config item which:

  1. sets up a quickstart application (e.g. oc new-app -t nodejs-mongodb-example)
  2. retrieves the route for the new application
  3. monitors the availability of the route (HTTP 200 vs other); a polling sketch follows this list
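A rough polling sketch for step 3 (the route URL is a placeholder for whatever route step 2 returns):

import time
import requests

ROUTE = "http://nodejs-mongodb-example.apps.example.com"  # hypothetical route from step 2
INTERVAL = 5.0
downtime = 0.0

for _ in range(60):  # poll for roughly five minutes
    start = time.monotonic()
    try:
        ok = requests.get(ROUTE, timeout=3).status_code == 200
    except requests.RequestException:
        ok = False
    if not ok:
        downtime += INTERVAL
    time.sleep(max(0.0, INTERVAL - (time.monotonic() - start)))

print("route was unavailable for roughly %.0f seconds" % downtime)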

Pip package cerberus

This will allow users/tools to simply use pip to install and set up Cerberus, simplifying the process.

Enable support to run Cerberus tool on Power (ppc64le)

Task: Enable support to run Cerberus tool on Power (ppc64le)

The scenario is to run Cerberus tests on a host having Power (ppc64le) architecture. This would be beneficial to test clusters in environments having only Power VMs.

The Dockerfile that is currently available only runs on the Intel architecture.
To solve this, I have created a separate Dockerfile for ppc64le so that the tool can be run without any dependency problems.

Collect logs, events and metrics of the failed component

The tool needs to collect logs, metrics and events relevant to the components when there's a failure. This can be achieved by running the following command using the invoke module added in #20: $ oc adm inspect <failed_component_namespace>. Eventually, the collected data can be exposed by the http server along with the go/no-go signal.
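A sketch of how the collection step could be invoked via subprocess (the destination directory is an assumption):

import subprocess

def inspect_namespace(namespace, dest="/tmp/cerberus-inspect"):
    # Gathers logs, events and metrics for the failed component's namespace.
    subprocess.run(
        ["oc", "adm", "inspect", "ns/" + namespace, "--dest-dir", dest],
        check=False,
    )

inspect_namespace("openshift-ingress")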

The http server port is hardcoded

The HTTP server is used to publish the Cerberus status, and the port it runs on is hardcoded to 8086 right now; we need to parameterize it to be able to run the server on a different port when needed.
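A sketch of reading the port from the config file instead of hardcoding it; the cerberus.port key mirrors the sample config shown further down this page.

import http.server
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

port = cfg.get("cerberus", {}).get("port", 8080)
server = http.server.HTTPServer(("0.0.0.0", port), http.server.SimpleHTTPRequestHandler)
print("publishing status on port", port)
# server.serve_forever()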

Monitor cluster operators

Cerberus needs to monitor cluster operators, in addition to nodes and other components, to check if they are degraded. Cluster operators have 3 conditions (Available, Progressing, Degraded); it only needs to check for the Degraded condition, as Progressing doesn't count towards a failure.

Monitor clusteroperator status

In addition to monitoring nodes and pod health, clusteroperator status is another key indicator of cluster health.

Suggestion for a Cerberus enhancement: add a config option to monitor clusteroperator status and watch for degraded operators. See: oc get clusteroperators -o yaml
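A sketch of the suggested check, parsing oc get clusteroperators -o yaml and flagging any operator whose Degraded condition is True:

import subprocess
import yaml

output = subprocess.run(
    ["oc", "get", "clusteroperators", "-o", "yaml"],
    capture_output=True, text=True, check=True,
).stdout

degraded = []
for co in yaml.safe_load(output).get("items", []):
    for condition in co.get("status", {}).get("conditions", []):
        if condition.get("type") == "Degraded" and condition.get("status") == "True":
            degraded.append(co["metadata"]["name"])

print("degraded cluster operators:", degraded or "none")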

SLACK integration

How do we add SLACK_API_TOKEN and SLACK_CHANNEL in the config.yaml file? Any example syntax would be helpful.
Thanks.

Alert on high latencies

#64 will enable us to get the time taken by the various Cerberus checks, which are basically calls to the Kube/OpenShift API server. We need to establish limits on the latency of each check, after which Cerberus will start logging the observed high latencies.

Requests tend to take more time on a large scale cluster (250 - 2000 nodes), but we need to establish the thresholds after which Cerberus should start alerting. This will help the user tune the cluster, in addition to explaining why Cerberus takes more time to finish the checks in each iteration.
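A sketch of one way to wrap a check with a latency warning; the 5-second threshold is an illustrative value, not an established limit.

import logging
import time

def timed_check(check, *args, threshold_seconds=5.0):
    start = time.monotonic()
    result = check(*args)
    elapsed = time.monotonic() - start
    if elapsed > threshold_seconds:
        logging.warning("%s took %.2fs, above the %.1fs threshold",
                        check.__name__, elapsed, threshold_seconds)
    return result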

Erroneous collecting "clusterversion" In case of openshift distribution

Good day!
I've tried to deploy Cerberus with distribution config "openshift" and faced the following issue:

2021-05-05 14:24:45,635 [INFO] Starting ceberus
2021-05-05 14:24:45,680 [INFO] Initializing client to talk to the Kubernetes cluster
2021-05-05 14:24:46,831 [INFO] Fetching cluster info
error: the server doesn't have a resource type "clusterversion"
2021-05-05 14:24:49,757 [ERROR] Failed to run kubectl get clusterversion
               _                         
  ___ ___ _ __| |__   ___ _ __ _   _ ___ 
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/
                                         

Traceback (most recent call last):
  File "start_cerberus.py", line 468, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 106, in main
    cluster_version = runcommand.invoke("kubectl get clusterversion")
  File "/root/cerberus/cerberus/invoke/command.py", line 12, in invoke
    return output
UnboundLocalError: local variable 'output' referenced before assignment

Could you please clarify for what purpose start_cerberus.py tries to get some resource with the name clusterversion using kubectl?
https://github.com/cloud-bulldozer/cerberus/blob/eb449aae83f9b331d76c4413d0b6f8ab020e0ee7/start_cerberus.py#L107-L109

Unfortunately, I cannot find any details about the command kubectl get clusterversion and the resource clusterversion.

As a temporary solution, I just switched off this if statement.

openshift gets no-go and an exception when status of csr is blank

Problem Description
I ran Cerberus against OpenShift 4.8 and got a no-go and an exception when the status of a CSR is blank.

Expected
I've not figured out why the status of my CSR is blank; I'll try to reproduce it and add what I find to this issue. However, I read the document https://docs.openshift.com/container-platform/3.11/install_config/redeploying_certificates.html#cert-expiry-approving-csrs, and it seems to me that in some cases the status of the certificates could be blank.

Approve all pending CSRs:

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
I think cerberus should be able to handle this situation, avoid the exception and provide some logs.

Details
Exception I got from cerberus output

2021-05-07 17:58:06,665 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-05-07 17:58:06,665 [INFO] Exception: 'conditions'

Trackback

Traceback (most recent call last):
  File "start_cerberus.py", line 308, in main
    if "Approved" not in csr['status']['conditions'][0]['type']:

Lines 304 to 308

csrs = kubecli.get_csrs()
pending_csr = []
for csr in csrs['items']:
    # find csr status
    if "Approved" not in csr['status']['conditions'][0]['type']:

get_csrs() in client.py

def get_csrs():
    csr_string = runcommand.invoke("oc get csr -o yaml")
    csr_yaml = yaml.load(csr_string, Loader=yaml.FullLoader)
    return csr_yaml
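A defensive variant of the loop above (a sketch, not the project's actual fix) that treats a CSR with an empty status as pending instead of raising:

pending_csr = []
for csr in csrs.get("items", []):
    conditions = csr.get("status", {}).get("conditions") or []
    approved = any("Approved" in c.get("type", "") for c in conditions)
    if not approved:
        pending_csr.append(csr["metadata"]["name"])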

I manually ran oc get csr -o yaml against my cluster and got the output; below is a part of it. The status is blank.

% oc get csr -o yaml
apiVersion: v1
items:
- apiVersion: certificates.k8s.io/v1
  kind: CertificateSigningRequest
  metadata:
    creationTimestamp: "2021-05-07T09:00:25Z"
    generateName: csr-
    managedFields:
    - apiVersion: certificates.k8s.io/v1beta1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:generateName: {}
      manager: aws-pod-identity-webhook
      operation: Update
      time: "2021-05-07T09:00:25Z"
    name: csr-4hlx4
    resourceVersion: "84948"
    uid: 5ab98b50-a5a0-4e8d-9d69-9411e3b46156
  spec:
    groups:
    - system:serviceaccounts
    - system:serviceaccounts:openshift-cloud-credential-operator
    - system:authenticated
    request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQi96Q0NBYVlDQVFBd1J6RkZNRU1HQTFVRUF4TThjRzlrTFdsa1pXNTBhWFI1TFhkbFltaHZiMnN1YjNCbApibk5vYVdaMExXTnNiM1ZrTFdOeVpXUmxiblJwWVd3dGIzQmxjbUYwYjNJdWMzWmpNRmt3RXdZSEtvWkl6ajBDCkFRWUlLb1pJemowREFRY0RRZ0FFRUtoOFpGcHZQSEF2MVNnalo3MkZIZk9lZDFaQWUzNWFJTGRSSStqNlU0NGoKRi8xazNsVFdIRUh6M3k4L2xaVDYzbFJPU1ZhbkVueko3L3I5VHFGbmU2Q0IvRENCK1FZSktvWklodmNOQVFrTwpNWUhyTUlIb01JSGxCZ05WSFJFRWdkMHdnZHFDRkhCdlpDMXBaR1Z1ZEdsMGVTMTNaV0pvYjI5cmdqaHdiMlF0CmFXUmxiblJwZEhrdGQyVmlhRzl2YXk1dmNHVnVjMmhwWm5RdFkyeHZkV1F0WTNKbFpHVnVkR2xoYkMxdmNHVnkKWVhSdmNvSThjRzlrTFdsa1pXNTBhWFI1TFhkbFltaHZiMnN1YjNCbGJuTm9hV1owTFdOc2IzVmtMV055WldSbApiblJwWVd3dGIzQmxjbUYwYjNJdWMzWmpna3B3YjJRdGFXUmxiblJwZEhrdGQyVmlhRzl2YXk1dmNHVnVjMmhwClpuUXRZMnh2ZFdRdFkzSmxaR1Z1ZEdsaGJDMXZjR1Z5WVhSdmNpNXpkbU11WTJ4MWMzUmxjaTVzYjJOaGJEQUsKQmdncWhrak9QUVFEQWdOSEFEQkVBaUJSbFZYUG1rNXVUT1BkRG5QK25RN0FPRDJQWTNVbm5jOFpuakk4TWRNdwo4UUlnTTRLMVQ1QnZ6UjJ2Q3M4ck42RCtDZE9jUnhLaytIclV3cFdISEZUT2V6ND0KLS0tLS1FTkQgQ0VSVElGSUNBVEUgUkVRVUVTVC0tLS0tCg==
    signerName: kubernetes.io/legacy-unknown
    uid: 4528a0fc-93cc-426e-a2f8-f618c1097ce6
    usages:
    - digital signature
    - key encipherment
    - server auth
    username: system:serviceaccount:openshift-cloud-credential-operator:pod-identity-webhook
  status: {}

The whole JSON output of oc get csr -o json is attached here: [csr.txt](https://github.com/cloud-bulldozer/cerberus/files/6442386/csr.txt)

Slack integration

Add support to send a message to a specified slack channel when the status of the monitored components is false. This way the user can jump in to inspect the cluster.

Post slack messages in threads

@jtaleric suggested an enhancement to post Cerberus messages using threads, i.e. one thread per monitored cluster, to help with readability, as it can get messy when using the same Slack channel for multiple clusters monitored by Cerberus.

Watching master_schedulable_status should be optional

Cerberus executes kubecli.process_master_taint to watch the schedulable status of master.
https://github.com/cloud-bulldozer/cerberus/blob/master/start_cerberus.py#L212
https://github.com/cloud-bulldozer/cerberus/blob/master/cerberus/kubernetes/client.py#L317

In certain hosted environments, the master nodes' information is not available, so the call to list nodes with the label "node-role.kubernetes.io/master" cannot be expected to succeed in all environments.
Call to list nodes with the above label: https://github.com/cloud-bulldozer/cerberus/blob/master/start_cerberus.py#L140
Thus we encountered an exception in https://github.com/cloud-bulldozer/cerberus/blob/master/cerberus/kubernetes/client.py#L302 while Cerberus was trying to check the master nodes' schedulable status.

2021-03-08 10:16:09,925 [INFO] Cerberus is not monitoring nodes, so setting the status to True and assuming that the nodes are ready
2021-03-08 10:16:10,160 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-03-08 10:16:10,160 [INFO] Exception: 'name'

2021-03-08 10:16:10,361 [INFO] Cerberus is not monitoring nodes, so setting the status to True and assuming that the nodes are ready
2021-03-08 10:16:10,398 [INFO] HTTP requests served: 0

2021-03-08 10:16:10,399 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-03-08 10:16:10,399 [INFO] Exception: local variable 'custom_checks_imports' referenced before assignment

Using the container deployment type, the container exits due to an error in "start_cerberus.py"

I'm using Cerberus for OpenShift and tried deploying it as a separate Docker container outside of the cluster:

Error:
File "start_cerberus.py", line 109, in main
cluster_version = runcommand.invoke("kubectl get clusterversion")

==============complete log ==============

$docker logs -f cerberus
2021-05-08 09:08:38,893 [INFO] Starting ceberus
2021-05-08 09:08:38,904 [INFO] Initializing client to talk to the Kubernetes cluster
2021-05-08 09:08:38,986 [INFO] Fetching cluster info
error: the server doesn't have a resource type "clusterversion"
               _
  ___ ___ _ __| |__   ___ _ __ _   _ ___
 / __/ _ \ '__| '_ \ / _ \ '__| | | / __|
| (_|  __/ |  | |_) |  __/ |  | |_| \__ \
 \___\___|_|  |_.__/ \___|_|   \__,_|___/

2021-05-08 09:08:39,187 [ERROR] Failed to run kubectl get clusterversion
Traceback (most recent call last):
  File "start_cerberus.py", line 480, in <module>
    main(options.cfg)
  File "start_cerberus.py", line 109, in main
    cluster_version = runcommand.invoke("kubectl get clusterversion")
  File "/root/cerberus/cerberus/invoke/command.py", line 12, in invoke
    return output
UnboundLocalError: local variable 'output' referenced before assignment

Slack integration

I set up a containerized Cerberus as follows:

And when slack integration is True I receive this error:

  [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
 [INFO] Exception: local variable 'custom_checks_fail_messages' referenced before assignment

docker run --rm --name=cerberus -p8080:8080 -e SLACK_API_TOKEN='XXXXX' -e SLACK_CHANNEL='XXXX' -v /XXX/2/auth/kubeconfig:/root/.kube/config -v /XXXX/cerberus/config.yaml:/root/cerberus/config/config.yaml quay.io/openshift-scale/cerberus

output of the config.yaml

cerberus:
    distribution: openshift                              # Distribution can be kubernetes or openshift
    kubeconfig_path: ~/.kube/config                      # Path to kubeconfig
    port: 8080                                           # http server port where cerberus status is published
    watch_nodes: True                                    # Set to True for the cerberus to monitor the cluster nodes
    watch_cluster_operators: True                        # Set to True for cerberus to monitor cluster operators
    watch_url_routes:                                    # Route url's you want to monitor, this is a double array with the url and optional authorization parameter
    watch_namespaces:                                    # List of namespaces to be monitored
        -    openshift-etcd
        -    openshift-apiserver
        -    openshift-kube-apiserver
        -    openshift-monitoring
        -    openshift-kube-controller-manager
        -    openshift-machine-api
        -    openshift-kube-scheduler
        -    openshift-ingress
        -    openshift-sdn                               # When enabled, it will check for the cluster sdn and monitor that namespace
    cerberus_publish_status: True                        # When enabled, cerberus starts a light weight http server and publishes the status
    inspect_components: True                            # Enable it only when OpenShift client is supported to run
                                                         # When enabled, cerberus collects logs, events and metrics of failed components

    prometheus_url:                                      # The prometheus url/route is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes.
    prometheus_bearer_token:                             # The bearer token is automatically obtained in case of OpenShift, please set it when the distribution is Kubernetes. This is needed to authenticate with prometheus.
                                                         # This enables Cerberus to query prometheus and alert on observing high Kube API Server latencies.

    slack_integration: True                            # When enabled, cerberus reports the failed iterations in the slack channel
                                                         # The following env vars needs to be set: SLACK_API_TOKEN ( Bot User OAuth Access Token ) and SLACK_CHANNEL ( channel to send notifications in case of failures )
                                                         # When slack_integration is enabled, a watcher can be assigned for each day. The watcher of the day is tagged while reporting failures in the slack channel. Values are slack member ID's.
    watcher_slack_ID:                                        # (NOTE: Defining the watcher id's is optional and when the watcher slack id's are not defined, the slack_team_alias tag is used if it is set else no tag is used while reporting failures in the slack channel.)
        Monday:
        Tuesday:
        Wednesday:
        Thursday:
        Friday:
        Saturday:
        Sunday:
    slack_team_alias:                                    # The slack team alias to be tagged while reporting failures in the slack channel when no watcher is assigned

    custom_checks:                                       # Relative paths of files containing additional user-defined checks

tunings:
    iterations: 5                                        # Iterations to loop before stopping the watch, it will be replaced with infinity when the daemon mode is enabled
    sleep_time: 60                                       # Sleep duration between each iteration
    kube_api_request_chunk_size: 250                     # Large requests will be broken into the specified chunk size to reduce the load on API server and improve responsiveness.
    daemon_mode: True                                    # Iterations are set to infinity which means that the cerberus will monitor the resources forever
    cores_usage_percentage: 0.5                          # Set the fraction of cores to be used for multiprocessing

database:
    database_path: /tmp/cerberus.db                      # Path where cerberus database needs to be stored
    reuse_database: False                                # When enabled, the database is reused to store the failures

Monitor the status of containers in a pod

Cerberus needs to check the status of the containers in the pods in addition to the pod status to determine go/no-go. This way, the report will have the details about the particular container failing in a pod in addition to the pod name.

Issue with custom-checks

I would like to contribute to the feature of adding custom checks, so I explored this a bit.

While testing with my custom-check file '/root/cerberus/custom_checks/custom_check_res_usage.py',
Cerberus shows this error message in the report:

2021-02-10 01:36:12,589 [INFO] Encountered issues in cluster. Hence, setting the go/no-go signal to false
2021-02-10 01:36:12,590 [INFO] Exception: Empty module name

Issues:

  1. Why this exception: Empty module name?
  2. It says that the signal is set to 'False'; however, the failure details are not being posted to the Slack channel that I have enabled.

Can we discuss this? @yashashreesuresh @chaitanyaenr
Please let me know how to take this discussion forward, via email or Slack?
Thanks.

Namespace monitoring overly aggressive in reporting failures

Currently the namespace check causes Cerberus to return a False status on any pod failing, even if that pod is part of a multi-pod deployment. For example, one of my kube-apiserver pods got briefly recycled for some reason, triggering the false flag even though the other two pods were up and, from a user perspective, the cluster was operational.

Are there any options to tune this behavior with regards to the namespace checking to tolerate some failures without reporting a False signal?
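One possible tuning (a sketch under the assumption that the kubernetes Python client is available) is to judge a namespace by its deployments' ready-replica counts rather than by individual pod status, so a single recycled pod in a multi-replica deployment does not flip the signal:

from kubernetes import client, config

def namespace_is_healthy(namespace):
    config.load_kube_config()
    apps = client.AppsV1Api()
    for deploy in apps.list_namespaced_deployment(namespace).items:
        desired = deploy.spec.replicas or 0
        ready = deploy.status.ready_replicas or 0
        if ready < desired:
            return False
    return True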

Test case in CI to report when a PR is adding more time to the checks

It is important to keep a check on the time taken by each watch/check in Cerberus, especially if it increases because of additional API calls. We need a test case in CI that displays the increase in time (the delta) for a PR when compared to the gold values. The gold values are highly dependent on the cluster size, but that should be okay as the CI cluster is always constant. This will help with Cerberus scalability (#67).
