khealth

khealth is a Kubernetes cluster monitoring suite. Its Routines exercise Kubernetes subsystems and send events to Collectors. Collectors collate these events to compute current cluster state. Cluster status is available from Collectors over a simple HTTP API, which is served on a cluster nodeport in the example below.

Quick start

If you have a Kubernetes cluster, you can deploy khealth:

cd khealth/
kubectl create -f ./contrib/k8s/khealth-ns.yaml
kubectl --namespace=khealth create -f ./contrib/k8s/khealth-rc.yaml
kubectl --namespace=khealth create -f ./contrib/k8s/khealth-service.yaml

This creates a NodePort service that exposes the following status endpoints:

Command       NodePort
rcscheduler   31337

Architecture

A khealth Module is a single command that invokes a set of Routines and a single Collector. The Collector gathers events from the Routines and exposes metrics on its status endpoint.

Directory Layout

cmd/

This is where the Module entrypoint programs live. Each Module should have exactly one main package in an eponymous directory beneath cmd/.

pkg/routines/

Routines are defined in structures that implement the RoutineHandler interface.

type RoutineHandler interface {
  Init() error
  Poll() error
  Cleanup() error
}

Init is called once, then Poll is called in a loop. When the TTL expires, the Poll loop terminates and Cleanup is called. Each iteration of this cycle generates events, which are sent on the Routine's Events channel, usually to a Collector.

The NewRoutine function returns a pointer to a khealth Routine struct. It takes the following arguments:

  • client: the Kubernetes API client
  • pollInterval: how often (in seconds) Poll is called
  • podTTL: how many seconds to loop on Poll before calling Cleanup
  • handler: the RoutineHandler for this routine

pkg/collectors/

type Collector interface {
  Start() error
  Status(w http.ResponseWriter, r *http.Request)
  Terminate() error
}

Collectors must implement the Collector interface and make use of Routines. To wire Routines to a Collector implementation, follow this general pattern:

  • Start: Call Start on all routines this collector uses. Then begin reading events from each routine's Events channel and collating current state.
  • Status: Serialize current state to HTTP response.
  • Terminate: Call SignalTerminate on each routine. SignalTerminate is non-blocking, so before returning you'll want to block until each Routine's Events channel has emitted a nil value. That way, when Terminate returns you can be assured your Routines have all cleaned up.

Included Modules

cmd/rcscheduler/

This module uses a single routine which schedules/unschedules pause pods via a replication controller. The program exposes a single health endpoint which reports the state of the latest event.

Roadmap

  • More routines: We want routines that do everything! Test network latency. Write to disk. Compute Fibonacci sequences.

  • Prometheus integration: Collectors expose Prometheus-compatible status endpoints and metrics, providing readymade infrastructure to aggregate statistics from a set of canary pods, designed specifically to exercise Kubernetes cluster resources.

  • Alerting: Use the experimental Alertmanager to alert on metrics.

Who should use this?

Cluster administrators: Gain insight into your Kubernetes cluster's performance. Monitor health endpoints which report on various testing routines.

Kubernetes developers: A convenient way to "smoke test" a cluster. Feel free to write Modules that exist solely to torture test a cluster and have no business running on the same cluster as production assets. And turn the replica count way up!

Contributors

colhom

khealth's Issues

rcscheduler does not recover from ungraceful shutdown

If a cluster is shut down ungracefully and then started again, the khealth controller's rcscheduler container comes back in an irrecoverable state. The health endpoint will forever report:

init error: replicationControllers "khealth-rcscheduler" already exists

We should probably have rcscheduler attempt to tear down any existing khealth-rcscheduler pods/rc/services before starting the init loop.

Commands need to catch signals and clean up

Currently, before a "graceful exit" via signal, rcscheduler does not clean up its routine.

It would be nice if executables caught SIGTERM, SIGHUP, etc. and attempted to clean up the routine before exiting.

That way, deleting your rcscheduler pod would not leave a bunch of orphaned canary pods on the cluster under "normal operating conditions".

Tests

There is currently 0% test coverage.

Brainstorm on a few scenarios:

  1. Point at a namespace with no resources available
  2. Run another entity which randomly deletes khealth's k8s resources
  3. Synthetically intermittent network access to the API server

rcscheduler canary pods get stuck in pending state

kubectl output

NAME                        READY     STATUS    RESTARTS   AGE
khealth-dvgm0               1/1       Running   0          20h
khealth-rcscheduler-4n2ps   0/1       Pending   0          1m
khealth-rcscheduler-6vcl4   1/1       Running   0          1m
khealth-rcscheduler-qzbcx   1/1       Running   0          1m

This happens quite often on a v1.1.2 cluster running on AWS. The behavior does not appear until rcscheduler has been running for about a day.
