

kubernetes-mixin's Issues

Split control-plane vs non-control plane alerts

We include the kubernetes-mixin for monitoring in the kube-prometheus stack, and a common point of frustration is that all alerts are always shipped, even on managed Kubernetes clusters like GKE or AKS. On those clusters it is often not possible to retrieve the metrics necessary to monitor the control-plane components.

While it would be possible to hand-pick or filter alerts, my feeling is that splitting the alerts into these two groups would also be beneficial in a world where a single Prometheus server is not sufficient to monitor an entire cluster, or in multi-tenant Kubernetes environments. In these scenarios we are seeing people assign a Prometheus server per tenant (typically made up of one or more namespaces), and the responsibility of that tenant is not to monitor the Kubernetes cluster itself but primarily the workload.

This would not be a breaking change, as the entrypoint (as in the .libsonnet file imported by people) for the alerting rules would stay the same.
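
A minimal sketch of what the split could look like while keeping a single entrypoint (file and group names below are placeholders, not the mixin's actual layout):

    // alerts/alerts.libsonnet stays the entrypoint, but is assembled from two parts:
    (import 'control_plane_alerts.libsonnet') +
    (import 'workload_alerts.libsonnet')

    // A managed-cluster (GKE/AKS) user could then import only the workload half:
    // (import 'kubernetes-mixin/alerts/workload_alerts.libsonnet')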

@tomwilkie @metalmatze

Add unit tests for alerts

With the next release of Prometheus, promtool will have the ability to unit test alerts.

We should write tests, at least for the most complex alerts, and start running them in our CI.
What do you think about integrating the tests into the alert jsonnet objects as well?
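
A rough sketch of how such a test could look, expressed in jsonnet so it could live next to the alert definitions and be rendered to YAML for `promtool test rules`. The alert name, series, labels, and timings below are illustrative, assuming promtool's rule-test file format:

    {
      rule_files: ['prometheus_alerts.yaml'],
      evaluation_interval: '1m',
      tests: [
        {
          interval: '1m',
          input_series: [
            {
              series: 'kube_deployment_spec_replicas{job="kube-state-metrics",namespace="default",deployment="app"}',
              values: '3x30',
            },
            {
              series: 'kube_deployment_status_replicas_available{job="kube-state-metrics",namespace="default",deployment="app"}',
              values: '0x30',
            },
          ],
          alert_rule_test: [
            {
              eval_time: '20m',
              alertname: 'KubeDeploymentReplicasMismatch',
              exp_alerts: [
                { exp_labels: { severity: 'critical', namespace: 'default', deployment: 'app', job: 'kube-state-metrics' } },
              ],
            },
          ],
        },
      ],
    }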

/cc @tomwilkie @brancz @codesome

Amend Memory Utilisation Recording Rule

For calculating utilisation as a function of available over total we do:

           record: ':node_memory_utilisation:',
            expr: |||
              1 -
              sum(node_memory_MemFree{%(nodeExporterSelector)s} + node_memory_Cached{%(nodeExporterSelector)s} + node_memory_Buffers{%(nodeExporterSelector)s})
              /
              sum(node_memory_MemTotal{%(nodeExporterSelector)s})
            ||| % $._config,
          },

Spoke to SuperQ about measuring memory, and according to him "MemFree + Cached + Buffers is a somewhat obsolete set of metrics, there was a post about that somewhere buried on LKML."

The recommendation is to just use MemFree. Making a note here to remind myself to dig out the LKML post and make a PR if necessary.
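
A hedged sketch of what the amended rule could look like if we follow that recommendation (free memory only, nothing else changed):

            record: ':node_memory_utilisation:',
            expr: |||
              1 -
              sum(node_memory_MemFree{%(nodeExporterSelector)s})
              /
              sum(node_memory_MemTotal{%(nodeExporterSelector)s})
            ||| % $._config,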

Control plane dashboards

The kube-prometheus stack used to have useful dashboards for the Kubernetes control plane. It would be nice to re-introduce those. Their jsonnet definitions were last present in this commit of the kubernetes-grafana package.

@metalmatze @tomwilkie

KubeVersionMismatch on Amazon EKS

Hi,

I found the following GitHub issue, noticed a corresponding issue hadn't been opened here, and I am running into the same problem: prometheus-operator/prometheus-operator#1977

kubernetes_build_info

kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.11.142:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-10-142.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"}	1

kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.13.100:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-12-100.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"}	1

kubernetes_build_info{buildDate="2018-12-06T01:35:29Z",compiler="gc",endpoint="https-metrics",gitCommit="753b2dbc622f5cc417845f0ff8a77f539a4213ea",gitTreeState="clean",gitVersion="v1.11.5",goVersion="go1.10.3",instance="10.12.12.127:10250",job="kubelet",major="1",minor="11",namespace="kube-system",node="ip-10-86-14-127.us-west-2.compute.internal",platform="linux/amd64",service="prometheus-operator-kubelet"}	1

kubernetes_build_info{buildDate="2018-12-06T23:13:14Z",compiler="gc",endpoint="https",gitCommit="6bad6d9c768dc0864dab48a11653aa53b5a47043",gitTreeState="clean",gitVersion="v1.11.5-eks-6bad6d",goVersion="go1.10.3",instance="10.12.55.98:443",job="apiserver",major="1",minor="11+",namespace="default",platform="linux/amd64",service="kubernetes"}	1

kubernetes_build_info{buildDate="2018-12-06T23:13:14Z",compiler="gc",endpoint="https",gitCommit="6bad6d9c768dc0864dab48a11653aa53b5a47043",gitTreeState="clean",gitVersion="v1.11.5-eks-6bad6d",goVersion="go1.10.3",instance="10.12.55.131:443",job="apiserver",major="1",minor="11+",namespace="default",platform="linux/amd64",service="kubernetes"}	1

More clearly, the gitVersions don't match
sum by (gitVersion) (kubernetes_build_info)

{gitVersion="v1.11.5"}	3
{gitVersion="v1.11.5-eks-6bad6d"}	2

Measuring CPU Utilisation

In rules.libsonnet we define CPU utilisation as the share of CPU time not spent idle, as per below.

            // CPU utilisation is % CPU is not idle.
            record: ':node_cpu_utilisation:avg1m',
            expr: |||
              1 - avg(rate(node_cpu{%(nodeExporterSelector)s,mode="idle"}[1m]))
            ||| % $._config,
          },

There are two initial aspects I would like to clarify for the purpose of accurately measuring CPU time as a utilisation metric (see the sketch after the list):

  • Should we consider mode="iowait" as part of active CPU time?
  • I understand (from limited reading of the man page) that guest time is "time spent running a virtual CPU" which is added into user time. Would that mean that we are double counting guest time with our current rule?
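
A hedged sketch of one possible answer to the first question, treating iowait like idle (i.e. not counting it as active CPU time); whether that is the right call is exactly what is up for discussion here:

            record: ':node_cpu_utilisation:avg1m',
            expr: |||
              1 - avg(rate(node_cpu{%(nodeExporterSelector)s,mode=~"idle|iowait"}[1m]))
            ||| % $._config,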

Storage specific panels under k8s/Compute Resources/Pod

Currently there are no storage-specific metrics displayed under any of the k8s dashboards. This issue is being raised to incorporate storage-specific panels under k8s/Compute Resources/Pod. The proposed changes are attached for reference here.

[screenshot attached: proposed storage panels, 2018-11-27]

Flaky Kubernetes API latency alert

We are seeing the Kubernetes API latency alert fire inconsistently. This is because the latency of a list request inherently depends on the number of items being returned by that request.

I would propose that we either ignore "list" requests altogether in terms of latency, like we already do for other verbs:

cluster_quantile:apiserver_request_latencies:histogram_quantile{%(kubeApiserverSelector)s,quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1

Or at least treat it separately with different thresholds.
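
A hedged sketch of the first option, simply adding LIST to the excluded verbs in the existing expression:

    cluster_quantile:apiserver_request_latencies:histogram_quantile{%(kubeApiserverSelector)s,quantile="0.99",subresource!="log",verb!~"^(?:LIST|WATCH|WATCHLIST|PROXY|CONNECT)$"} > 1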

Let me know what you think.

@tomwilkie @metalmatze

Remove kube-state-metrics labels from Kubernetes workload alerts

It is often confusing for users when there are alerts about Kubernetes workloads (deployments, daemonsets, statefulsets, etc.) that at first sight appear to be coming from the kube-state-metrics target. We should probably drop any labels that identify kube-state-metrics and just leave the actual contextual information, like the object name and namespace.

My hunch is that this would need to be configurable. I understand that, for example, in the Kausal ksonnet-prometheus package this would be the instance label, but in most other setups out there (such as the default Prometheus configuration from the Prometheus repo and the Prometheus Operator) these will be labels carrying the respective Kubernetes resource (pod/service/namespace/etc.). It's also reasonable that people can do this however they like.
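
As an illustration only (a hedged sketch, not necessarily how the mixin should do it; the kubeStateMetricsSelector placeholder follows the config style used elsewhere), the target-identifying labels could be aggregated away in the alert expressions themselves, with the label list made configurable:

              expr: |||
                max without (instance, pod) (
                  kube_deployment_spec_replicas{%(kubeStateMetricsSelector)s}
                    !=
                  kube_deployment_status_replicas_available{%(kubeStateMetricsSelector)s}
                )
              ||| % $._config,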

@tomwilkie @metalmatze

Support for generic resources in prometheus-operator?

Hi! Is there a plan to expose the kube_pod_container_resource_requests field? This would allow monitoring requests other than CPU and memory; in particular, I need GPU requests.
(I'm looking here: https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/pod-metrics.md)

Also, for nodes I'd like to have access to kube_node_status_allocatable and kube_node_status_capacity, to also see the GPUs. (https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/node-metrics.md)

Thanks!

Evaluate Grafana Kubernetes plugin dashboards

In prometheus-operator/prometheus-operator#1544 @jalberto reported that the Kubernetes dashboards of the Grafana Kubernetes plugin are of high quality and would add a lot of value to the stack. I briefly looked at some of the dashboards and I think there are some elements that would certainly be valuable to transfer into the Kubernetes monitoring mixin.

My personal opinion on the Grafana Kubernetes plugin is that it does too much, as I practically have to give it a certificate with cluster-admin rights in my Kubernetes cluster, which isn't necessary with this monitoring mixin. Nevertheless, the dashboards seem useful.

@tomwilkie @metalmatze

Cannot use rules on mixed kube-state-metrics/node-exporter deployments

In our setup, we have two clusters:

  1. One of them is under our control and we have node-exporter (NE) deployed and scraped from the Prometheus running in that cluster.
  2. But we also have a second cluster where no NE is deployed. The Prometheus in that "foreign" cluster is federated from our Prometheus, adding a cluster label to all imported metrics.

In both clusters, kube-state-metrics (KSM) is deployed.

This leads us with a situation where in our Prometheus we now have KSM+NE metrics about our nodes and only KSM metrics about the foreign nodes. This creates a discrepancy between the "nodes as seen by KSM" and "nodes as seen by NE". As a result, the rules break because Prometheus gets confused about the grouping labels.

For our use case, we fixed this by restricting the two "base recording rules", :kube_pod_info_node_count: and node_namespace_pod:kube_pod_info:, to only count KSM metrics without a cluster label (edited directly inside the generated YAML, for testing purposes):

 - name: node.rules
   rules:
-  - expr: sum(min(kube_pod_info) by (node))
+  - expr: sum(min(kube_pod_info{cluster=""}) by (node))
     record: ':kube_pod_info_node_count:'
   - expr: |
-      max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
+      max(label_replace(kube_pod_info{cluster="",job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
     record: 'node_namespace_pod:kube_pod_info:'
   - expr: |
       count by (node) (sum by (node, cpu) (

This seems to have fixed the problem. I was wondering if we can/should submit a PR to introduce a new config variable to the mixins to allow people to customize the selection of KSM metrics.
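
A minimal sketch of the idea, assuming the mixin exposes (or grows) a selector config field for kube-state-metrics, shown here as kubeStateMetricsSelector:

    {
      _config+:: {
        // Restrict which kube-state-metrics series the base recording rules see.
        kubeStateMetricsSelector: 'job="kube-state-metrics",cluster=""',
      },
    }

    // The recording rules would then interpolate it, e.g.:
    //   max(label_replace(kube_pod_info{%(kubeStateMetricsSelector)s}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)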

"mapping values are not allowed in this context" error from promtool

I was running promtool over the generated rule files and ran into the following. This is likely a bug in jsonnet's yaml generation, but do you have any thoughts?

Note the ''s around the record: value.

$ cat /tmp/mapping.rules.yaml 
groups:
  - name: "node.rules"
    rules:
      - record: ':kube_pod_info_node_count:'
        expr: sum(min by(node) (kube_pod_info))

$ promtool check rules /tmp/mapping.rules.yaml 
Checking /tmp/mapping.rules.yaml
  SUCCESS: 1 rules found
$ cat /tmp/mapping-bad.rules.yaml 
groups:
  - name: "node.rules"
    rules:
      - record: :kube_pod_info_node_count:
        expr: sum(min by(node) (kube_pod_info))

$ promtool check rules /tmp/mapping-bad.rules.yaml 
Checking /tmp/mapping-bad.rules.yaml
  FAILED:
yaml: line 3: mapping values are not allowed in this context

pods.libsonnet dashboard is broken

  • CPU Usage doesn't work, and if 'fixed' it just shows an incrementally increasing line

  • Left Y axis in Memory Usage graph is wrong - it's 'short' instead of 'bytes'

The above assumes that the manifests provided in 'prometheus-operator/contrib/kube-prometheus/manifests' are in sync with the jsonnet here.

alerts/KubePersistentVolumeFullInFourDays: Filter based on exported_namespace not namespace

The kubelet_volume_stats_used_bytes metric exposed by the kubelet will always carry the namespace="kube-system" label. We inject the prefixedNamespaceSelector into the KubePersistentVolumeFullInFourDays alert to restrict its scope by Kubernetes namespace. prefixedNamespaceSelector uses the namespace label key. Instead we should use the exported_namespace label key.

What are your thoughts?

KubePersistentVolumeFullInFourDays:

{
            alert: 'KubePersistentVolumeFullInFourDays',
            expr: |||
              (
                kubelet_volume_stats_used_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}
                  /
                kubelet_volume_stats_capacity_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}
              ) > 0.85
              and
              predict_linear(kubelet_volume_stats_available_bytes{%(prefixedNamespaceSelector)s%(kubeletSelector)s}[%(volumeFullPredictionSampleTime)s], 4 * 24 * 3600) < 0
            ||| % $._config,
            'for': '5m',
            labels: {
              severity: 'critical',
            },
            annotations: {
              message: 'Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value }} bytes are available.',
            },
},

Origin: https://bugzilla.redhat.com/show_bug.cgi?id=1634302

CPUThrottlingHigh false positives

Hi,

Since the CPUThrottlingHigh alert was added, it has been firing in my cluster for a lot of pods. As most of the affected pods are not even at their assigned CPU limit, I assume the expression for the alert is wrong (either a miscalculation or, what seems more likely, container_cpu_cfs_throttled_periods_total includes different kinds of throttling).

This needs further investigation to be sure where it comes from, but as it stands the alert is not useful. (With about 250 pods running and a 25% threshold I observe >100 alerts; with a 50% threshold, ~20 alerts.)
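
For debugging, a hedged query that computes the fraction of CFS periods in which each container was throttled, which, as far as I understand, is what the alert is based on:

    sum by (namespace, pod_name, container_name) (increase(container_cpu_cfs_throttled_periods_total[5m]))
      /
    sum by (namespace, pod_name, container_name) (increase(container_cpu_cfs_periods_total[5m]))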

Modifying Existing Alerts

Adding new alerts by extending the prometheus_alerts+:: structure is nice and simple, but because the alerts live in a list, adding labels or annotations to existing alerts isn't.

For example, one might want to modify the message annotation to link to playbooks. Speaking to Tom, if this were a dictionary rather than a list, that would become possible.
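
In the meantime, a hedged workaround sketch: map over the generated groups and patch a single alert by name (the helper and the alert/annotation names below are illustrative, not an official extension point):

    local patchAlert(groups, name, patch) = [
      g { rules: [
        if std.objectHas(r, 'alert') && r.alert == name then r + patch else r
        for r in g.rules
      ] }
      for g in groups
    ];

    {
      prometheus_alerts+:: {
        groups: patchAlert(super.groups, 'KubePodCrashLooping', {
          annotations+: { runbook_url: 'https://example.com/runbooks/KubePodCrashLooping' },
        }),
      },
    }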

Dashboards not generated

# I am on a fresh git clone
$ git reset --hard
HEAD is now at 297f40f Merge pull request #21 from richerve/fix/remove-POD-resources-pod

$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

# I remove leftovers from my previous experiments...
$ rm -rf dashboards_out

# I do not understand why this fails...
$ make dashboards_out
jsonnet -J vendor -m dashboards_out lib/dashboards.jsonnet
RUNTIME ERROR: Field does not exist: grafanaDashboards
		object <anonymous>
	lib/dashboards.jsonnet:1:20-66	thunk <dashboards>
	lib/dashboards.jsonnet:5:32-41	thunk <o>
	std.jsonnet:955:28	
	std.jsonnet:955:9-36	function <anonymous>
	lib/dashboards.jsonnet:5:15-42	thunk <a>
	lib/dashboards.jsonnet:(3:1)-(6:1)	function <anonymous>
	lib/dashboards.jsonnet:(3:1)-(6:1)	
make: *** [dashboards_out] Error 1

# ... but this seems to fix it.
$ jb install
Cloning into 'vendor/.tmp/jsonnetpkg-grafonnet-master635056242'...
remote: Counting objects: 961, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 961 (delta 0), reused 0 (delta 0), pack-reused 958
Receiving objects: 100% (961/961), 300.41 KiB | 886.00 KiB/s, done.
Resolving deltas: 100% (544/544), done.
Already on 'master'
Your branch is up to date with 'origin/master'.

# Works without any errors now
$ make dashboards_out
make: 'dashboards_out' is up to date.

# But nothing gets generated, why?
$ ls -la dashboards_out/
total 0
drwxr-xr-x   2 lvlcek  staff   64 Jun 13 12:37 .
drwxr-xr-x  17 lvlcek  staff  544 Jun 13 12:37 ..

Allow scoping alerts to certain namespaces

We have two use cases for scoping alerts to certain namespaces:

  1. There is a Prometheus server that collects all cluster-wide metrics from kubelets, cAdvisor, and kube-state-metrics for a cluster that has multiple independent users/tenants. This adds significant cognitive overhead for the SREs responsible for offering the cluster as a service, throughout the entire pipeline (Prometheus alerts page, alerts fired against Alertmanager, list of alerts in Alertmanager), when all they care about are the alerts for the cluster components and infrastructure.

  2. Different users/tenants may have different configurations of the "application" specific alerts of this repository.

@metalmatze @tomwilkie Do you think this is something we should optionally allow defining? I think the default behavior should continue to be what we have today.

sum(container_memory_usage_bytes{...}) rule doubles values

The data behind the sum(container_* ...) rules is reported twice: in addition to another target (node-exporter, I think?), cAdvisor within the kubelet reports the same metric names, albeit with different labels.

The label selectors in the rules in the default rules file collect both sets of records, which results in values that are exactly double reality.

I think the solution here is to just use the service="kubelet" and container_name!="" label selectors, and then there is no need for a sum().
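
A hedged sketch of what a deduplicated rule could look like with those selectors applied (keeping the sum for safety, though as noted it may not even be needed; the service="kubelet" label depends on your scrape config):

    - record: pod_name:container_memory_usage_bytes:sum
      expr: sum by(pod_name) (container_memory_usage_bytes{service="kubelet",container_name!="POD",container_name!="",pod_name!=""})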

Originally posted here:
prometheus-operator/prometheus-operator#2302

What did you do?
Installed Prometheus chart and friends via Helm in a K8s cluster created by Kubeadm 1.11

What did you expect to see?
Correct values aggregated by the rules:


record: pod_name:container_memory_usage_bytes:sum
expr: sum by(pod_name) (container_memory_usage_bytes{container_name!="POD",pod_name!=""})

record: pod_name:container_spec_cpu_shares:sum
expr: sum by(pod_name) (container_spec_cpu_shares{container_name!="POD",pod_name!=""})

record: pod_name:container_cpu_usage:sum
expr: sum by(pod_name) (rate(container_cpu_usage_seconds_total{container_name!="POD",pod_name!=""}[5m]))

record: pod_name:container_fs_usage_bytes:sum
expr: sum by(pod_name) (container_fs_usage_bytes{container_name!="POD",pod_name!=""})

If the rules were changed to just use the output from Kubelet, a sum() would not be necessary. This would require setting {service="kubelet", container_name!=""}

What did you see instead? Under which circumstances?: In addition to node-exporter (I think?) exporting data under these metric names, the kubelet also reports data under these names, albeit with different labels. The kubelet reports the exact sum of all containers in the Pod, so the above rules report a value that is exactly double the actual value.

Environment

  • Prometheus Operator version:

    Image ID: docker-pullable://quay.io/coreos/prometheus-operator@sha256:faa9f8a9045092b9fe311016eb3888e2c2c824eb2b4029400f188a765b97648a

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.4", GitCommit:"f49fa022dbe63faafd0da106ef7e05a29721d3f1", GitTreeState:"clean", BuildDate:"2018-12-14T07:10:00Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.0", GitCommit:"91e7b4fd31fcd3d5f436da26c980becec37ceefe", GitTreeState:"clean", BuildDate:"2018-06-27T20:08:34Z", GoVersion:"go1.10.2", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

    kubeadm on bare metal

  • Manifests:

https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L22

  • Prometheus Operator Logs:

not relevant

Storage specific panels under k8s/Compute Resources/Namespace

Currently there are no storage-specific metrics displayed under any of the k8s dashboards. This issue is being raised to incorporate storage-specific panels under k8s/Compute Resources/Namespace. The proposed changes are attached for reference here.

[screenshot attached: proposed storage panels, 2018-11-27]

Warning many-to-many matching not allowed

level=warn ts=2018-11-09T12:24:37.99045291Z caller=manager.go:343 component="rule manager" group=kubernetes msg="Evaluating rule failed" rule="record: namespace_name:kube_pod_container_resource_requests_memory_bytes:sum\nexpr: sum by(namespace, label_name) (sum by(namespace, pod) (kube_pod_container_resource_requests_memory_bytes{job=\"kube-state-metrics\"})\n  * on(namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job=\"kube-state-metrics\"},\n  \"pod_name\", \"$1\", \"pod\", \"(.*)\"))\n" err="many-to-many matching not allowed: matching labels must be unique on one side"

I got this warning with Prometheus v2.3.2.
I've changed the expressions of kube_pod_container_resource_requests_memory_bytes, kube_pod_container_resource_requests_cpu_cores and node_num_cpu, using ignoring instead of on.
This is the code:

    - record: node:node_num_cpu:sum
      expr: count by (node) (sum by (node, cpu) (node_cpu_seconds_total{job="node-exporter"} * ignoring (namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))

    - record: "namespace_name:kube_pod_container_resource_requests_memory_bytes:sum"
      expr: sum by (namespace, label_name) (sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) by (namespace, pod) * ignoring (namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))

    - record: "namespace_name:kube_pod_container_resource_requests_cpu_cores:sum"
      expr: sum by (namespace, label_name) (sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod) * ignoring (namespace, pod) group_left(label_name) label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)"))

I know that the ignoring operator excludes the labels inside the brackets from the matching. The warning is solved now, but I'm not sure the rules are actually working correctly; I'm still trying to test them.

Can someone validate this for me?

Storage specific panels under k8s/Compute Resources/Cluster

Currently there are no storage-specific metrics displayed under any of the k8s dashboards. This issue is being raised to incorporate storage-specific panels under k8s/Compute Resources/Cluster. The proposed changes are attached for reference here.

[screenshot attached: proposed storage panels, 2018-11-27]

node_exporter v0.16 metric name changes

node_exporter renamed several metric names in v0.16.0. See https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md#0160--2018-05-15

I've compiled a list of metric names in use by this project and have provided what they should be changed to:

node_cpu -> node_cpu_seconds_total
node_memory_MemTotal -> node_memory_MemTotal_bytes
node_memory_Buffers -> node_memory_Buffers_bytes
node_memory_Cached -> node_memory_Cached_bytes
node_memory_MemFree -> node_memory_MemFree_bytes
node_disk_bytes_read -> node_disk_read_bytes_total
node_disk_bytes_written -> node_disk_written_bytes_total
node_disk_io_time_ms -> node_disk_io_time_seconds_total
node_disk_io_time_weighted -> node_disk_io_time_weighted_seconds_total
node_filesystem_size -> node_filesystem_size_bytes
node_filesystem_avail -> node_filesystem_avail_bytes
node_network_receive_bytes -> node_network_receive_bytes_total
node_network_transmit_bytes -> node_network_transmit_bytes_total
node_network_receive_drop -> node_network_receive_drop_total
node_network_transmit_drop -> node_network_transmit_drop_total
node_boot_time -> node_boot_time_seconds

Make range configurable in dashboard queries

E.g. dashboards like "K8s / Compute Resources / Cluster" use the query sum(irate(container_cpu_usage_seconds_total[1m])) by (namespace).
That creates a hidden dependency: the scrape job can't run less frequently than the 1m range used in the query.

I think the range should be configurable.
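
A minimal sketch of the idea, threading a configurable range through the dashboard queries via _config (the rateInterval field name below is a placeholder, not an existing option):

    {
      _config+:: {
        rateInterval: '4m',
      },

      // Dashboard queries would then be built from the config, e.g.:
      clusterCPUQuery:: 'sum(irate(container_cpu_usage_seconds_total[%(rateInterval)s])) by (namespace)' % $._config,
    }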

Readme instructions broken

➜  prometheus git:(master) ✗ ks registry add kausal https://github.com/kausalco/public
➜  prometheus git:(master) ✗ ks pkg install kausal/prometheus-ksonnet
ERROR GET https://api.github.com/repos/kausalco/public/contents//prometheus-ksonnet/parts.yaml?ref=6c037aa65f54edadbdcebd6fc0a2ecf167f19109: 404 Not Found []
➜  prometheus git:(master) ✗ ks version
ksonnet version: 0.8.0
jsonnet version: v0.9.5
client-go version: v1.6.8-beta.0+$Format:%h$

Prometheus alert

Can you please help me write a PromQL query to trigger an alert when a node is in an unschedulable state? Thanks in advance!
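
Not part of the mixin, but a hedged sketch using the kube_node_spec_unschedulable metric from kube-state-metrics (alert name and thresholds are illustrative):

    {
      alert: 'KubeNodeUnschedulable',
      expr: |||
        kube_node_spec_unschedulable{job="kube-state-metrics"} == 1
      |||,
      'for': '15m',
      labels: {
        severity: 'warning',
      },
      annotations: {
        message: 'Node {{ $labels.node }} has been unschedulable (cordoned) for more than 15 minutes.',
      },
    },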

KubeCPUOvercommit Firing Artificially

Problem

The recording rule for namespace_name:kube_pod_container_resource_requests_cpu_cores:sum appears to include requests for Pods which aren't running (e.g. those which have been Evicted, Completed, etc). This means the alert for KubeCPUOvercommit fires artificially.

In one of our clusters it indicates we've requested 5x the amount of CPU available, yet we're actually well within capacity and still able to schedule workloads successfully.

What I'd expect

The sum for this rule should only consider requests for pods which are actually holding onto resources.

This probably isn't perfect but it does produce a more reasonable result for the amount of requested CPU.

sum(kube_pod_container_resource_requests_cpu_cores) by(pod) 
and on(pod) 
(kube_pod_status_scheduled{condition="true"})

How to debug KubeAPILatencyHigh?

Hi,

Does anyone have experience debugging KubeAPILatencyHigh?

I have a Kubernetes cluster that fires KubeAPILatencyHigh almost all the time, but I don't know how to debug it. Can anyone share some experience?

Here is some information about the Kubernetes cluster:

bootstrap tool: kubeadm v1.11.3
master: 1 x (4 cpus, 8G ram, 80G ssd) virtual machine
nodes: 20 x (64 cpus, 256G ram, 2t ssd)
network: flannel v0.10.0
dns: coredns x 3 replica, no autoscale.
total pods: ~300

example cpu usage of master:

$ mpstat 1 5
Linux 4.4.0-138-generic 	11/26/2018 	_x86_64_	(4 CPU)

04:52:11 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:52:12 PM  all    3.05    0.00    2.03    0.00    0.00    0.76    0.00    0.00    0.00   94.16
04:52:13 PM  all    5.08    0.00    1.78    0.25    0.00    0.25    0.00    0.00    0.00   92.64
04:52:14 PM  all    2.56    0.00    2.05    0.26    0.00    0.26    0.00    0.00    0.00   94.88
04:52:15 PM  all    3.87    0.00    2.58    0.00    0.00    0.26    0.00    0.00    0.00   93.30
04:52:16 PM  all    4.85    0.26    3.32    0.26    0.00    0.26    0.26    0.00    0.00   90.82
Average:     all    3.88    0.05    2.35    0.15    0.00    0.36    0.05    0.00    0.00   93.16

example kube-api log:

I1127 00:24:51.038592       1 trace.go:76] Trace[38993630]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-11-27 00:24:50.293798654 +0000 UTC m=+1142201.756391166) (total time: 744.706806ms):
Trace[38993630]: [744.609062ms] [744.602896ms] About to write a response
I1127 00:24:51.039953       1 trace.go:76] Trace[1687150485]: "Get /api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx" (started: 2018-11-27 00:24:50.386678665 +0000 UTC m=+1142201.849271164) (total time: 653.229158ms):
Trace[1687150485]: [653.135664ms] [653.13091ms] About to write a response
I1127 00:24:58.694129       1 trace.go:76] Trace[1818362417]: "Get /api/v1/namespaces/ingress-nginx/configmaps/ingress-controller-leader-nginx" (started: 2018-11-27 00:24:54.426466538 +0000 UTC m=+1142205.889059018) (total time: 4.267603386s):
Trace[1818362417]: [4.267485115s] [4.267480859s] About to write a response
I1127 00:24:58.699821       1 trace.go:76] Trace[341574261]: "Get /api/v1/namespaces/default" (started: 2018-11-27 00:24:57.166795151 +0000 UTC m=+1142208.629387685) (total time: 1.53298777s):
Trace[341574261]: [1.532910089s] [1.532906249s] About to write a response
I1127 00:24:58.700395       1 trace.go:76] Trace[2008873298]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:57.434586023 +0000 UTC m=+1142208.897178525) (total time: 1.265775612s):
Trace[2008873298]: [1.265702198s] [1.265695779s] About to write a response
I1127 00:24:58.700869       1 trace.go:76] Trace[1072363034]: "Get /api/v1/namespaces/kube-system/secrets/cronjob-controller-token-fh2qf" (started: 2018-11-27 00:24:57.14580432 +0000 UTC m=+1142208.608396862) (total time: 1.555031827s):
Trace[1072363034]: [1.554907629s] [1.554903089s] About to write a response
I1127 00:24:58.701159       1 trace.go:76] Trace[336371808]: "Get /api/v1/namespaces/kube-system/endpoints/kube-scheduler" (started: 2018-11-27 00:24:56.254123782 +0000 UTC m=+1142207.716716427) (total time: 2.446951069s):
Trace[336371808]: [2.446725016s] [2.446719549s] About to write a response
I1127 00:24:58.701375       1 trace.go:76] Trace[1790922014]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.827729065 +0000 UTC m=+1142208.290321563) (total time: 1.873614851s):
Trace[1790922014]: [1.873510708s] [1.873503473s] About to write a response
I1127 00:24:58.701774       1 trace.go:76] Trace[691272094]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-11-27 00:24:55.06580382 +0000 UTC m=+1142206.528396298) (total time: 3.635896447s):
Trace[691272094]: [3.635808899s] [3.635802985s] About to write a response
I1127 00:24:58.701874       1 trace.go:76] Trace[818045190]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.474983542 +0000 UTC m=+1142207.937576130) (total time: 2.226805588s):
Trace[818045190]: [2.226744453s] [2.226735053s] About to write a response
I1127 00:24:58.702327       1 trace.go:76] Trace[1614989265]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:56.149572743 +0000 UTC m=+1142207.612165251) (total time: 2.552722949s):
Trace[1614989265]: [2.552662992s] [2.552657858s] About to write a response
I1127 00:24:58.702737       1 trace.go:76] Trace[1447455717]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:55.372165695 +0000 UTC m=+1142206.834758232) (total time: 3.33054224s):
Trace[1447455717]: [3.330453031s] [3.330444801s] About to write a response
I1127 00:24:58.703571       1 trace.go:76] Trace[1945409431]: "Get /api/v1/namespaces/ingress-nginx/secrets/nginx-ingress-serviceaccount-token-22th5" (started: 2018-11-27 00:24:55.283078173 +0000 UTC m=+1142206.745670715) (total time: 3.420351935s):
Trace[1945409431]: [3.420292916s] [3.420286689s] About to write a response
I1127 00:24:58.703958       1 trace.go:76] Trace[1226849696]: "Get /api/v1/namespaces/kube-system/secrets/node-problem-detector-token-jppnn" (started: 2018-11-27 00:24:54.577116058 +0000 UTC m=+1142206.039708599) (total time: 4.126802508s):
Trace[1226849696]: [4.126740738s] [4.126734225s] About to write a response
I1127 00:24:58.721163       1 trace.go:76] Trace[994305666]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:55.094594747 +0000 UTC m=+1142206.557187342) (total time: 3.626534954s):
Trace[994305666]: [3.626430436s] [3.624376327s] Transaction committed
I1127 00:24:58.724937       1 trace.go:76] Trace[970948525]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.045435379 +0000 UTC m=+1142208.508027980) (total time: 1.679259404s):
Trace[970948525]: [1.67896677s] [1.676837608s] Transaction committed
I1127 00:24:58.725494       1 trace.go:76] Trace[1332645750]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.076805959 +0000 UTC m=+1142208.539398517) (total time: 1.648639757s):
Trace[1332645750]: [1.648314409s] [1.646355104s] Transaction committed
I1127 00:24:58.727250       1 trace.go:76] Trace[1944527049]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.151655982 +0000 UTC m=+1142208.614248610) (total time: 1.575550502s):
Trace[1944527049]: [1.575378497s] [1.572909147s] Transaction committed
I1127 00:24:58.727857       1 trace.go:76] Trace[283895130]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-11-27 00:24:57.097471349 +0000 UTC m=+1142208.560063960) (total time: 1.630274405s):
Trace[283895130]: [1.630169627s] [1.628395238s] Transaction committed

KubePersistentVolumeFullInFourDays alert flaps for prometheus persistent volumes

We are seeing this alert fire and resolve multiple times per day:
[screenshot attached: alert firing and resolving repeatedly, 2018-09-26]

I believe this is partially because we're creating a bunch of short-running jobs/pods, which creates a bunch of new series in Prometheus, but I also think this alert might be a little too sensitive in trying to predict 4 days of growth from 1 hour of data (especially using a simple linear model).

If I set the prediction based on the last 24h, we wouldn't have any alerts:
[screenshot attached: prediction based on a 24h window, 2018-09-26]

I'm not sure how the alert would behave in the first 24 hours with the above example.

Empty graphs in USE Method dashboards

All graphs besides Disk Utilisation are empty.
It looks like the queries using group_left are all returning no results.
Example:

record: node:node_cpu_utilisation:avg1m
expr: 1
  - avg by(node) (rate(node_cpu{job="node-exporter",mode="idle"}[1m])
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)

Improve KubePersistentVolumeFullInFourDays; only trigger on > 85 %

I suggest only triggering the KubePersistentVolumeFullInFourDays alert when disk usage is above 85%.

Reasoning: after the alert has fired and an engineer has fixed the issue, the alert should resolve immediately.


A similar thing was done by @brancz in prometheus-operator/prometheus-operator#1857:

diff --git a/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet b/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
index 5c24f09f..27039f4e 100644
--- a/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
+++ b/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/node.libsonnet
@@ -7,11 +7,10 @@
           {
             alert: 'NodeDiskRunningFull',
             annotations: {
-              description: 'device {{$labels.device}} on node {{$labels.instance}} is running full within the next 24 hours (mounted at {{$labels.mountpoint}})',
-              summary: 'Node disk is running full within 24 hours',
+              message: 'Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} is running full within the next 24 hours.',
             },
             expr: |||
-              predict_linear(node_filesystem_free{%(nodeExporterSelector)s,mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24) < 0 and on(instance) up{%(nodeExporterSelector)s}
+              (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
             ||| % $._config,
             'for': '30m',
             labels: {
@@ -21,11 +20,10 @@
           {
             alert: 'NodeDiskRunningFull',
             annotations: {
-              description: 'device {{$labels.device}} on node {{$labels.instance}} is running full within the next 2 hours (mounted at {{$labels.mountpoint}})',
-              summary: 'Node disk is running full within 2 hours',
+              message: 'Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} is running full within the next 2 hours.',
             },
             expr: |||
-              predict_linear(node_filesystem_free{%(nodeExporterSelector)s,mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[30m], 3600 * 2) < 0 and on(instance) up{%(nodeExporterSelector)s}
+              (node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
             ||| % $._config,
             'for': '10m',
             labels: {

Originating from https://bugzilla.redhat.com/show_bug.cgi?id=1632762.

Add Job and CronJob rules

Hello, as mentioned in the Kubernetes Slack #monitoring-mixin channel, I don't see anything equivalent to the Job/CronJob rules below.

I've never encountered ksonnet before, so I'm not sure I can translate that job rules file in a timely fashion. I'm also not sure whether it should be added to an existing file or whether it warrants a completely separate file for jobs. I would appreciate any guidance or suggestions.

groups:
- name: job.rules
  rules:
  - alert: CronJobRunning
    expr: time() -kube_cronjob_next_schedule_time > 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      description: CronJob {{$labels.namespaces}}/{{$labels.cronjob}} is taking more than 1h to complete
      summary: CronJob didn't finish after 1h

  - alert: JobCompletion
    expr: kube_job_spec_completions - kube_job_status_succeeded  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job completion is taking more than 1h to complete
        cronjob {{$labels.namespaces}}/{{$labels.job}}
      summary: Job {{$labels.job}} didn't finish to complete after 1h

  - alert: JobFailed
    expr: kube_job_status_failed  > 0
    for: 1h
    labels:
      severity: warning
    annotations:
      description: Job {{$labels.namespaces}}/{{$labels.job}} failed to complete
      summary: Job failed

KubeCPUOvercommit is firing when using Jobs for batch processing

We have a use case for a namespace in kubernetes where we submit a large number of Jobs to our cluster without expecting immediate scheduling.

We've tested these workloads with a recently deployed kube-prometheus stack and found that the KubeCPUOvercommit alert fires whenever we submit a large batch of Jobs at once. I believe there are 2 issues here:

  • Pods stick around even when a Job completes successfully. kube-state-metrics will continue reporting metrics about that pod and this alert is assuming that those resource requests are still valid.
  • Even with auto-scaling enabled, we can reach peaks where we have more resource requests with pending jobs than available with whatever maximum number of nodes we've set. In our use case, we're ok with the scheduler working through all the Jobs as resources are freed up.

Here's the definition for the metric that KubeCPUOvercommit relies on.

    - expr: |
        sum by (namespace, label_name) (
          sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} and on(pod) kube_pod_status_scheduled{condition="true"}) by (namespace, pod)                                                                                                                  
        * on (namespace, pod) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )
      record: namespace_name:kube_pod_container_resource_requests_cpu_cores:sum

For the first issue, I believe kube_pod_container_resource_requests_cpu_cores will continue showing resource requests, and kube_pod_status_scheduled{condition="true"} will have a value of 1, for any pod associated with a finished job. Perhaps we can join on another metric like kube_pod_status_phase{phase=~"Running|Pending"} instead (see the sketch below).
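
A hedged sketch of what that change could look like; only the inner join differs from the definition above:

        sum by (namespace, label_name) (
          sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}
              and on(namespace, pod) max by (namespace, pod) (kube_pod_status_phase{phase=~"Running|Pending"} == 1)
          ) by (namespace, pod)
        * on (namespace, pod) group_left(label_name)
          label_replace(kube_pod_labels{job="kube-state-metrics"}, "pod_name", "$1", "pod", "(.*)")
        )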

KubeAPIErrorsHigh flapping

The KubeAPIErrorsHigh alert quite often jumps to 100% for us, even though the error count is quite small. There seems to be a label mismatch between 5xx and 2xx return codes (compare the screenshot), and the alert does not ignore the code label. Thus every time an error happens this jumps to 100% for us:
[screenshot attached: 5xx vs 2xx label mismatch, 2019-01-28]

In this example the alerting jumped to 100% errors, even though the actual error percentage was <1%

Grafana dashboard "K8s / Compute Resources / Pod" should report real memory usage without cache

What did you do?
Deployed prometheus-operator including the built-in Grafana instance and the example dashboards.

What did you expect to see?
Useful memory metrics to monitor, debug and tune pod memory usage.

What did you see instead? Under which circumstances?
Pod memory is reported including caches, which can go up and down with available system memory and is not useful at all for the mentioned purposes.

The pod memory graph should clearly separate between memory that the pod uses and memory that can be reclaimed at any time. This is particularly relevant, as there is also a metric in the dashboard that counts use versus limit. This metric will be mostly useless for deployment tuning, unless cache memory is subtracted first.

It would be best if the graph and/or the memory quota table below clearly separated total usage from cache, perhaps with a stacked graph. At the very least, it should be made clear that the shown container_memory_usage_bytes metric also includes container_memory_cache, similar to how the Linux command free reports buffer/cache and freely available memory separately.

It was also suggested that container_memory_rss be used, but I think usage - cache is what's relevant for limits.
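
A hedged sketch of a panel query along those lines (not the dashboard's current query; $namespace and $pod are assumed to be the dashboard's template variables):

    sum by (container_name) (
        container_memory_usage_bytes{namespace="$namespace", pod_name="$pod", container_name!=""}
      - container_memory_cache{namespace="$namespace", pod_name="$pod", container_name!=""}
    )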

Environment

  • Prometheus Operator version:
v2.5.0

Cannot deploy using ksonnet

I followed the steps in README.md to create an application and default environment. I then installed prometheus-ksonnet by running jb install github.com/kausalco/public/prometheus-ksonnet and made the suggested changes to main.jsonnet, but when I run ks apply default I get this error:

ERROR find objects: C:\tmp\use-monitoring\vendor/prometheus-ksonnet/lib/nginx.libsonnet:26:21 Text block not terminated with |||

      'nginx.conf': |||

I'm using the latest release of ksonnet I found on GitHub (0.12.0).
I'm running this from a Windows machine.

Generic storage specific alerts

The following alerts are being proposed at the generic storage level.

PVCHardLimitNearingFull - warning (80%), critical (90%)

maps to requests.storage (Across all persistent volume claims, the sum of storage requests cannot exceed this value) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota.
As this is cluster-wide, having some kind of alert to indicate when PVC storage requests (capacity) are running out is important, so that action (e.g. adding capacity, reclaiming space, etc.) can be taken. Filling up to maximum capacity is usually not a good idea, as this can lead to undesirable situations, e.g. performance degradation, instability, etc. If the underlying storage is on AWS or any public cloud provider, this usually means expanding the underlying volume (and possibly restarting some instances), reclaiming space, or some other data offload technique. The same is true for Gluster (OCS) and Ceph. For on-prem storage subsystems, this may mean ordering additional disks to support the expansion, as well as a procurement process (which may or may not be applicable in the public cloud). Note: a single OCP cluster typically involves multiple storage subsystems, and in this particular scenario that could mean expansion in one or more of them.
80% utilization is meant as an early warning to start taking action to prevent severe issues
90% utilization is much more severe/critical requiring more immediate action by the admin/operator.
The alert name "PVCHardLimitNearingFull" is suggested, with the words "hard limit", because "requests" is confusing to users, and the CPU and memory quota terminology differs from the persistent storage quota terminology (though the ephemeral storage quota terminology seems more aligned with the CPU and memory quota terminology).
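
A hedged sketch of the warning-level version of this alert, based on the kube_resourcequota metric from kube-state-metrics (not part of the mixin; thresholds as proposed above):

    {
      alert: 'PVCHardLimitNearingFull',
      expr: |||
        kube_resourcequota{job="kube-state-metrics", resource="requests.storage", type="used"}
          / ignoring(type)
        kube_resourcequota{job="kube-state-metrics", resource="requests.storage", type="hard"}
          > 0.80
      |||,
      'for': '15m',
      labels: {
        severity: 'warning',
      },
      annotations: {
        message: 'Storage requests in namespace {{ $labels.namespace }} are at {{ $value }} of the requests.storage hard quota.',
      },
    },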

StorageClass.PVCHardLimitNearingFull - warning (80%), critical (90%)

maps to .storageclass.storage.k8s.io/requests.storage (Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota
This is similar to requests.storage, but differs in that it is in the context of a storage class. As this is tied to a single storage provisioner, which is basically either underlying storage on a public cloud provider or a storage subsystem (e.g. Gluster/OCS, Ceph, AWS EBS, etc.), once again the admin has to take action:
expand the storage (which may or may not be a disruptive operation) and go through a procurement process (if applicable), or
figure out ways to offload the existing storage (reclamation, archiving, deleting data, migrating to something bigger, etc.). If data is being offloaded, once again, the admin has to communicate with the users to let them know or have the users take action.

StorageClass.PVCCountNearingFull - warning (80%), critical (90%)

maps to .storageclass.storage.k8s.io/persistentvolumeclaims (Across all persistent volume claims associated with the storage-class-name, the total number of persistent volume claims that can exist in the namespace) storage resource quota as specified in https://kubernetes.io/docs/concepts/policy/resource-quotas/#enabling-resource-quota
This is less worrying but nevertheless still relevant as this refers to the count of PVCs. If one runs out, the user will be unable to make requests.
The 80% is just a warning to the admin/operator to either increase the allotted number, look into reclamation (if not automatic), or ask users to remove unneeded PVCs.
90% just means it's more urgent, and there is a higher likelihood that the developer/consumer will experience issues requesting storage if it is not addressed.

Namespace.PVCCountNearingFull - warning (80%), critical (90%).

This maps to persistentvolumeclaims

Namespace.EphemeralStorageLimitNearingFull - warning (80%), critical (90%)

This maps to limits.ephemeral-storage.

NodeDiskRunningFull

This should apply to any node (not just "Device of node-exporter Namespace/Pod") and indicate when it will be full.

This relates to filesystems (see https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/manifests/prometheus-rules.yaml), though the alert label says it’s about a disk (which I found confusing).

For filesystems, utilization beyond 90% is usually not good, but the suggestion is to keep the threshold at 85% like the existing alert, since this should kick in after kubelet garbage collection, which starts somewhere around 80-85% (default per https://kubernetes.io/docs/concepts/cluster-administration/kubelet-garbage-collection/#container-collection).
