
grafana-dashboards-kubernetes's Introduction

grafana-dashboards-kubernetes

logo


Description

This repository contains a modern set of Grafana dashboards for Kubernetes.
They are inspired by many other dashboards from kubernetes-mixin and grafana.com.

More information about them is available in my article: A set of modern Grafana dashboards for Kubernetes

You can also download them on Grafana.com.

Releases

This repository follows semantic versioning for releases.
It relies on conventional commits to automate releases using semantic-release.
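
For illustration, with the default semantic-release rules, commit messages along these lines (the scopes are hypothetical examples, not actual commits from this repository) map to release types:

fix(k8s-views-nodes): correct a broken CPU query        -> patch release
feat(k8s-views-pods): add a network bandwidth panel     -> minor release
feat!: drop support for older Grafana versions          -> major release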

Features

These dashboards are made and tested for the kube-prometheus-stack chart, but they should work well with other setups as long as you have kube-state-metrics and prometheus-node-exporter installed on your Kubernetes cluster.

They are not backward compatible with older Grafana versions because they take advantage of Grafana's newer features.

They also have a Prometheus Datasource variable so they will work on a federated Grafana instance.

As an example, here's what the Kubernetes / Views / Global dashboard looks like:

screenshot

Dashboards

File name Description Screenshot
k8s-addons-prometheus.json Dashboard for Prometheus. LINK
k8s-addons-trivy-operator.json Dashboard for the Trivy Operator from Aqua Security. LINK
k8s-system-api-server.json Dashboard for the API Server Kubernetes component. LINK
k8s-system-coredns.json Dashboard for the CoreDNS Kubernetes component. LINK
k8s-views-global.json Global level view dashboard for Kubernetes. LINK
k8s-views-namespaces.json Namespaces level view dashboard for Kubernetes. LINK
k8s-views-nodes.json Nodes level view dashboard for Kubernetes. LINK
k8s-views-pods.json Pods level view dashboard for Kubernetes. LINK

Installation

In most cases, you will need to clone this repository (or your fork):

git clone https://github.com/dotdc/grafana-dashboards-kubernetes.git
cd grafana-dashboards-kubernetes

If you plan to deploy these dashboards using ArgoCD, ConfigMaps or Terraform, you will also need to enable and configure the dashboards sidecar on the Grafana Helm chart to get the dashboards loaded in your Grafana instance:

# kube-prometheus-stack values
grafana:
  sidecar:
    dashboards:
      enabled: true
      defaultFolderName: "General"
      label: grafana_dashboard
      labelValue: "1"
      folderAnnotation: grafana_folder
      searchNamespace: ALL
      provider:
        foldersFromFilesStructure: true
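
With this sidecar configuration, any ConfigMap carrying the matching label is picked up and loaded as a dashboard. A minimal sketch of such a ConfigMap (the name and folder are illustrative assumptions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-views-global            # illustrative name
  labels:
    grafana_dashboard: "1"          # must match label/labelValue above
  annotations:
    grafana_folder: "Kubernetes"    # folder read via folderAnnotation above
data:
  k8s-views-global.json: |-
    { "title": "Kubernetes / Views / Global", "...": "dashboard JSON" }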

Install manually

On the WebUI of your Grafana instance, put your mouse over the + sign on the left menu, then click on Import.
Once you are on the Import page, you can upload the JSON files one by one from your local copy using the Upload JSON file button.

Install via grafana.com

On the WebUI of your Grafana instance, put your mouse over the + sign on the left menu, then click on Import.
Once you are on the Import page, you can put the grafana.com dashboard ID (see table below) under Import via grafana.com then click on the Load button. Repeat for each dashboard.

Grafana.com dashboard ID list:

Dashboard ID
k8s-addons-prometheus.json 19105
k8s-addons-trivy-operator.json 16337
k8s-system-api-server.json 15761
k8s-system-coredns.json 15762
k8s-views-global.json 15757
k8s-views-namespaces.json 15758
k8s-views-nodes.json 15759
k8s-views-pods.json 15760
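
If you deploy Grafana with the official Helm chart, these IDs can also be referenced from the chart's dashboards values with gnetId instead of raw URLs, combined with the dashboardProviders block shown in Install with Helm values. A hedged sketch, assuming a chart version that supports gnetId (the revision pin is an assumption, check the available revisions on Grafana.com):

grafana:
  dashboards:
    grafana-dashboards-kubernetes:
      k8s-views-global:
        gnetId: 15757          # ID from the table above
        revision: 1            # assumption: pin the revision you want
        datasource: Prometheus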

Install with ArgoCD

If you are using ArgoCD, this will deploy the dashboards in the default project of ArgoCD:

kubectl apply -f argocd-app.yml

You will also need to enable and configure the Grafana dashboards sidecar as described in Installation.
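
For reference, the Application created by argocd-app.yml is roughly of this shape (a hedged sketch, not the exact file shipped in the repository; the target revision and destination namespace are assumptions):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grafana-dashboards-kubernetes
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/dotdc/grafana-dashboards-kubernetes.git
    targetRevision: master            # assumption
    path: .                           # kustomization.yaml at the repository root
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring             # assumption: where the ConfigMaps should land
  syncPolicy:
    automated: {}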

Install with Helm values

If you use the official Grafana helm chart or kube-prometheus-stack, you can install the dashboards directly using the dashboardProviders & dashboards helm chart values.

Depending on your setup, add or merge the following example block into your Helm chart values.
The example is for kube-prometheus-stack; for the official Grafana Helm chart, remove the first line (grafana:) and reduce the indentation of the entire block by one level.

grafana:
  # Provision grafana-dashboards-kubernetes
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'grafana-dashboards-kubernetes'
        orgId: 1
        folder: 'Kubernetes'
        type: file
        disableDeletion: true
        editable: true
        options:
          path: /var/lib/grafana/dashboards/grafana-dashboards-kubernetes
  dashboards:
    grafana-dashboards-kubernetes:
      k8s-system-api-server:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-api-server.json
        token: ''
      k8s-system-coredns:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-system-coredns.json
        token: ''
      k8s-views-global:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-global.json
        token: ''
      k8s-views-namespaces:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-namespaces.json
        token: ''
      k8s-views-nodes:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
        token: ''
      k8s-views-pods:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-pods.json
        token: ''

Install as ConfigMaps

Grafana dashboards can be provisioned as Kubernetes ConfigMaps if you configure the dashboard sidecar available on the official Grafana Helm Chart.

To build the ConfigMaps and output them on STDOUT:

kubectl kustomize .

Note: no namespace is set by default, you can change that in the kustomization.yaml file.

To build and deploy them directly on your Kubernetes cluster:

kubectl apply -k . -n monitoring

You will also need to enable and configure the Grafana dashboards sidecar as described in Installation.

Note: you can change the namespace if needed.
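
For context, a kustomization.yaml of roughly this shape generates sidecar-compatible ConfigMaps (a minimal sketch, not the repository's exact file; the labels and annotations assume the sidecar configuration shown in Installation):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# namespace: monitoring              # optionally set a namespace here
generatorOptions:
  disableNameSuffixHash: true
  labels:
    grafana_dashboard: "1"           # picked up by the dashboards sidecar
  annotations:
    grafana_folder: "Kubernetes"     # folder read via folderAnnotation
configMapGenerator:
  - name: dashboards-k8s-views-global
    files:
      - dashboards/k8s-views-global.json
  - name: dashboards-k8s-views-pods
    files:
      - dashboards/k8s-views-pods.json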

Install as ConfigMaps with Terraform

If you use Terraform to provision your Kubernetes resources, you can convert the generated ConfigMaps to Terraform code using tfk8s.

To build and convert ConfigMaps to Terraform code:

kubectl kustomize . | tfk8s

You will also need to enable and configure the Grafana dashboards sidecar as described in Installation.

Note: no namespace is set by default, you can change that in the kustomization.yaml file.

Known issue(s)

Broken panels due to a too-high resolution

A user reported in #50 that some panels were broken because the default value of the $resolution variable was too low for their setup. The root cause hasn't been identified precisely, but the reporter was using Grafana Agent and Grafana Mimir. Changing the $resolution variable to a higher value (a lower resolution) will likely solve the issue. To make the fix permanent, you can set the Scrape interval of your Grafana datasource to a value that works for your setup.
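
If you provision the datasource declaratively, the Scrape interval mentioned above corresponds to the Prometheus datasource's timeInterval setting. A hedged sketch (the URL and value are assumptions for your setup):

# Grafana datasource provisioning file
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc:9090   # assumption
    jsonData:
      timeInterval: 1m   # should match (or exceed) your actual scrape interval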

Broken panels on k8s-views-nodes when a node changes its IP address

To make this dashboard more convenient, there's a small variable hack to display node instead of instance. Because of that, some panels could lack data when a node changes its IP address as reported in #102.

No easy fix for this scenario yet, but it should be a corner case for most people. Feel free to reopen the issue if you have ideas to fix this.

Broken panels on k8s-views-nodes due to the nodename label

The k8s-views-nodes dashboard will have many broken panels if the node label from kube_node_info doesn't match the nodename label from node_uname_info.

This situation can happen on certain deployments of the node exporter running inside Kubernetes (e.g. via a DaemonSet), where nodename takes a different value than the node name as understood by the Kubernetes API.

Below are some ways to relabel the metric to force the nodename label to the appropriate value, depending on the way the collection agent is deployed:

Directly through the Prometheus configuration file

Assuming the node exporter job is defined through kubernetes_sd_config, you can take advantage of the internal discovery labels and fix this by adding the following relabeling rule to the job:

# File: prometheus.yaml
scrape_configs:
- job_name: node-exporter
  relabel_configs:
  # Add this
  - action: replace
    source_labels: [__meta_kubernetes_pod_node_name]
    target_label: nodename

Through a ServiceMonitor

If using the Prometheus operator or the Grafana agent in operator mode, the scrape job should instead be configured via a ServiceMonitor that will dynamically edit the Prometheus configuration file. In that case, the relabeling has a slightly different syntax:

# File: service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
# metadata omitted for brevity
spec:
  endpoints:
  - port: http-metrics
    relabelings:
    # Add this
    - action: replace
      sourceLabels: [__meta_kubernetes_pod_node_name]
      targetLabel: nodename

As a convenience, if using the kube-prometheus-stack helm chart, this added rule can be directly specified in your values.yaml:

# File: kube-prometheus-stack-values.yaml
prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
      - action: replace
        sourceLabels: [__meta_kubernetes_pod_node_name]
        targetLabel: nodename

With Grafana Agent Flow mode

The Grafana Agent can bundle its own node_exporter. In that case, relabeling can be done this way:

prometheus.exporter.unix {
}

prometheus.scrape "node_exporter" {
  targets = prometheus.exporter.unix.targets
  forward_to = [prometheus.relabel.node_exporter.receiver]

  job_name = "node-exporter"
}

prometheus.relabel "node_exporter" {
  forward_to = [prometheus.remote_write.sink.receiver]

  rule {
    replacement = env("HOSTNAME")
    target_label = "nodename"
  }

  rule {
    # The default job name is "integrations/node_exporter" and needs to be replaced
    replacement = "node-exporter"
    target_label = "job"
  }
}

The HOSTNAME environment variable is injected by default by the Grafana Agent Helm chart.

Contributing

Feel free to contribute to this project:

  • Give a GitHub ⭐ if you like it
  • Create an Issue to make a feature request, report a bug or share an idea.
  • Create a Pull Request if you want to share code or anything useful to this project.

grafana-dashboards-kubernetes's People

Contributors

alexintech, beliys, chewie, clementnuss, cmergenthaler, danic-git, dotdc, elmariofredo, fcecagno, felipewnp, ffppmm, geekofalltrades, hoangphuocbk, jcpunk, jkroepke, k1rk, kongfei605, marcofranssen, miracle2k, prasadkris, rcattin, reefland, superq, tlemarchand, uhthomas, vladimir-babichev, william-lp


grafana-dashboards-kubernetes's Issues

[bug] default resolution is too low

Describe the bug

The default resolution of 30s is too low and renders some dashboards with "No Data". This is likely because I'm using Grafana Mimir, as opposed to a standard Prometheus install.

image

How to reproduce?

  1. Collect metrics with Grafana Mimir.
  2. Load the dashboard.

Expected behavior

Changing the resolution from 30s to 1m shows the data as expected.

image

Additional context

No response

[bug] Trivy Dashboard Templating Failed to upgrade legacy queries Datasource prometheus was not found

Describe the bug

The Trivy dashboard has been broken since commit 4b52d9c on our clusters.

How to reproduce?

No response

Expected behavior

The dashboard should continue to work as it did on the previous commit. Other dashboards don't seem to have this issue.

Additional context

Is there any chance it is because of the missing cluster label on trivy metrics? Should we configure a specific setting to include this cluster label on the trivy operator?

[enhancement] Add support for monitoring node runtime & system resource usage

Describe the enhancement you'd like

I'd like the nodes dashboard to show the runtime and system resource usage, as exported by the kubelet.

Additional context

This requires that the cAdvisor metrics for cgroup slices aren't being dropped. For this to work with Kube Prometheus Stack the kubelet ServiceMonitor cAdvisorMetricRelabelings value needs to be overridden to keep the required values.

[bug] node dashboard shows no values

Describe the bug

We've deployed kube-prometheus-stack via flux:

flux get helmreleases -n monitoring 
NAME                    REVISION        SUSPENDED       READY   MESSAGE                                                                                                        
kube-prometheus-stack   58.2.2          False           True    Helm upgrade succeeded for release monitoring/kube-prometheus-stack.v6 with chart kube-prometheus-stack@58.2.2
loki-stack              2.10.2          False           True    Helm install succeeded for release monitoring/loki-stack.v1 with chart loki-stack@2.10.2

The Grafana dashboards have been installed with Helm values as described. However, we're not able to see any metrics for the node dashboard despite changing the Helm values:

# File: kube-prometheus-stack-values.yaml
prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
      - action: replace
        sourceLabels: [__meta_kubernetes_pod_node_name]
        targetLabel: nodename

How to reproduce?

No response

Expected behavior

No response

Additional context

kubectl get po -n monitoring 
NAME                                                       READY   STATUS    RESTARTS   AGE
kube-prometheus-stack-grafana-75c985bc44-5g7sm             3/3     Running   0          7m34s
kube-prometheus-stack-kube-state-metrics-c4dbc548d-l5tcl   1/1     Running   0          17m
kube-prometheus-stack-operator-7846887766-98vvj            1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-5x97x       1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-97dbf       1/1     Running   0          17m
kube-prometheus-stack-prometheus-node-exporter-hz4zf       1/1     Running   0          17m
loki-stack-0                                               1/1     Running   0          17m
loki-stack-promtail-bc95r                                  1/1     Running   0          17m
loki-stack-promtail-fpnh9                                  1/1     Running   0          17m
loki-stack-promtail-z64hg                                  1/1     Running   0          17m
prometheus-kube-prometheus-stack-prometheus-0              2/2     Running   0          17m

[bug] node dashboard only shows latest instance

Describe the bug

Some panels are using node to filter, and others are using a hidden instance variable ( label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)). If a node changes its IP, then some panels will look normal and others will be missing data.

image

How to reproduce?

  1. Collect node metrics.
  2. Change IP of node.
  3. Observe the node dashboard has some unaffected panels, and others which only show the latest 'instance'.

Expected behavior

It should probably show all instances of a node.

Additional context

No response

[bug] some panels not displaying correctly on white background

Describe the bug

Hi,

I'm using those dashboards with Grafana using the light theme (easier on my eyes), and some panels are not displaying properly. e.g.:

image

This can be fixed by setting the color mode of the panel to None instead of Value.

Screenshot 2022-09-30 at 07 13 22

How to reproduce?

  1. turn on light mode for Grafana.
  2. check the Kubernetes / Views / Global panel

Expected behavior

the text/values should be readable even with the light theme

Additional context

No response

[bug] created_by variable is not refreshed on Time Range Change

Describe the bug

Hi,
on the "Kubernetes / Views / Namespaces" Dashboard exists a Variable "created_by" that is filled ONLY on dashboard loading. If I change to yesterday, PODs created are not shown. The only thing to be changed is in the variable "properties the refresh from 1 => 2:

        "refresh": 1, // Bug
        "refresh": 2, // Correct Value

Regards Philipp

How to reproduce?

Always

Expected behavior

created_by should be "refilled" on every Time Range Change

Additional context

No response

[bug] CoreDNS Dashboard No Data

Describe the bug

Hi, and thanks for the good set of Dashboards @dotdc !

I'm having some trouble with the CoreDNS dashboard.

Several graphs and statuses don't show any data, displaying the "No Data" placeholder.

I've noticed that the filter for CoreDNS is a job and not a pod.

At least in my EKS, the CoreDNS is a daemonset and not a job.

image

Is there something I could do or change?

Thanks =D !

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] should use last non-null value, rather than mean

Describe the bug

3/4 of these gauges use the mean, rather than the last non-null value. This can cause strangeness like incorrect reporting of current CPU requests and limits. They should also be consistent.

Current:

image

Last *:

image

How to reproduce?

  1. Observe the global view
  2. Change some cpu requests and limits
  3. Observe incorrect reporting of cpu requests and limits

Expected behavior

Should probably use "Last *" rather than "Mean" for calculating the value.
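
For reference, switching such a stat/gauge panel from Mean to the last non-null value is a change to its reduceOptions; a sketch of the relevant panel JSON fragment, assuming a recent Grafana panel schema:

"options": {
  "reduceOptions": {
    "calcs": ["lastNotNull"],
    "fields": "",
    "values": false
  }
}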

Additional context

No response

deployment view

Describe the enhancement you'd like

Currently there are views for:

  • global
  • namespace
  • nodes
  • pods

It would be nice to have a view that would show the status of the deployments (number of replicas, ...)

Additional context

No response

[bug] Global Network Utilization

Describe the bug

On my simple test cluster, I have no issues with the Global Network Utilization, but on my production cluster, which does cluster and host networking, the numbers are crazy:

image

No way I have sustained rates like that. I think this is related to the metric:

sum(rate(container_network_receive_bytes_total[$__rate_interval]))

If I look at rate(container_network_receive_bytes_total[30s]), I get:

{id="/", interface="cni0", job="kubernetes-cadvisor"} | 2041725438.15131
{id="/", interface="enp1s0", job="kubernetes-cadvisor"} | 4821605692.45648
{id="/", interface="flannel.1", job="kubernetes-cadvisor"} | 337125370.2678834

I'm not sure what to actually look at here. I tried sum(rate(node_network_receive_bytes_total[$__rate_interval])) and I get a reasonable traffic graph:

image

This is 5 nodes, pretty much at idle. Showing I/O by instance:

image

Here is BTOP+ on k3s01 running for a bit; it lines up very well with the data above:
image

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] incorrect node count with grafana agent

Describe the bug

The metric up{job="node-exporter"} does not exist with the Grafana Agent, so the total number of nodes is reported as 0.

image

How to reproduce?

  1. Use Grafana Agent
  2. Load dashboard

Expected behavior

Should show the total number of nodes (5 in this case).

Additional context

No response

[enhancement] Windows support

Describe the enhancement you'd like

I have some clusters with Windows nodes enabled. I would like to ask if I can add Windows support, or do you think it's out of scope here?

Unlike kubernetes-mixin, which has a separate dashboard, I would like to add the Windows queries to the existing ones. That's possible by using queries with OR, e.g.:

sum(container_memory_working_set_bytes{cluster="$cluster",namespace=~"$namespace", image!="", pod=~"${created_by}.*"}) by (pod)
OR
<WINDOWS Query>

Additional context

Since I'm running hybrid multi-OS clusters, I would like to submit PRs for Windows pods here. I'm not expecting the maintainers to provide support for Windows. Before starting the work, I would like to know if it would be accepted.

[bug] CPU dashboard can report negative values

Describe the bug

image

How to reproduce?

I don't know

Expected behavior

The dashboard should not produce negative CPU usage values.

Additional context

I adjusted some resource limits, which caused some pods to restart.

[bug] broken panels on k8s-views-nodes in specific cases

Describe the bug

The k8s-views-nodes.json dashboard will have many broken panels in specific Kubernetes setups.
This is currently the case on OKE.

Apparently, this happens when the node label from kube_node_info doesn't match the nodename label from node_uname_info.

Here are some extracted metrics from a broken setup where the labels differ.

TL;DR: node="k8s-wrk-002" and nodename="kind-kube-prometheus-stack-worker2".

kube_node_info:

{
    __name__="kube_node_info",
    container="kube-state-metrics",
    container_runtime_version="containerd://1.6.19-46-g941215f49",
    endpoint="http", 
    instance="10.27.3.148:8080", 
    internal_ip="172.18.0.2", 
    job="kube-state-metrics", 
    kernel_version="6.2.12-arch1-1", 
    kubelet_version="v1.26.3", 
    kubeproxy_version="v1.26.3", 
    namespace="monitoring",
    node="k8s-wrk-002",
    os_image="Ubuntu 22.04.2 LTS",
    pod="kube-prometheus-stack-kube-state-metrics-6df68756d8-zvd58",
    pod_cidr="10.27.1.0/24",
    provider_id="kind://docker/kind-kube-prometheus-stack/kind-kube-prometheus-stack-worker2", 
    service="kube-prometheus-stack-kube-state-metrics", 
    system_uuid="8422f117-6154-45bd-97c0-e3dec80a3f60"
}

node_uname_info:

{
    __name__="node_uname_info", 
    container="node-exporter", 
    domainname="(none)", 
    endpoint="http-metrics", 
    instance="172.18.0.2:9100", 
    job="node-exporter", 
    machine="x86_64", 
    namespace="monitoring", 
    nodename="kind-kube-prometheus-stack-worker2", 
    pod="kube-prometheus-stack-prometheus-node-exporter-qvn22", 
    release="6.2.12-arch1-1", 
    service="kube-prometheus-stack-prometheus-node-exporter", 
    sysname="Linux", 
    version="#1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000"
}

This issue will continue the discussion started in #41

@fcecagno @Chewie

How to reproduce?

You can use https://github.com/dotdc/kind-lab, which will create a kind cluster with renamed nodes.

# Create the kind cluster
./start.sh

# Export configuration
export KUBECONFIG="$(pwd)/kind-kubeconfig.yml"

# Expose Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

Open http://localhost:3000

login: admin
password: prom-operator

Open broken dashboard:

http://localhost:3000/d/k8s_views_nodes/kubernetes-views-nodes?orgId=1&refresh=30s

Expected behavior

The dashboard should work with a relabel_configs rule like the one suggested by @Chewie.
The solution should be described in https://github.com/dotdc/grafana-dashboards-kubernetes#known-issues

Additional context

No response

[bug] exclude iowait, steal, idle from CPU usage

Describe the bug

Based on

The CPU modes idle, iowait, steal should be excluded from the CPU utilization.

How to reproduce?

No response

Expected behavior

No response

Additional context

Per the iostat man page:

%idle
Show the percentage of time that the CPU or CPUs were idle and the
system did not have an outstanding disk I/O request.

%iowait
Show the percentage of time that the CPU or CPUs were idle during
which the system had an outstanding disk I/O request.

%steal
Show the percentage of time spent in involuntary wait by the
virtual CPU or CPUs while the hypervisor was servicing another
virtual processor.

[bug] Fix node_* metrics on k8s-views-global.json

Describe the bug

Currently, there is no job label selector in k8s-views-global.json.

History:

  • A hardcoded job label was set to node-exporter in #36 for node_* metrics
  • They were later removed in #49 to work with the Grafana Agent

Adding a job variable for node_* metrics should fix the issue.

@uhthomas @tlemarchand Can you both try the version in #110 to make sure it works on your side?

[bug] Failed to display node metrics

Describe the bug

This is the way variables are configured on k8s-views-nodes.json:

...
node = label_values(kube_node_info, node)
instance = label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)

In OKE, kube_node_info looks like this:

{__name__="kube_node_info", container="kube-state-metrics", container_runtime_version="cri-o://1.25.1-111.el7", endpoint="http", instance="10.244.0.40:8080", internal_ip="10.0.107.39", job="kube-state-metrics", kernel_version="5.4.17-2136.314.6.2.el7uek.x86_64", kubelet_version="v1.25.4", kubeproxy_version="v1.25.4", namespace="monitoring", node="10.0.107.39", os_image="Oracle Linux Server 7.9", pod="monitoring-kube-state-metrics-6fcd4d745c-txg2k", pod_cidr="10.244.1.0/25", provider_id="ocid1.instance.oc1.sa-saopaulo-1.xxx", service="monitoring-kube-state-metrics", system_uuid="d6462364-95bf-4122-a3ab-xxx"}

And node_uname_info looks like this:

node_uname_info{container="node-exporter", domainname="(none)", endpoint="http-metrics", instance="10.0.107.39:9100", job="node-exporter", machine="x86_64", namespace="monitoring", nodename="oke-cq2bxmvtqca-nsdfwre7l3a-seqv6owhq3a-0", pod="monitoring-prometheus-node-exporter-n6pzv", release="5.4.17-2136.314.6.2.el7uek.x86_64", service="monitoring-prometheus-node-exporter", sysname="Linux", version="#2 SMP Fri Dec 9 17:35:27 PST 2022"}

For this example, node=10.0.107.39, but when I query node_uname_info{nodename=~"(?i:($node))"}, it doesn't return anything, because nodename doesn't match the internal IP address of the node.
As a result, no node metrics are displayed.

How to reproduce?

No response

Expected behavior

No response

Additional context

Modifying the filter https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-nodes.json#L3747-L3772 to use node_uname_info{instance="$node:9100"} fixes the issue.

Some metrics are missing.

Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes LENS). Reviewing the JSON queries, they reference attributes or keys that are not included in the cAdvisor metrics (that I have). For example, your Global dashboard:

grafana_missing_metrics

When I look at the CPU Utilization by namespace and inspect the JSON query, it is based on container_cpu_usage_seconds_total. When I look in my Prometheus, it does not have an image label; here is a random one that was at the top of the query:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}

I'm using K3s based on Kubernetes 1.23 on bare metal with containerd, no Docker runtime. I have no idea if this is a containerd, kubelet, or cAdvisor issue, or just expected as part of life when you don't use the Docker runtime.

If you have any suggestions, they would be much appreciated.

[enhancement] cluster variable support

Thanks for very nice dashboards.

One thing that's missing is maybe a "cluster" variable. With multiple clusters, it is useful to limit the scope to a single cluster: a multi-select variable accepting All, with queries adding cluster=~"$cluster".

[bug] Node metrics names on AWS EKS nodes mismatch

Describe the bug

The metrics for kube_node_info & node_uname_info produce different names for nodes, resulting in the Node dashboard not working.

Eg:

node_uname_info:

  • nodename="ip-10-10-11-100.ec2.internal"

kube_node_info

  • node="ip-10-10-10-110.us-east-2.compute.internal"

Node exporter version: 1.3.1
Kube state metrics version: 2.5.0

I acknowledge this is not a bug in the dashboard itself but rather a result of the naming standards of the different metric exporters.

However, I just wanted to know if other AWS EKS users are experiencing the same issue before I start manually editing the dashboards in an attempt to get them working.

Thanks

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

Question: How should I export the dashboard JSON?

Hi,

I'm preparing #79 and I'm having some trouble exporting the JSON files from a Grafana instance.

If I import a dashboard and export again without any modifications, I get a lot of changes:

For example, this commit does not contain any functional changes, apart from a lot of changes at the JSON level: jkroepke@706315b

This is how I export the JSON:

image

What is the recommended way? If the mentioned approach is the correct one, would it be possible to import and export all dashboards to keep my PR as clean as possible? Otherwise, I'd have tons of unrelated changes.

[bug] Dashboard kubernetes-views-pods shows unexpected values for memory requests / limits

Describe the bug

First of all: amazing dashboards...Thanks a ton :)

The panel "Resources by container" in the "kubernetes-views-pods" uses the metrics
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte"}

Unfortunately, this leads to unexpected values, as the "resource" label in these metrics can have the values "memory" and "ephemeral_storage", and the panel counts them together.

How to reproduce?

No response

Expected behavior

The metrics should probably be:
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}

Additional context

No response

[bug] `kube-prometheus-stack` installation steps broken

This worked for me in the past, but I am building a new k3s cluster and I can't install it with the previous documentation: https://github.com/dotdc/grafana-dashboards-kubernetes#install-with-helm-values.

The error I get is a little specific to me since I am using terraform:

│ Error: unable to build kubernetes objects from release manifest: unable to decode "": json: cannot unmarshal number into Go struct field ObjectMeta.metadata.labels of type string
│
│   with module.monitoring.helm_release.prometheus-stack,
│   on ../modules/monitoring/main.tf line 2, in resource "helm_release" "prometheus-stack":
│    2: resource "helm_release" "prometheus-stack" {

I tried to read https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml, but it looks like a few things have changed.

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

Total pod RAM request usage & Total pod RAM limit usage gauges are showing wrong values

Describe the bug

First of all, I want to thank you for your effort in creating these amazing Grafana dashboards for K8s. I have deployed the Prometheus Helm chart stack and passed the dashboard provider values in values.yaml, and everything went smoothly except for one issue I am facing in /kubernetes/views/pods: the Total pod RAM request usage and Total pod RAM limit usage gauges are showing wrong values, as you can see in the screenshots below. I wonder if someone can help me fix it.

image

image

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

All dashboards with the cluster variable are broken in VictoriaMetrics [bug]

Describe the bug

Popup message in Grafana when opening dashboards:

Templating
Failed to upgrade legacy queries Datasource prometheus was not found

The previous version was working fine.

How to reproduce?

Install VictoriaMetrics as the Prometheus datasource and try to open the namespace dashboard.

Expected behavior

Dashboards work correctly.

Additional context

No response

Issues with node_cpu_seconds_total

I tested the latest changes, and it's still not right...

Panel CPU Utilization by Node "expr": "avg by (node) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))" yields:

image

It seems to be the total of all nodes? It is not picking up the multiple nodes. It should look like:
image

Panel CPU Utilization by namespace is still dark and using the old metric: "expr": "sum(rate(container_cpu_usage_seconds_total{image!=\"\"}[$__rate_interval])) by (namespace)". I did try something like the above, "avg by (namespace) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))", but that is not right; I only got one namespace listed:

image

Both Memory Utilization Panels are still dark based on container_memory_working_set_bytes when I use your unmodified files.

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/trivy

Describe the bug

I am using the dashboard (grafana-dashboards-kubernetes/dashboards/trivy), but I am not getting any values for 'CVE vulnerabilities in All namespace(s)' and 'Other vulnerabilities in All namespace(s)'. I have enabled OPERATOR_METRICS_VULN_ID_ENABLED=true in my Trivy deployment and I am using the latest versions of the Trivy Operator and Prometheus. Could you please help?

How to reproduce?

1. Install the latest trivy-operator and try to use the Grafana dashboard

Expected behavior

Show CVE values.

Additional context

No response

[bug] Wrong query on the Network - Bandwidth panel

Describe the bug

On the Kubernetes / Views / Pods dashboard, the Network - Bandwidth panel uses the wrong query for Transmitted.

It is

- sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

Should be

- sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

How to reproduce?

No response

Expected behavior

No response

Additional context

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-pods.json#L1417

[bug] Namespace dashboard shows double resource usage

Describe the bug

The cumulative resource usage in the namespace seems to be 1.25 cpu and 2.5Gi (I changed the two graphs to stack), but it appears as 2.5 cpu and 5Gi respectively.

image

I imagine the queries need the label selector image!="".

How to reproduce?

N/A

Expected behavior

N/A

Additional context

N/A

Publish tag to make update automation possible

Describe the enhancement you'd like

As a Renovate user (but this applies to all similar tools), I would like to leverage our system to automatically upgrade our dashboards.

Currently, we have no way to be automatically notified about an update or a change from this project. A solution based on git tags could do the job perfectly.

Tags don't have to be semantic or logical; a simple tag every month is a perfectly valid solution.

Additional context

… nothing specific. Let me know if you have any questions.

PS: Your dashboards are really amazing, thank you for this work!

suggest lower cardinality variables for the pod dashboard [bug]

Describe the bug

When in a cluster with a lot of churn on pods, the high cardinality pod metrics cause queries to fail due to the large number of series returned. For instance, I doubled the max returned label sets in VictoriaMetrics to 60k and queries still fail when trying to use the pod dashboard:

2024-04-22T18:17:33.527Z	warn	VictoriaMetrics/app/vmselect/main.go:231	error in "/api/v1/series?start=1713806220&end=1713809880&match%5B%5D=%7B__name__%3D%22kube_pod_info%22%7D": cannot fetch time series for "filters=[{__name__=\"kube_pod_info\"}], timeRange=[2024-04-22T17:17:00Z..2024-04-22T18:18:00Z]": cannot find metric names: error when searching for metricIDs in the current indexdb: the number of matching timeseries exceeds 60000; either narrow down the search or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits

How to reproduce?

Have a cluster with a lot of pods being created...

Expected behavior

No response

Additional context

I have a fix suggestion that seems to work fine for me. It involves changing the namespace and job queries to not query "all pods" for labels. Like this:

namespace: label_values(kube_namespace_created{cluster="$cluster"},namespace)
job: label_values(kube_pod_info{namespace="$namespace", cluster="$cluster"},job)

Running pods panel in Global dashboard

Currently, "Running Pods" panel uses the expression sum(kube_pod_container_info), which sums the containers, but not the pods. I believe the metric kube_pod_info would be the best for this panel.

Should be updated here:

"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"expr": "sum(kube_pod_container_info)",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Running Pods",
"type": "stat"

P.S. Thank you for the dashboards, they look awesome!

View Pods Dashboard Feature Requests / Issues

RAM Usage Request Gauge
My understanding of requests is that actual usage should closely match them. Being at 90% of the request is not a bad condition; that is a good condition. I think GREEN should be +/- 20% of the request value, the 20% beyond that on either side yellow, and the rest red, as being significantly under or over the request is not ideal. As it is now, if you estimate the request perfectly, it shows RED like an error condition, and that is not the case. Only the LIMIT gauge should be like this (as you get OOM killed).

image
I think that is wrong; being stable at 90% of the request should get me a gold star :)

I'm not sure if CPU Request needs that as well. If so maybe its GREEN range is wider?!?


Resource by container
Could you add the Actual Usage for CPU and Memory between Request/Limits for each? That would be helpful to show where actual is between the two values.
image


I think CPU Usage by container and Memory Usage by Container should be renamed to "by pod", because if you select a pod with multiple containers, you do not get a graph with multiple plot lines, which you would expect if it were by container.


NOTE: I played with adding resource requests and limits as plot lines for CPU Usage by Container and Memory Usage by Container, and it looks good for pods with a single container. But once I selected a pod with multiple containers, and thus multiple requests/limits, it became a confusing mess. I don't have the Grafana skills to isolate them properly, but maybe you have some ideas to make that work right.

[bug] Trivy Operator Dashboard: The Prometheus data source variable is not used everywhere

Describe the bug

There are panels in the Trivy Operator dashboard which do not properly use the Prometheus data source variable.

How to reproduce?

  1. Import the dashboard
  2. Change between Prometheus data sources in the global variable filter
  3. See that the "Vulnerability count per image and severity in $namespace namespace" panel does not pick up the Prometheus data source correctly

Expected behavior

The global Prometheus data source variable should be applied to all panels.

Additional context

Here are the places I spotted where the Prometheus data source variable is not used:

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L785
https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L882

Metrics missing in K8s Environment

I've opened a new issue because this one is not in the k3s environment but k8s.

I see some metrics missing, probably because my installation could be incomplete.
I've deployed the k8s cluster with two master and three worker nodes. Grafana and Prometheus are deployed with "almost" the default settings.

i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n grafana
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
grafana   ClusterIP   <ip>   <none>        80/TCP    18h
i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n prometheus
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
prometheus-alertmanager         ClusterIP   <ip>     <none>        80/TCP     21h
prometheus-kube-state-metrics   ClusterIP   <ip>     <none>        8080/TCP   21h
prometheus-node-exporter        ClusterIP   <ip>     <none>        9100/TCP   21h
prometheus-pushgateway          ClusterIP   <ip>    <none>        9091/TCP   21h
prometheus-server               ClusterIP   <ip>    <none>        80/TCP     21h

I've created the datasource using the prometheus-server IP, and some of the metrics work and some don't:

Screenshot 2022-07-02 at 10 08 38

Screenshot 2022-07-02 at 10 10 16

I'm completely sure that those issues are because of my environment, since I can see that your dashboards work fine, but can you help me troubleshoot?

Thanks,

[bug] "FS - Device Errors" query in Nodes dashboard is not scoped

Describe the bug

In k8s-views-nodes.json, the "FS - Device Errors" query is sum(node_filesystem_device_error) by (mountpoint), which aggregates data from the entire datasource.

How to reproduce?

No response

Expected behavior

{instance="$instance"} should be added to the query.

Additional context

No response
