
Comments (35)

uhthomas commented on August 24, 2024

Again, appreciate your help. Enjoy your winter break @dotdc! 😄

uhthomas commented on August 24, 2024

I don't really understand why.

[screenshot]

uhthomas commented on August 24, 2024

Looks normal when switching the query to avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

[screenshot]

versus

[screenshot]
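
For reference, the main-branch query this is being compared against (quoted in full later in the thread) is:

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance)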

dotdc commented on August 24, 2024

Hi @uhthomas,
Thank you for opening this issue.
The values are quite different; did you compare them with other system tools or with the metrics from the metrics-server (kubectl top)?

I will need to run some tests before approving your PR.

jkroepke commented on August 24, 2024

Hi,

I also read #80 and saw that the values differ between #80 and the main branch. Based on this, I did some research on how other projects calculate CPU usage:

[screenshot]
Dashboard to test on your own
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.2.2"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "table",
      "name": "Table",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "filterable": true,
            "inspect": false
          },
          "decimals": 3,
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Value"
            },
            "properties": [
              {
                "id": "unit"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 23,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": false
        },
        "frameIndex": 0,
        "showHeader": true
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) by (instance) ",
          "format": "table",
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "main"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "PR"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(irate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node_exporter_full"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "Prometheus Alerts"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\",mode!=\"steal\"}[$__rate_interval]))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "kubernetes-mixin"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "          sum by (instance) (\n            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval])))\n          / ignoring(cpu) group_left\n            count without (cpu, mode) (node_cpu_seconds_total{mode=\"idle\"})\n          )",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin D"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "1 - avg by (instance) (\n sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval]))\n)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin R"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval]))))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin A"
        }
      ],
      "title": "Panel Title",
      "transformations": [
        {
          "id": "merge",
          "options": {}
        },
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true
            },
            "indexByName": {
              "Time": 0,
              "Value #PR": 2,
              "Value #Prometheus Alerts": 4,
              "Value #kubernetes-mixin": 9,
              "Value #main": 8,
              "Value #node-mixin A": 5,
              "Value #node-mixin D": 6,
              "Value #node-mixin R": 7,
              "Value #node_exporter_full": 3,
              "instance": 1
            },
            "renameByName": {}
          }
        }
      ],
      "type": "table"
    }
  ],
  "refresh": "",
  "schemaVersion": 38,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "CPU test",
  "uid": "f551a6d1-ff6e-45b1-a7a0-84cf70124b75",
  "version": 3,
  "weekStart": ""
}

main branch

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance) 

PR

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

node_exporter full dashboard is using

avg(irate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by(instance)

Awesome Prometheus alerts:

sum by (cluster, instance) (avg by (mode, cluster, instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

kubernetes-mixin

Note: 100% = 1 Core

sum by (cluster,instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval]))
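
Side note: to express that as a 0-1 fraction of node capacity instead of busy cores, one option would be to divide by the per-instance core count (a rough sketch, assuming one idle series per core per instance):

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval])) / count by (instance) (node_cpu_seconds_total{mode="idle"})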

node-mixin (official node_exporter)

recording rule

1 - avg without (cpu) (
 sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))
)

alerting rule

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[2m])))

dashboard

          sum by (instance) (
            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval])))
          / ignoring(cpu) group_left
            count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
          )

I could only do a test with a small subset of nodes:

[screenshot]
% k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   251m         13%    3573Mi          28%       
aks-opsstack-21479518-vmss000001   600m         31%    7863Mi          63% 

But the values from #80 are different compared to kubectl top.


uhthomas commented on August 24, 2024

Thanks for the comprehensive write-up @jkroepke! Given that most dashboards use mode!="idle", it looks like this change is probably the right thing to do then? It's also what Tigera recommends, as linked in the PR.

With respect to kubectl top, I think it's known that these values can differ. I believe it's just the different time intervals and the way the metrics are collected?

jkroepke commented on August 24, 2024

At least the queries from #81 give me incorrect results

PR

[screenshot]

vs

# k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   232m         12%    3619Mi          29%       
aks-opsstack-21479518-vmss000001   613m         32%    7920Mi          64%   

My earlier post was more of a brain dump of my findings. Maybe #81 is being interpreted wrongly, and to solve the issue from @uhthomas, I figured out some alternatives.

Since only he has the issue, he should test some of the queries.

uhthomas commented on August 24, 2024

I am not sure your evaluation is fair. They are taking measurements at different points in time, which does not mean the query in #81 is incorrect. If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

uhthomas commented on August 24, 2024

I think the numbers look a bit weird because they are averaged, maybe not properly. The usage across cores varies quite widely:

[screenshot]

Averaged by core:

[screenshot]

Averaged by all:

[screenshot]
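
In query terms, the two aggregations above are roughly (exact label names depend on the setup):

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

versus

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))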

jkroepke commented on August 24, 2024

They are taking measurements at different points in time

Compared to kubectl top, yes.

But the queries in the dashboard are based on the same datapoints. The dashboard on the main branch shows me 30%-35% CPU usage, which is way more than the 4.2% reported by #81. The CPU on this system has had a constant usage between 30% and 35% for hours; 4% is not possible. This is why I consider the query incorrect.

If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I would say 5 of the 8 queries are true-ish. All values are based on the exact same datapoints from different instances.

uhthomas commented on August 24, 2024

@jkroepke I do see what you mean. If you read my previous comment, it may be best to calculate average CPU usage by core (avg(sum by (cpu) (...))). I can make this change in the PR and it should be more accurate.

The original query vs the query avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))):

[screenshot]

The other option is to change the graph to measure different CPU modes, or even cores? That wasn't really its intent though I guess.

jkroepke commented on August 24, 2024

If you read my previous comment, it may be best to calculate average CPU usage by core ...

Yeah, on the Averaged by all dashboard, you do avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

This means an average across all CPU modes. If you have user at 30%, system at 0% and iowait at 0%, you end up with 30/3 = 10% CPU usage.

The query was used on the node dashboard.
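
To make the difference concrete, a rough per-instance sketch of the two aggregations would be

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

which averages every non-idle mode on every core (diluting the busy modes), versus

avg by (instance) (sum by (instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

which first sums the non-idle modes per core and only then averages across cores.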

uhthomas commented on August 24, 2024

@jkroepke Agree. Would you be able to test the most recent changes in #81?

jkroepke commented on August 24, 2024

It looks much better now. 👍

jkroepke commented on August 24, 2024

FYI: While doing some research on the queries, I found prometheus/node_exporter#2194.

I created a separate issue for this: #86

dotdc commented on August 24, 2024

This is an interesting discussion. I didn't have time to run benchmarks, but it looks promising!

I tried the latest version and still see a big difference in the resulting values on my side (~3x).
I will need to do a deep dive to make sure we get it right (most probably in January).

jkroepke commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

dotdc commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.
I'll need to check/compare to find which query is closest to reality.

uhthomas commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.

I'll need to check/compare to find which query is closest to reality.

Would you be able to attach a screenshot? The amended query should be more accurate.

dotdc commented on August 24, 2024

[screenshot]

As you can see, the values are quite different (at least on my side).
Comparing the results with trusted system tools or software can help, I think.
I'm pretty sure I did that a long time ago, and it looked good to me, but maybe it's wrong...

We should definitely take the time to get this right.

PS: I don't think I will have time to look further before January 🎄 🥳

uhthomas commented on August 24, 2024

Thanks for your help @dotdc - that is interesting. I would be eager to see the individual values for the different modes on your cores. I wonder if the system is busy in the other idle-like states, iowait and steal?
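
Something along these lines might show it (a rough sketch, similar to the per-mode query further down):

avg by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))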

dotdc commented on August 24, 2024

Something like this?

[screenshot]

uhthomas commented on August 24, 2024

Yes, exactly, but maybe with distinct colours for the values?

dotdc commented on August 24, 2024

This is the best I can do right now:

[screenshot]

dotdc commented on August 24, 2024

Thanks, you too!

jkroepke commented on August 24, 2024

This is the best I can do right now:

[screenshot]

Could you please redo this without excluding iowait and steal? Or better: the same graph, but with only those two modes. Are their values worth mentioning?

uhthomas commented on August 24, 2024

This query could possibly be helpful? If a lot of CPU time is spent on steal or iowait, then it would make sense that comparing against just idle would produce a wildly different graph.

avg by (mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[$__rate_interval]))

[screenshot]

I have a feeling the new query is a more accurate representation of actual CPU usage - but the graphs shown here look very suspect.

The same graph on my cluster also shows discrepancies (as expected), but not to such huge degrees.

(new on bottom)

[screenshot]

uhthomas commented on August 24, 2024

Please also see these dashboards side-by-side. The first one is the original, the second matches everything but "idle", and the third is the current query, which matches everything but "idle", "iowait" and "steal". The final graph shows CPU usage across the whole cluster by namespace. There is a clear discrepancy, and the third graph seems the most accurate to me.

[screenshot]

For context, there are 20 allocatable CPUs on the cluster. I do not see how 50% utilisation could ever make sense.

[screenshot]

The following is the same as the original image, but with stacked CPU usage to demonstrate that a value of 50% is unrealistic.

[screenshot]

uhthomas commented on August 24, 2024

This final image may also be helpful. It shows there was a spike in iowait, which is not currently accounted for.

[screenshot]

dotdc commented on August 24, 2024

Hi @uhthomas,

I ran a limited number of additional tests this morning.
Your query is good on the nodes dashboard, but the differences from my previous screenshots remain on the global view.

I've managed to get closer to your values by dividing the result by the number of nodes.

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval]))) / count(count by (node) (kube_node_info{cluster="$cluster"}))

This should work for clusters that have homogeneous node flavors across node pools, but I have concerns about clusters with heterogeneous node pools/flavors.

Could you double-check this on your setup?
Also, do you have a cluster with different node flavors to see how this behaves?

Screenshot:
[screenshot]

uhthomas commented on August 24, 2024

@dotdc I am currently running a single-node Kubernetes cluster, so I was not aware of this limitation. I imagine what's happening here is that it should be sum by (node, cpu). I can fix this when I get back later, which should resolve the issue you're seeing 😄
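
A rough sketch of what the adjusted query might look like, reusing the labels from your query above (whether the per-node label is node or instance depends on the setup):

avg(sum by (node, cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval])))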

uhthomas commented on August 24, 2024

Would you be able to test it for me in the meantime?

uhthomas commented on August 24, 2024

I've updated the PR @dotdc

dotdc commented on August 24, 2024

This was great, thank you both @uhthomas & @jkroepke !

dotdc commented on August 24, 2024

🎉 This issue has been resolved in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
