
Comments (35)

uhthomas commented on August 24, 2024

Again, appreciate your help. Enjoy your winter break @dotdc! 😄

uhthomas commented on August 24, 2024

I don't really understand why.

[screenshot]

uhthomas commented on August 24, 2024

Looks normal when switching the query to avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

[screenshot]

versus

[screenshot]
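
For reference, the main-branch query this is being compared against (quoted in full later in the thread) is:

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance)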

dotdc commented on August 24, 2024

Hi @uhthomas,
Thank you for opening this issue.
The values are quite different; did you compare them with other system tools or with the metrics from the metrics-server (kubectl top)?

I will need to run some tests before approving your PR.

jkroepke commented on August 24, 2024

Hi,

I also read #80 and saw that the values differ between #80 and the main branch. Based on this, I did some research on how other projects calculate CPU usage:

[screenshot]
Dashboard to test on your own
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "description": "",
      "type": "datasource",
      "pluginId": "prometheus",
      "pluginName": "Prometheus"
    }
  ],
  "__elements": {},
  "__requires": [
    {
      "type": "grafana",
      "id": "grafana",
      "name": "Grafana",
      "version": "10.2.2"
    },
    {
      "type": "datasource",
      "id": "prometheus",
      "name": "Prometheus",
      "version": "1.0.0"
    },
    {
      "type": "panel",
      "id": "table",
      "name": "Table",
      "version": ""
    }
  ],
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "${DS_PROMETHEUS}"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "filterable": true,
            "inspect": false
          },
          "decimals": 3,
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          },
          "unit": "percentunit"
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Value"
            },
            "properties": [
              {
                "id": "unit"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 23,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": false
        },
        "frameIndex": 0,
        "showHeader": true
      },
      "pluginVersion": "10.2.2",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) by (instance) ",
          "format": "table",
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "main"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by (instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "PR"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "avg(irate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])) by(instance)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node_exporter_full"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval])))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "Prometheus Alerts"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (rate(node_cpu_seconds_total{mode!=\"idle\",mode!=\"iowait\",mode!=\"steal\"}[$__rate_interval]))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "kubernetes-mixin"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "          sum by (instance) (\n            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval])))\n          / ignoring(cpu) group_left\n            count without (cpu, mode) (node_cpu_seconds_total{mode=\"idle\"})\n          )",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin D"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "1 - avg by (instance) (\n sum without (mode) (rate(node_cpu_seconds_total{mode=~\"idle|iowait|steal\"}[$__rate_interval]))\n)",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin R"
        },
        {
          "datasource": {
            "type": "prometheus",
            "uid": "${DS_PROMETHEUS}"
          },
          "editorMode": "code",
          "exemplar": false,
          "expr": "sum by (instance) (sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!=\"idle\"}[$__rate_interval]))))",
          "format": "table",
          "hide": false,
          "instant": true,
          "legendFormat": "__auto",
          "range": false,
          "refId": "node-mixin A"
        }
      ],
      "title": "Panel Title",
      "transformations": [
        {
          "id": "merge",
          "options": {}
        },
        {
          "id": "organize",
          "options": {
            "excludeByName": {
              "Time": true
            },
            "indexByName": {
              "Time": 0,
              "Value #PR": 2,
              "Value #Prometheus Alerts": 4,
              "Value #kubernetes-mixin": 9,
              "Value #main": 8,
              "Value #node-mixin A": 5,
              "Value #node-mixin D": 6,
              "Value #node-mixin R": 7,
              "Value #node_exporter_full": 3,
              "instance": 1
            },
            "renameByName": {}
          }
        }
      ],
      "type": "table"
    }
  ],
  "refresh": "",
  "schemaVersion": 38,
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-5m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "CPU test",
  "uid": "f551a6d1-ff6e-45b1-a7a0-84cf70124b75",
  "version": 3,
  "weekStart": ""
}

main branch

avg(1-rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance) 

PR

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

node_exporter full dashboard is using

avg(irate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by(instance)

Awesome Prometheus alerts:

sum by (cluster, instance) (avg by (mode, cluster, instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

kubernetes-mixin

Note: 100% = 1 Core

sum by (cluster,instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval]))
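
Side note: to express that as a 0-1 fraction of node capacity instead of busy cores, one option would be to divide by the per-instance core count (a rough sketch, assuming one idle series per core per instance):

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[$__rate_interval])) / count by (instance) (node_cpu_seconds_total{mode="idle"})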

node-mixin (official node_exporter)

recording rule

1 - avg without (cpu) (
 sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval]))
)

alerting rule

sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[2m])))

dashboard

          sum by (instance) (
            (1 - sum without (mode) (rate(node_cpu_seconds_total{mode=~"idle|iowait|steal"}[$__rate_interval])))
          / ignoring(cpu) group_left
            count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
          )

I could only do a test with a small subset of nodes:

[screenshot]
% k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   251m         13%    3573Mi          28%       
aks-opsstack-21479518-vmss000001   600m         31%    7863Mi          63% 

But the values from #80 are different compared to kubectl top.


uhthomas commented on August 24, 2024

Thanks for the comprehensive write-up @jkroepke! Given that most dashboards use mode!="idle", it looks like this change is probably the right thing to do then? It's also what Tigera recommends, as linked in the PR.

With respect to kubectl top, I think it's known that these values can differ. I believe it's just the different time intervals and the way the metrics are collected?

jkroepke commented on August 24, 2024

At least the queries from #81 give me incorrect results

PR

[screenshot]

vs

# k top node
NAME                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
aks-opsstack-21479518-vmss000000   232m         12%    3619Mi          29%       
aks-opsstack-21479518-vmss000001   613m         32%    7920Mi          64%   

My earlier post was more of a brain dump of my findings. Maybe #81 is being interpreted wrongly, and to solve the issue from @uhthomas, I figured out some alternatives.

Since only he has the issue, he should test some of the queries.

uhthomas commented on August 24, 2024

I am not sure your evaluation is fair. They are taking measurements at different points in time, which does not mean the query in #81 is incorrect. If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

uhthomas commented on August 24, 2024

I think the numbers look a bit weird because they are averaged, maybe not properly. The usage across cores varies quite widely:

[screenshot]

Averaged by core:

[screenshot]

Averaged by all:

[screenshot]
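
In query terms, the two aggregations above are roughly (exact label names depend on the setup):

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

versus

avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))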

jkroepke commented on August 24, 2024

They are taking measurements at different points in time

Compared to kubectl top, yes.

But the queries in the dashboard are based on the same datapoints. The dashboard on the main branch shows me 30%-35% CPU usage, which is way more than the 4.2% reported by #81. The CPU on this system has had a constant usage between 30% and 35% for hours; 4% is not possible. This is why I consider the query incorrect.

If it were, then all the dashboards you linked in your initial comment would also be wrong, which I don't believe is true.

I would say 5 of the 8 queries are true-ish. All values are based on the exact same datapoints from different instances.

uhthomas commented on August 24, 2024

@jkroepke I do see what you mean. If you read my previous comment, it may be best to calculate average CPU usage by core (avg(sum by (cpu) (...))). I can make this change in the PR and it should be more accurate.

The original query vs the query avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))):

[screenshot]

The other option is to change the graph to measure different CPU modes, or even cores? That wasn't really its intent though I guess.

jkroepke commented on August 24, 2024

If you read my previous comment, it may be best to calculate average CPU usage by core ...

Yeah, on the Averaged by all dashboard, you do avg(rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])) by (instance)

This means an average across all CPU modes. If you have user at 30%, system at 0% and iowait at 0%, you end up with 30/3 = 10% CPU usage.

The query was used on the node dashboard.
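
To make the difference concrete, a rough per-instance sketch of the two aggregations would be

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))

which averages every non-idle mode on every core (diluting the busy modes), versus

avg by (instance) (sum by (instance, cpu) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval])))

which first sums the non-idle modes per core and only then averages across cores.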

uhthomas commented on August 24, 2024

@jkroepke Agree. Would you be able to test the most recent changes in #81?

jkroepke commented on August 24, 2024

It looks much better now. 👍

jkroepke commented on August 24, 2024

FYI: While doing some research on the queries, I found prometheus/node_exporter#2194.

I created a separate issue for this: #86

dotdc commented on August 24, 2024

This is an interesting discussion. I didn't have time to run benchmarks, but it looks promising!

I tried the latest version and still see a big difference in the resulting values on my side (~3x).
I will need to do a deep dive to make sure we get it right (most probably in January).

jkroepke commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

dotdc commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.
I'll need to check/compare to find which query is closest to reality.

uhthomas commented on August 24, 2024

@dotdc please ensure that you are using the latest version of #81, because some queries were adjusted.

Yes, it was the latest; CPU usage is 3x higher on the new version.

I'll need to check/compare to find which query is closest to reality.

Would you be able to attach a screenshot? The amended query should be more accurate.

dotdc commented on August 24, 2024

[screenshot]

As you can see, the values are quite different (at least on my side).
Comparing the results with trusted system tools or software can help, I think.
I'm pretty sure I did that a long time ago, and it looked good to me, but maybe it's wrong...

We should definitely take the time to get this right.

PS: I don't think I will have time to look further before January 🎄 🥳

uhthomas commented on August 24, 2024

Thanks for your help @dotdc - that is interesting. I would be eager to see the individual values for the different modes on your cores. I wonder if the system is busy in the other idle-like states, iowait and steal?
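
Something along these lines might show it (a rough sketch, similar to the per-mode query further down):

avg by (mode) (rate(node_cpu_seconds_total{mode!="idle"}[$__rate_interval]))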

dotdc commented on August 24, 2024

Something like this?

[screenshot]

uhthomas commented on August 24, 2024

Yes, exactly, but maybe with distinct colours for the values?

dotdc commented on August 24, 2024

This is the best I can do right now:

[screenshot]

dotdc commented on August 24, 2024

Thanks, you too!

jkroepke commented on August 24, 2024

This is the best I can do right now:

[screenshot]

Could you please redo this without excluding iowait and steal? Or better: the same graph, but with only those two modes. Are their values worth mentioning?

uhthomas commented on August 24, 2024

This query could possibly be helpful? If a lot of CPU time is spent on steal or iowait, then it would make sense that comparing against just idle would produce a wildly different graph.

avg by (mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[$__rate_interval]))

[screenshot]

I have a feeling the new query is a more accurate representation of actual CPU usage - but the graphs shown here look very suspect.

The same graph on my cluster also shows discrepancies (as expected), but not to such huge degrees.

(new on bottom)

[screenshot]

uhthomas commented on August 24, 2024

Please also see these dashboards side-by-side. The first one is the original, the second matches everything but "idle", and the third is the current query, which matches everything but "idle", "iowait" and "steal". The final graph shows CPU usage across the whole cluster by namespace. There is a clear discrepancy, and the third graph seems the most accurate to me.

[screenshot]

For context, there are 20 allocatable CPUs on the cluster. I do not see how 50% utilisation could ever make sense.

[screenshot]

The following is the same as the original image, but with stacked CPU usage to demonstrate that a value of 50% is unrealistic.

[screenshot]

uhthomas commented on August 24, 2024

This final image may also be helpful. It shows there was a spike in iowait, which is not currently accounted for.

[screenshot]

dotdc commented on August 24, 2024

Hi @uhthomas,

I ran a limited number of additional tests this morning.
Your query is good on the nodes dashboard, but the differences from my previous screenshots remain on the global view.

I've managed to get closer to your values by dividing the result by the number of nodes.

avg(sum by (cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval]))) / count(count by (node) (kube_node_info{cluster="$cluster"}))

This should work for clusters that have homogeneous node flavors across node pools, but I have concerns about clusters with heterogeneous node pools/flavors.

Could you double-check this on your setup?
Also, do you have a cluster with different node flavors to see how this behaves?

Screenshot:
[screenshot]

uhthomas commented on August 24, 2024

@dotdc I am currently running a single-node Kubernetes cluster, so I was not aware of this limitation. I imagine what's happening here is that it should be sum by (node, cpu). I can fix this when I get back later, which should resolve the issue you're seeing 😄
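
A rough sketch of what the adjusted query might look like, reusing the labels from your query above (whether the per-node label is node or instance depends on the setup):

avg(sum by (node, cpu) (rate(node_cpu_seconds_total{mode!~"idle|steal|iowait", cluster="$cluster"}[$__rate_interval])))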

uhthomas commented on August 24, 2024

Would you be able to test it for me in the meantime?

uhthomas commented on August 24, 2024

I've updated the PR @dotdc

dotdc commented on August 24, 2024

This was great, thank you both @uhthomas & @jkroepke !

dotdc commented on August 24, 2024

🎉 This issue has been resolved in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
