
Comments (13)

nabbdl commented on August 17, 2024

@jtlyk thank you for the update. I'm currently using the "cluster monitoring operator" provided with OpenShift, so I can't use jsonnet to disable the rule. The only thing I can do for now is to disable "etcd monitoring" completely. Or maybe the cluster-monitoring-operator will update itself and pick up the modification?

nabbdl commented on August 17, 2024

To be more specific, the query behind this alert always returns "100" as its value. I suspect the query is wrong.
The current query is:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

In my opinion it should be (remove the * 100 at the beginning):

sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
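
To see which series is actually producing that constant 100, a per-method and per-code breakdown like the following can be run in the Prometheus UI (just a diagnostic sketch, not one of the shipped rules):

sum by(grpc_service, grpc_method, grpc_code) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m]))

If only the Watch method shows up there, the constant 100 comes from watch streams being counted as failed requests rather than from real errors.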

benhwebster commented on August 17, 2024

I also see this on OpenShift 3.11 after enabling etcd monitoring.

metalmatze commented on August 17, 2024

The current query seems correct to me: the * 100 turns the failure ratio into a percentage, so > 5 alerts on more than 5% of requests failing (dropping the factor would turn the threshold into 500%, which could never fire).
On my personal cluster I can see that there's a Watch method pending for 4 minutes:

{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.135.73.45",job="etcd"}

Is it the same for you? Maybe we should ignore the watches here.
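
One way to try that (only a sketch, assuming it is the Watch terminations that inflate the ratio) is to filter the method out on both sides of the division:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",grpc_method!="Watch",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_method!="Watch",job=~".*etcd.*"}[5m])) > 5

Watch streams normally terminate with a non-OK gRPC code when the client goes away, so the per-method ratio for Watch sits at 100% even when nothing is actually failing, which would explain the constant 100 reported above.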

nabbdl commented on August 17, 2024

I tested your query and the result is empty. The strange thing for me is that I get the same alert on all of the OpenShift clusters I have installed, and the query for EtcdHighNumberOfFailedGRPCRequests always gives a value of 100.
[screenshot of the query result]

zot24 commented on August 17, 2024

Hi @nabbdl, just wondering if you are still seeing that error and whether you solved the mystery? I'm facing the same error here and I'm not sure why it's happening.

nabbdl commented on August 17, 2024

Hi @zot24. Unfortunately, I'm still seeing the same error.

zot24 commented on August 17, 2024

@nabbdl @jtlyk I have been doing some research, and after reading a lot of comments I think I'll just ignore those alerts for now (see poseidon/typhoon#175). There are a bunch of issues about this error message, but what's going on is pretty much summarized in etcd-io/etcd#10289, and there is still no fix for it.

In more detail, I think this is the offending line: https://github.com/gyuho/etcd/blob/0cf9382024da6132cb5f0778c3fb43e4a6c88afd/etcdserver/api/v3rpc/util.go#L111

zot24 commented on August 17, 2024

If you're using jsonnet, you could add the following to suppress that rule for now:

{
  prometheusAlerts+:: {
    groups: std.map(
      function(group)
        if group.name == 'etcd' then
          group {
            rules: std.filter(
              function(rule)
                rule.alert != 'etcdHighNumberOfFailedGRPCRequests',
              group.rules
            ),
          }
        else
          group,
      super.groups
    ),
  },
}
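
For context, that overlay is meant to be mixed into the kube-prometheus entrypoint and rendered with jsonnet as usual; roughly along these lines (the file name and import path are assumptions about a standard kube-prometheus vendor/ layout):

// example.jsonnet (hypothetical entrypoint)
local kp = (import 'kube-prometheus/kube-prometheus.libsonnet') + {
  prometheusAlerts+:: {
    // ... the std.map/std.filter overlay from above goes here ...
  },
};

{ ['prometheus-' + name]: kp.prometheus[name] for name in std.objectFields(kp.prometheus) }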

zot24 commented on August 17, 2024

@nabbdl @jtlyk this just got merged: #340

benhwebster commented on August 17, 2024

You could do what I did in my clusters and create a silence for those alerts in alertmanager, but it does look like they may be backporting it currently: #383
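
For reference, such a silence can be created in the Alertmanager UI or with amtool; roughly along these lines (the URL, duration and comment are just placeholders):

amtool silence add alertname=etcdHighNumberOfFailedGRPCRequests --alertmanager.url=http://localhost:9093 --comment="known false positive, see etcd-io/etcd#10289" --duration=168h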

paulfantom commented on August 17, 2024

Fix backported in #383

/close

openshift-ci-robot commented on August 17, 2024

@paulfantom: Closing this issue.

In response to this:

Fix backported in #383

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
