
Comments (13)

nabbdl commented on August 17, 2024

@jtlyk thank you for the update. I'm currently using the "cluster monitoring operator" provided with OpenShift, so I can't use jsonnet to disable the rule. The only thing I can do for now is to disable "etcd monitoring" completely. Or maybe the cluster-monitoring-operator will update itself and pick up the modification?

nabbdl commented on August 17, 2024

To be more specific, the query behind this alert always returns "100" as its value. I suspect the query is wrong.
The current query is:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5

In my opinion it should be (remove the * 100 at the beginning):

sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) > 5
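
To see which series is actually producing that constant 100, a per-method and per-code breakdown like the following can be run in the Prometheus UI (just a diagnostic sketch, not one of the shipped rules):

sum by(grpc_service, grpc_method, grpc_code) (rate(grpc_server_handled_total{grpc_code!="OK",job=~".*etcd.*"}[5m]))

If only the Watch method shows up there, the constant 100 comes from watch streams being counted as failed requests rather than from real errors.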

benhwebster commented on August 17, 2024

I also see this on OpenShift 3.11 after enabling etcd monitoring.

metalmatze commented on August 17, 2024

The current query seems correct to me: the * 100 turns the failure ratio into a percentage, so > 5 alerts on more than 5% of requests failing (dropping the factor would turn the threshold into 500%, which could never fire).
On my personal cluster I can see that there's a Watch method pending for 4 minutes:

{grpc_method="Watch",grpc_service="etcdserverpb.Watch",instance="10.135.73.45",job="etcd"}

Is it the same for you? Maybe we should ignore the watches here.
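
One way to try that (only a sketch, assuming it is the Watch terminations that inflate the ratio) is to filter the method out on both sides of the division:

100 * sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_code!="OK",grpc_method!="Watch",job=~".*etcd.*"}[5m])) / sum by(job, instance, grpc_service, grpc_method) (rate(grpc_server_handled_total{grpc_method!="Watch",job=~".*etcd.*"}[5m])) > 5

Watch streams normally terminate with a non-OK gRPC code when the client goes away, so the per-method ratio for Watch sits at 100% even when nothing is actually failing, which would explain the constant 100 reported above.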

nabbdl commented on August 17, 2024

I tested your query and the result is empty. The strange thing for me is that I get the same alert on all of the OpenShift clusters I have installed, and the query for EtcdHighNumberOfFailedGRPCRequests always gives a value of 100.
[screenshot of the query result]

zot24 commented on August 17, 2024

Hi @nabbdl, just wondering if you are still seeing that error and whether you solved the mystery? I'm facing the same error here and I'm not sure why it's happening.

nabbdl commented on August 17, 2024

Hi @zot24. Unfortunately, I'm still seeing the same error.

zot24 commented on August 17, 2024

@nabbdl @jtlyk I have been doing some research, and after reading a lot of comments I think I'll just ignore those alerts for now (see poseidon/typhoon#175). There are a bunch of issues about this error message, but what's going on is pretty much summarized in etcd-io/etcd#10289, and there is still no fix for it.

In more detail, I think this is the offending line: https://github.com/gyuho/etcd/blob/0cf9382024da6132cb5f0778c3fb43e4a6c88afd/etcdserver/api/v3rpc/util.go#L111

zot24 commented on August 17, 2024

If you're using jsonnet, you could add the following to suppress that rule for now:

{
  prometheusAlerts+:: {
    groups: std.map(
      function(group)
        if group.name == 'etcd' then
          group {
            rules: std.filter(
              function(rule)
                rule.alert != 'etcdHighNumberOfFailedGRPCRequests',
              group.rules
            ),
          }
        else
          group,
      super.groups
    ),
  },
}
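
For context, that overlay is meant to be mixed into the kube-prometheus entrypoint and rendered with jsonnet as usual; roughly along these lines (the file name and import path are assumptions about a standard kube-prometheus vendor/ layout):

// example.jsonnet (hypothetical entrypoint)
local kp = (import 'kube-prometheus/kube-prometheus.libsonnet') + {
  prometheusAlerts+:: {
    // ... the std.map/std.filter overlay from above goes here ...
  },
};

{ ['prometheus-' + name]: kp.prometheus[name] for name in std.objectFields(kp.prometheus) }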

zot24 commented on August 17, 2024

@nabbdl @jtlyk this just got merged: #340

benhwebster commented on August 17, 2024

You could do what I did in my clusters and create a silence for those alerts in alertmanager, but it does look like they may be backporting it currently: #383
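
For reference, such a silence can be created in the Alertmanager UI or with amtool; roughly along these lines (the URL, duration and comment are just placeholders):

amtool silence add alertname=etcdHighNumberOfFailedGRPCRequests --alertmanager.url=http://localhost:9093 --comment="known false positive, see etcd-io/etcd#10289" --duration=168h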

paulfantom commented on August 17, 2024

Fix backported in #383

/close

openshift-ci-robot commented on August 17, 2024

@paulfantom: Closing this issue.

In response to this:

Fix backported in #383

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
