Comments (17)
HorizontalPodAutoscaler
from k8s-stackdriver.
Hi, could you paste your HPA configuration after removing any credential information?
@CatherineF-dev to clarify: the HorizontalPodAutoscaler K8s object config, or the cluster's kube-system/antrea-controller-horizontal-autoscaler config?
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: percent_procs_busy  # custom metric
      target:
        type: AverageValue
        averageValue: 0.4
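Assuming the manifest above is saved as hpa.yaml, a minimal sketch of applying it and verifying the adapter actually serves the metric (the jq pipe is optional and assumes jq is installed):

```shell
# Apply the HPA and check its current/target values
kubectl apply -f hpa.yaml
kubectl get hpa my-hpa -n my-app

# Verify the custom metrics API serves percent_procs_busy
# for pods in the my-app namespace
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta2/namespaces/my-app/pods/*/percent_procs_busy" | jq .
```

If the raw query returns an error or an empty items list, the problem is upstream of the HPA, in the adapter or the metric pipeline.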
kubectl describe hpa my-hpa -n my-app
This shows the HPA status and detailed error messages. What are the errors?
Name: my-hpa
Namespace: my-app
Labels: <none>
Annotations: <none>
CreationTimestamp: Wed, 17 Jan 2024 13:25:27 -0600
Reference: Deployment/my-dep
Metrics: ( current / target )
"percent_procs_busy" on pods: 0 / 400m
Min replicas: 2
Max replicas: 8
Deployment pods: 2 current / 2 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric percent_procs_busy
ScalingLimited True TooFewReplicas the desired replica count is less than the minimum replica count
Events: <none>
Disregard minReplicas being 2 here; I was just testing with more replicas to rule out a service issue with my-dep (which doesn't seem to be the case).
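For reference, scaling behavior like the above can be observed live; a sketch assuming the same my-hpa / my-app names:

```shell
# Watch the HPA's metric readings and replica counts update in real time
kubectl get hpa my-hpa -n my-app --watch

# Scaling decisions (and their reasons) are also recorded as Events
kubectl get events -n my-app \
  --field-selector involvedObject.name=my-hpa \
  --sort-by=.lastTimestamp
```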
So the HPA is working? Or does it have another issue?
There aren't any issues with the HPA itself; it gets the metrics and scales accordingly. The issue is that when it scales the replicas up, I hit the errors I posted above:
E0106 12:33:07.179132 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
W0108 12:40:03.550532 1 stackdriver.go:91] Error while fetching metric descriptors for kubedns: googleapi: Error 500: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs., backendError
E0108 23:36:07.153899 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Authentication backend unavailable., backendError
E0110 19:03:07.171679 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
E0112 02:16:07.167267 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
These errors are from the kube-system/kube-dns pods.
These errors seemed to happen consistently when I was testing my HPA the other day. Is there anything I can do on my end to prevent these errors from happening?
Could you search Cloud Logging for the keyword Error 503: Authentication backend unavailable to see whether another pod is raising this error? This repo contains custom-metrics-stackdriver-adapter, and the HPA itself is working.
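That search can also be run from the CLI with gcloud logging read; a sketch with a placeholder project ID and an assumed 24-hour window:

```shell
# List which pods logged the 503 in the last 24h
# (MY_PROJECT is a placeholder for your GCP project ID)
gcloud logging read \
  'resource.type="k8s_container" AND textPayload:"Authentication backend unavailable"' \
  --project=MY_PROJECT \
  --freshness=24h \
  --format='table(resource.labels.namespace_name, resource.labels.pod_name, timestamp)'
```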
Searching for Error while sending request to Stackdriver googleapi gives me errors from prometheus-to-sd-exporter and prometheus-to-sd:
"Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError"
Searching for Authentication backend unavailable gives me errors from prometheus / gcm_exporter:
{
  "caller": "export.go:940",
  "component": "gcm_exporter",
  "err": "rpc error: code = Unavailable desc = Authentication backend unavailable.",
  "level": "error",
  "msg": "send batch",
  "size": 200,
  "ts": "2024-01-18T22:36:35.913Z"
}
As well as gke-metrics-agent:
2024-01-19T05:39:01.645Z error exporterhelper/queued_retry.go:165 Exporting failed. Try enabling retry_on_failure config option. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}
...
2024-01-19T05:39:01.674Z warn batchprocessor/batch_processor.go:185 Sender failed {"kind": "processor", "name": "batch", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}
The Authentication backend unavailable errors from these pods extend further back than my testing window, though. Could this be a permissions issue with prometheus?
Which line is raising the error Error while sending request to Stackdriver googleapi? You can find it in Cloud Logging.
prometheus-to-sd is here https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd
You can add debugging logs and rebuild using https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd/Makefile#L30
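A rough sketch of that rebuild loop, with placeholder image names; the exact targets should be checked against the Makefile linked above:

```shell
# Clone the repo and work in the prometheus-to-sd directory
git clone https://github.com/GoogleCloudPlatform/k8s-stackdriver.git
cd k8s-stackdriver/prometheus-to-sd

# After adding extra log statements around the Stackdriver request code,
# build a binary and a custom image (MY_PROJECT and the :debug tag are placeholders)
make build
docker build -t gcr.io/MY_PROJECT/prometheus-to-sd:debug .
docker push gcr.io/MY_PROJECT/prometheus-to-sd:debug
```

The DaemonSet or Deployment running prometheus-to-sd would then need to be pointed at the custom image to pick up the extra logging.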
For gke-metrics-agent: is it possible for you to open a Google support ticket? This repo doesn't contain gke-metrics-agent, so experts on that component can help.
@CatherineF-dev the two lines I see errors from are:
You can add debugging logs and rebuild using
Since I'm using GKE Autopilot, these components are completely managed by Google, so I'd rather not redeploy any of these services into my cluster. Are there any alternatives for debugging here?
Is it possible for you to open a google ticket for this? Since this repo doesn't have gke-metrics-agent. Then some experts can help on this.
Sure, I can do that, thanks.
@CatherineF-dev just noting I was able to run my 3000 concurrent request test again today (as well as a 5000 concurrent request test), neither of which hit the same error. I feel like this supports this comment on this post:
It’s an API error and you only see this error during peak hours (when you are making so many requests to the Stackdriver API). Since it is happening on peak hours, the API cannot handle all of the requests at that time and becomes unavailable; however, it doesn’t mean pods will not scale up. The service is currently just unavailable; it will hold the request and it will be sent again. It may take a few minutes to respond successfully to requests.
Any thoughts on this possible rate limiting theory? Otherwise we can probably just close this issue.
qq: so you didn't see this error today, without having changed any configuration?
That is correct, I made no changes to the configuration and did not see this error.
@mathe-matician got it, I am waiting for a reply from another team.
Deadline exceeded can be caused by many things, and occasional errors like this should be tolerated.
You can check the monitoring API error rate at https://pantheon.corp.google.com/apis/dashboard. If two projects show a similar pattern, then it's likely an occasional backend issue.