Comments (17)

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

HorizontalPodAutoscaler

from k8s-stackdriver.

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

Hi, could you paste your HPA configuration after removing some credential information?

mathe-matician avatar mathe-matician commented on August 16, 2024

@CatherineF-dev to clarify, the HorizontalPodAutoscaler K8s object config? Or the cluster's kube-system/antrea-controller-horizontal-autoscaler config?

mathe-matician avatar mathe-matician commented on August 16, 2024
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: percent_procs_busy  # custom metric
        target:
          type: AverageValue
          averageValue: 0.4

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

This can show HPA status and detailed error messages. What are the errors?

kubectl describe hpa my-hpa -n my-app

mathe-matician avatar mathe-matician commented on August 16, 2024
Name:                                                my-hpa
Namespace:                                           my-app
Labels:                                              <none>
Annotations:                                         <none>
CreationTimestamp:                                   Wed, 17 Jan 2024 13:25:27 -0600
Reference:                                           Deployment/my-dep
Metrics:                                             ( current / target )
  "percent_procs_busy" on pods:  0 / 400m
Min replicas:                                        2
Max replicas:                                        8
Deployment pods:                                     2 current / 2 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric percent_procs_busy
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:           <none>

Disregard the min replicas value of 2 shown here: I was testing with more starting replicas to see whether this was a service issue with my-dep (which doesn't seem to be the case).
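
For readers following the describe output above, here is a minimal sketch of the replica calculation documented for the HPA (desired = ceil(current replicas * current value / target value), then clamped to the min/max bounds). The function name and parameters are hypothetical, and the real controller also accounts for unready pods, missing metrics, and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation for a Pods metric with an AverageValue target."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance of the target: keep the current size.
        raw = current_replicas
    else:
        raw = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, raw))


# Values from the describe output: metric 0 / 400m, 2 pods, min 2, max 8.
idle = desired_replicas(2, 0.0, 0.4, min_replicas=2, max_replicas=8)
# Assumed load for illustration: average of 0.8 busy against the 0.4 target.
busy = desired_replicas(2, 0.8, 0.4, min_replicas=1, max_replicas=8)
```

With a metric value of 0, the unclamped result is below the minimum, which is consistent with the ScalingLimited / TooFewReplicas condition shown above.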

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

So the HPA is working? Or is there another issue?

mathe-matician avatar mathe-matician commented on August 16, 2024

There aren't any issues with the HPA itself; it gets the metrics and scales accordingly. The issue is that when it scales the replicas up, I seem to hit the errors I posted above:

E0106 12:33:07.179132       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
W0108 12:40:03.550532       1 stackdriver.go:91] Error while fetching metric descriptors for kubedns: googleapi: Error 500: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs., backendError
E0108 23:36:07.153899       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Authentication backend unavailable., backendError
E0110 19:03:07.171679       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
E0112 02:16:07.167267       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError

These errors are from the kube-system/kube-dns pods.

These errors seemed to happen consistently when I was testing my HPA the other day. Is there anything I can do on my end to prevent these errors from happening?
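
To check whether these errors cluster around the test runs, one option is to tally exported log lines by source location and HTTP status. The sketch below assumes the lines follow the glog/klog header layout seen above (severity letter, MMDD, time, thread id, file:line]); the function name `summarize` and the regular expressions are illustrative, not part of prometheus-to-sd.

```python
import re
from collections import Counter

# Header layout assumed from the log lines above: E0106 12:33:07.179132  1 file.go:60] msg
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<pid>\d+)\s+"
    r"(?P<file>[^:\s]+):(?P<line>\d+)\]\s(?P<message>.*)$"
)
HTTP_CODE = re.compile(r"Error (\d{3}):")


def summarize(lines):
    """Count log lines per (source location, HTTP status code)."""
    counts = Counter()
    for raw in lines:
        match = KLOG_LINE.match(raw.strip())
        if match is None:
            continue  # Skip anything that is not a glog-style line.
        code = HTTP_CODE.search(match["message"])
        location = f"{match['file']}:{match['line']}"
        counts[(location, code.group(1) if code else "unknown")] += 1
    return counts


SAMPLE_LINES = [
    "E0106 12:33:07.179132       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError",
    "W0108 12:40:03.550532       1 stackdriver.go:91] Error while fetching metric descriptors for kubedns: googleapi: Error 500: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs., backendError",
    "E0108 23:36:07.153899       1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Authentication backend unavailable., backendError",
]

for key, count in sorted(summarize(SAMPLE_LINES).items()):
    print(key, count)
```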

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

Could you search Cloud Logging for the keyword "Error 503: Authentication backend unavailable" to see whether another pod is raising this error?

This repo contains custom-metrics-stackdriver-adapter, and the HPA itself is working.

mathe-matician avatar mathe-matician commented on August 16, 2024

Searching for "Error while sending request to Stackdriver googleapi" gives me errors from prometheus-to-sd-exporter and prometheus-to-sd:

"Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError"

Searching for "Authentication backend unavailable" gives me errors from prometheus / gcm_exporter:

{
    "caller": "export.go:940",
    "component": "gcm_exporter",
    "err": "rpc error: code = Unavailable desc = Authentication backend unavailable.",
    "level": "error",
    "msg": "send batch",
    "size": 200,
    "ts": "2024-01-18T22:36:35.913Z"
}

As well as gke-metrics-agent:

2024-01-19T05:39:01.645Z	error	exporterhelper/queued_retry.go:165	Exporting failed. Try enabling retry_on_failure config option.	{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}
...
2024-01-19T05:39:01.674Z	warn	batchprocessor/batch_processor.go:185	Sender failed	{"kind": "processor", "name": "batch", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}

The "Authentication backend unavailable" errors from these pods go back further than my testing, though. Could this be a permissions issue with Prometheus?

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

Which line is raising the error "Error while sending request to Stackdriver googleapi"? You can find it in Cloud Logging.

prometheus-to-sd is here: https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd

You can add debugging logs and rebuild using https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd/Makefile#L30

Regarding gke-metrics-agent: could you open a Google support ticket for this? This repo doesn't include gke-metrics-agent, so the experts there will be better placed to help.

mathe-matician avatar mathe-matician commented on August 16, 2024

@CatherineF-dev the two lines I see errors from are:

glog.Errorf("Error while sending request to Stackdriver %v", err)

glog.Warningf("Error while fetching metric descriptors for %v: %v", config.SourceConfig.Component, err)

"You can add debugging logs and rebuild using"

Since I'm using GKE Autopilot, these components are completely managed by Google, so I'm hesitant to redeploy any of these services into my cluster. Are there any alternative ways to debug this?

"Is it possible for you to open a google ticket for this? Since this repo doesn't have gke-metrics-agent. Then some experts can help on this."

Sure, I can do that, thanks.

mathe-matician avatar mathe-matician commented on August 16, 2024

@CatherineF-dev just noting that I was able to run my 3000 concurrent request test again today (as well as a 5000 concurrent request test), and neither hit the same error. I feel like this supports this comment on this post:

"It's an API error and you only see this error during peak hours (when you are making so many requests to the Stackdriver API). Since it is happening during peak hours, the API cannot handle all of the requests at that time and becomes unavailable; however, it doesn't mean pods will not scale up. The service is currently just unavailable; it will hold the request and it will be sent again. It may take a few minutes to respond successfully to requests."

Any thoughts on this possible rate limiting theory? Otherwise we can probably just close this issue.
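
For context on the "hold the request and send it again" behaviour described in that quote, here is a generic sketch of retrying a transient failure with exponential backoff and jitter. It is not how prometheus-to-sd or gke-metrics-agent are implemented; `TransientError` and `call_with_backoff` are hypothetical names, and the pattern would only apply to a client you control, not to the Google-managed components on Autopilot.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable backend error such as HTTP 500 or 503."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # The delay cap doubles each attempt; full jitter spreads out retries.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the function easy to exercise in tests without real waiting.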

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

Quick question: so you didn't see this error today, without changing any configuration?

mathe-matician avatar mathe-matician commented on August 16, 2024

That is correct, I made no changes to the configuration and did not see this error.

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

@mathe-matician got it. I am waiting for a reply from another team.

CatherineF-dev avatar CatherineF-dev commented on August 16, 2024

Deadline exceeded can be caused by many things, and occasional errors like this should be tolerated.

You can check the monitoring error rate at https://pantheon.corp.google.com/apis/dashboard. If two projects show a similar pattern, then I think it's likely an occasional backend issue.
