Comments (17)
HorizontalPodAutoscaler
from k8s-stackdriver.
Hi, could you paste your HPA configuration after removing any credential information?
@CatherineF-dev to clarify: the HorizontalPodAutoscaler K8s object config, or the cluster's kube-system/antrea-controller-horizontal-autoscaler config?
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa
  namespace: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-dep
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: percent_procs_busy  # custom metric
      target:
        type: AverageValue
        averageValue: 0.4
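Assuming the manifest above is saved as hpa.yaml, a minimal sketch of applying it and verifying the adapter actually serves the metric (the jq pipe is optional and assumes jq is installed):

```shell
# Apply the HPA and check its current/target values
kubectl apply -f hpa.yaml
kubectl get hpa my-hpa -n my-app

# Verify the custom metrics API serves percent_procs_busy
# for pods in the my-app namespace
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta2/namespaces/my-app/pods/*/percent_procs_busy" | jq .
```

If the raw query returns an error or an empty items list, the problem is upstream of the HPA, in the adapter or the metric pipeline.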
kubectl describe hpa my-hpa -n my-app
This shows the HPA status and detailed error messages. What are the errors?
Name: my-hpa
Namespace: my-app
Labels: <none>
Annotations: <none>
CreationTimestamp: Wed, 17 Jan 2024 13:25:27 -0600
Reference: Deployment/my-dep
Metrics: ( current / target )
"percent_procs_busy" on pods: 0 / 400m
Min replicas: 2
Max replicas: 8
Deployment pods: 2 current / 2 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from pods metric percent_procs_busy
ScalingLimited True TooFewReplicas the desired replica count is less than the minimum replica count
Events: <none>
Disregard minReplicas being 2 here; I was just testing with more replicas to rule out a service issue with my-dep (which doesn't seem to be the case).
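For reference, scaling behavior like the above can be observed live; a sketch assuming the same my-hpa / my-app names:

```shell
# Watch the HPA's metric readings and replica counts update in real time
kubectl get hpa my-hpa -n my-app --watch

# Scaling decisions (and their reasons) are also recorded as Events
kubectl get events -n my-app \
  --field-selector involvedObject.name=my-hpa \
  --sort-by=.lastTimestamp
```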
So the HPA is working? Or does it have another issue?
There aren't any issues with the HPA itself; it gets the metrics and scales accordingly. The issue is that when it scales the replicas up, I hit the errors I posted above:
E0106 12:33:07.179132 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
W0108 12:40:03.550532 1 stackdriver.go:91] Error while fetching metric descriptors for kubedns: googleapi: Error 500: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs., backendError
E0108 23:36:07.153899 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Authentication backend unavailable., backendError
E0110 19:03:07.171679 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
E0112 02:16:07.167267 1 stackdriver.go:60] Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError
These errors are from the kube-system/kube-dns pods.
These errors seemed to happen consistently when I was testing my HPA the other day. Is there anything I can do on my end to prevent these errors from happening?
Could you search Cloud Logging for the keyword Error 503: Authentication backend unavailable to see whether another pod is raising this error? This repo contains custom-metrics-stackdriver-adapter, and the HPA itself is working.
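That search can also be run from the CLI with gcloud logging read; a sketch with a placeholder project ID and an assumed 24-hour window:

```shell
# List which pods logged the 503 in the last 24h
# (MY_PROJECT is a placeholder for your GCP project ID)
gcloud logging read \
  'resource.type="k8s_container" AND textPayload:"Authentication backend unavailable"' \
  --project=MY_PROJECT \
  --freshness=24h \
  --format='table(resource.labels.namespace_name, resource.labels.pod_name, timestamp)'
```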
Searching for Error while sending request to Stackdriver googleapi gives me errors from prometheus-to-sd-exporter and prometheus-to-sd:
"Error while sending request to Stackdriver googleapi: Error 503: Deadline expired before operation could complete., backendError"
Searching for Authentication backend unavailable gives me errors from prometheus / gcm_exporter:
{
  "caller": "export.go:940",
  "component": "gcm_exporter",
  "err": "rpc error: code = Unavailable desc = Authentication backend unavailable.",
  "level": "error",
  "msg": "send batch",
  "size": 200,
  "ts": "2024-01-18T22:36:35.913Z"
}
As well as gke-metrics-agent:
2024-01-19T05:39:01.645Z error exporterhelper/queued_retry.go:165 Exporting failed. Try enabling retry_on_failure config option. {"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}
...
2024-01-19T05:39:01.674Z warn batchprocessor/batch_processor.go:185 Sender failed {"kind": "processor", "name": "batch", "error": "rpc error: code = Unavailable desc = Authentication backend unavailable."}
The Authentication backend unavailable errors from these pods extend further back than my testing window, though. Could this be a permissions issue with prometheus?
Which line is raising the error Error while sending request to Stackdriver googleapi? You can find it in Cloud Logging.
prometheus-to-sd is here https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd
You can add debugging logs and rebuild using https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd/Makefile#L30
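A rough sketch of that rebuild loop, with placeholder image names; the exact targets should be checked against the Makefile linked above:

```shell
# Clone the repo and work in the prometheus-to-sd directory
git clone https://github.com/GoogleCloudPlatform/k8s-stackdriver.git
cd k8s-stackdriver/prometheus-to-sd

# After adding extra log statements around the Stackdriver request code,
# build a binary and a custom image (MY_PROJECT and the :debug tag are placeholders)
make build
docker build -t gcr.io/MY_PROJECT/prometheus-to-sd:debug .
docker push gcr.io/MY_PROJECT/prometheus-to-sd:debug
```

The DaemonSet or Deployment running prometheus-to-sd would then need to be pointed at the custom image to pick up the extra logging.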
For gke-metrics-agent: is it possible for you to open a Google support ticket? This repo doesn't contain gke-metrics-agent, so experts on that component can help.
@CatherineF-dev the two lines I see errors from are:
You can add debugging logs and rebuild using
Since I'm using GKE Autopilot, these components are completely managed by Google, so I'd rather not redeploy any of these services into my cluster. Are there any alternatives for debugging here?
Is it possible for you to open a google ticket for this? Since this repo doesn't have gke-metrics-agent. Then some experts can help on this.
Sure, I can do that, thanks.
@CatherineF-dev just noting I was able to run my 3000 concurrent request test again today (as well as a 5000 concurrent request test), neither of which hit the same error. I feel like this supports this comment on this post:
It’s an API error and you only see this error during peak hours (when you are making so many requests to the Stackdriver API). Since it is happening on peak hours, the API cannot handle all of the requests at that time and becomes unavailable; however, it doesn’t mean pods will not scale up. The service is currently just unavailable; it will hold the request and it will be sent again. It may take a few minutes to respond successfully to requests.
Any thoughts on this possible rate limiting theory? Otherwise we can probably just close this issue.
qq: so you didn't see this error today, without having changed any configuration?
That is correct, I made no changes to the configuration and did not see this error.
@mathe-matician got it, I am waiting for a reply from another team.
Deadline exceeded can be caused by many things, and occasional errors like this should be tolerated.
You can check the monitoring API error rate at https://pantheon.corp.google.com/apis/dashboard. If two projects show a similar pattern, then it's likely an occasional backend issue.