stackdriver / kubernetes-configs
Internal testing configurations for Stackdriver Kubernetes monitoring.
Home Page: https://cloud.google.com/monitoring/kubernetes-engine/
After deploying a new cluster with workload identity enabled, the stackdriver-metadata-agent-cluster-level
pod keeps failing with the following error:
Failed to publish resource metadata: rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.
Given the Unauthenticated code, it seems clear that the pod is not getting the proper credentials for the remote service. Could it be that the stackdriver-metadata-agent still uses Metadata Concealment and does not yet support Workload Identity?
gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" \
--region "${LOCATION}" \
--no-enable-basic-auth \
--cluster-version "1.13.7-gke.8" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-ssd" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--service-account "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
--num-nodes "1" \
--enable-stackdriver-kubernetes \
--enable-ip-alias \
--network "projects/${PROJECT_ID}/global/networks/default" \
--subnetwork "projects/${PROJECT_ID}/regions/${LOCATION}/subnetworks/default" \
--default-max-pods-per-node "110" \
--enable-network-policy \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair \
--maintenance-window "22:00" \
--database-encryption-key "projects/${PROJECT_ID}/locations/${LOCATION}/keyRings/${KEYRING_NAME}/cryptoKeys/${KEY_NAME}" \
--no-enable-legacy-authorization \
--identity-namespace "${PROJECT_ID}.svc.id.goog"
where
PROJECT_ID: Name of the Google Cloud project
CLUSTER_NAME: Name of your cluster
LOCATION: The region to create the cluster in (in our case europe-west1)
SA_NAME: The Google service account to be used. It must have the following roles: roles/logging.logWriter, roles/monitoring.metricWriter, roles/monitoring.viewer, roles/cloudkms.cryptoKeyEncrypterDecrypter, roles/stackdriver.resourceMetadata.writer, roles/storage.objectViewer
KEYRING_NAME and KEY_NAME: The Cloud KMS keyring and key that should be used to encrypt and decrypt your secrets.
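For concreteness, the variables above could be set as follows before running the create command; every value here is a hypothetical placeholder, so substitute your own project, cluster, service-account, and key names.

```shell
# Hypothetical example values; substitute your own project, cluster, and key names.
export PROJECT_ID="my-gcp-project"
export CLUSTER_NAME="monitoring-test"
export LOCATION="europe-west1"
export SA_NAME="gke-node-sa"
export KEYRING_NAME="gke-keyring"
export KEY_NAME="gke-secrets-key"

# The service-account email and KMS key resource name the flags above expand to:
echo "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "projects/${PROJECT_ID}/locations/${LOCATION}/keyRings/${KEYRING_NAME}/cryptoKeys/${KEY_NAME}"
```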
The logs of the stackdriver-metadata-agent-cluster-level-* pod were obtained with:
kubectl -n kube-system logs -f stackdriver-metadata-agent-cluster-level-*
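As a hedged sketch only (not a confirmed fix): Workload Identity normally requires an IAM binding between the pod's Kubernetes service account and a Google service account. The kube-system namespace and the "metadata-agent" KSA name below are assumptions, not taken from this repo.

```shell
# Hypothetical sketch: Workload Identity needs an IAM binding between the pod's
# Kubernetes service account (KSA) and a Google service account (GSA). The
# kube-system namespace and "metadata-agent" KSA name are assumptions.
PROJECT_ID="${PROJECT_ID:-my-gcp-project}"
MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[kube-system/metadata-agent]"
echo "$MEMBER"
# With cluster access, the binding would be created with:
#   gcloud iam service-accounts add-iam-policy-binding \
#     "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
#     --role roles/iam.workloadIdentityUser \
#     --member "$MEMBER"
```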
I would like to report an issue with the stackdriver-metadata-agent on our production GKE 1.18.17-gke.700 cluster with Cloud Logging and Monitoring enabled. The machine type of the node is n1-standard-1 (1 vCPU, 3.75 GB memory).
A few days ago (2021-06-08 9:31:xx GMT+08:00), the CPU usage of the stackdriver-metadata-agent-cluster-level pod suddenly grew drastically, and my production services on the same node suffered from severe timeout issues. See the attached CPU chart for reference.
The containers within pod are:
metadata-agent: gcr.io/stackdriver-agents/metadata-agent-go:1.2.0
metadata-agent-nanny: gke.gcr.io/addon-resizer:1.8.11-gke.1
During that time, no suspicious logs were reported by the containers.
Since I could not find the corresponding repository for the metadata agent, I would like to know whether any issue regarding this CPU load has been raised, and whether there are known resolutions. Without a concrete root cause, I'm concerned it might happen again. If my report should instead be filed on a different repository, please let me know.
Thanks for your consideration!
Ran into an issue after a kube upgrade last week where our fluentd was using high CPU. We were able to resolve it today by changing our buffer path from
/var/log/fluentd-buffers/kubernetes.containers.buffer
to
/var/log/fluentd-buffers/kubernetes.containers.*.buffer
Apparently the wildcard helps with threading.
After we resolved it we noticed that stackdriver's fluentd-gcp-v3.1.1 in the kube-system namespace is also using a lot of CPU, double what it was before the kube upgrade.
The buffer path at https://github.com/Stackdriver/kubernetes-configs/blob/master/logging-agent.yaml#L600 is
/var/log/k8s-fluentd-buffers/kubernetes.system.buffer
so perhaps
/var/log/k8s-fluentd-buffers/kubernetes.system.*.buffer
could help. It's a different plugin, so the threading wildcard may not behave the same, but in our case the CPU-usage impact was dramatic.
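The proposed path change can be sketched as a one-line sed edit; this runs against a stand-in file so the edit is demonstrable, while the real target would be the buffer path in a local copy of logging-agent.yaml.

```shell
# Minimal sketch of the proposed change, run against a stand-in file so the sed
# edit is demonstrable; the real target is the buffer path in logging-agent.yaml.
printf 'buffer_path /var/log/k8s-fluentd-buffers/kubernetes.system.buffer\n' > /tmp/logging-agent-snippet.conf
sed -i 's|kubernetes\.system\.buffer|kubernetes.system.*.buffer|' /tmp/logging-agent-snippet.conf
cat /tmp/logging-agent-snippet.conf
```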
I'm not sure whether this is the role of the stackdriver-metadata-agent, but this information is required by the Java logging client and it's not available in the workloads.
The k8s cluster is configured with Stackdriver Kubernetes Engine Monitoring and Workload Identity enabled. All agents are running in the kube-system namespace:
stackdriver-metadata-agent-cluster-level-74785fffdd-79b6v 1/1 Running 0 3h46m
The logs show no errors but the information regarding container_name and namespace_id is not available inside the containers:
root@workload-identity-test:/# curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/" -H "Metadata-Flavor: Google"
cluster-name
root@workload-identity-test:/#
e.g. only cluster-name is available, but the google-cloud-logging libraries are also looking for namespace_id and container_name.
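The missing-key check above can be sketched as a small shell function; here it is fed the key list observed in this issue (only cluster-name), and the key names checked for are the ones the logging libraries reference, which is an assumption about the expected attribute naming.

```shell
# Sketch: check which attribute keys the metadata server exposes. check_attrs
# takes the newline-separated key list; here we feed it the output observed in
# this issue (only cluster-name). Inside a pod, the list would come from:
#   curl -s -H "Metadata-Flavor: Google" \
#     "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
check_attrs() {
  for key in cluster-name namespace-id container-name; do
    if echo "$1" | grep -qx "$key"; then echo "$key present"; else echo "$key missing"; fi
  done
}
check_attrs "cluster-name"
```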
Hi,
I found this repo recently while trying to run Stackdriver outside GCP, in an on-prem k8s cluster. I've looked into https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/ but it has issues.
Are configs provided in this repo adjusted so one can run Stackdriver logging/monitoring outside GCP?
Thanks
In 1.20 the exec probe timeout will start being enforced:
Before Kubernetes 1.20, the field timeoutSeconds was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.
So if this callback was not intended/tested to run in under 1 second, the agent may start being killed under heavy load or resource starvation, as the liveness probe will start failing:
kubernetes-configs/logging-agent.yaml
Lines 46 to 64 in f01ceca
I recommend bumping the value to some large number after testing it.
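The suggested bump could be sketched as a JSON patch; the DaemonSet name and container index below are assumptions, so the real names should be checked against logging-agent.yaml before applying anything.

```shell
# Sketch of a JSON patch raising the exec-probe timeoutSeconds; the DaemonSet
# name and container index are assumptions -- check logging-agent.yaml for the
# real names before applying.
PATCH='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":30}]'
echo "$PATCH"
# Apply (requires cluster access):
#   kubectl -n kube-system patch daemonset fluentd-gcp-v3.1.1 --type=json -p "$PATCH"
```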
I am getting these errors from the stackdriver-metadata-agent-cluster-level deployment pod.
W0202 08:37:09.943363 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:38:02.854145 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:38:10.252510 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:39:02.854485 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:39:10.466841 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:40:02.854799 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:40:10.711573 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:41:02.855385 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:41:10.936932 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:42:01.927927 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:42:02.855644 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:42:11.186921 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:43:02.855968 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:43:11.412187 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:44:02.856350 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:44:11.650193 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:45:02.856618 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:45:11.891921 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:46:02.856854 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:46:12.113489 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:47:01.928158 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:47:02.857201 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:47:12.308528 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:48:02.857501 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:48:12.450212 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:49:02.857788 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:49:13.512880 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:50:02.858095 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:50:13.827352 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:51:02.858392 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:51:14.062459 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:52:01.928374 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:52:02.858863 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:52:14.319897 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:53:02.859160 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:53:14.570000 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
On the Google Cloud console I can also see that it makes 202,559 API calls to publish metadata every 24 hours, of which 92% fail. I am using a custom service account, and it has Stackdriver Resource Metadata Writer permissions.
Any idea why there are so many requests? How do I resolve it?
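The failure rate can be estimated directly from the agent logs; this sketch counts failures over a two-line sample taken from the output above, and in a cluster you would pipe `kubectl logs` into the same function.

```shell
# Sketch: count publish failures in agent logs. A two-line sample from the
# output above is used here; in a cluster you would pipe
#   kubectl -n kube-system logs deploy/stackdriver-metadata-agent-cluster-level
# into the same pipeline.
count_failures() { grep -c 'Failed to publish resource metadata'; }
printf '%s\n%s\n' \
  'W0202 08:37:09 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted' \
  'I0202 08:38:02 1 binarylog.go:265] rpc: flushed binary log to ""' \
  | count_failures
```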
We're running a few GKE clusters which have Stackdriver Monitoring manually installed using the configs from this repo (reason for manual install is mainly to add a few custom log parsing rules to the config).
After upgrading to the latest version of the configs which seem to include
some big changes to the metadata agent, the metadata agent doesn't work anymore and metadata disappears from the Kubernetes Dashboard on Stackdriver monitoring.
The metadata agent prints the following errors:
obtained via: kubectl logs -n stackdriver-agents stackdriver-metadata-agent-cluster-level-78599b584-wkprj
W0315 01:48:28.766316 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.783876 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.934092 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
The config was obtained from this url: https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml
The logging agent continues to work.
Issue seems to have been introduced by #20