stackdriver / kubernetes-configs
Internal testing configurations for Stackdriver Kubernetes monitoring.
Home Page: https://cloud.google.com/monitoring/kubernetes-engine/
After deploying a new cluster with workload identity enabled, the stackdriver-metadata-agent-cluster-level
pod keeps failing with the following error:
Failed to publish resource metadata: rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.
Given the Unauthenticated code, it seems clear that the pod is not getting the proper credentials for the remote service. Could it be that the stackdriver-metadata-agent still uses Metadata Concealment and does not yet support Workload Identity?
gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" \
--region "${LOCATION}" \
--no-enable-basic-auth \
--cluster-version "1.13.7-gke.8" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-ssd" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--service-account "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
--num-nodes "1" \
--enable-stackdriver-kubernetes \
--enable-ip-alias \
--network "projects/${PROJECT_ID}/global/networks/default" \
--subnetwork "projects/${PROJECT_ID}/regions/${LOCATION}/subnetworks/default" \
--default-max-pods-per-node "110" \
--enable-network-policy \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair \
--maintenance-window "22:00" \
--database-encryption-key "projects/${PROJECT_ID}/locations/${LOCATION}/keyRings/${KEYRING_NAME}/cryptoKeys/${KEY_NAME}" \
--no-enable-legacy-authorization \
--identity-namespace "${PROJECT_ID}.svc.id.goog"
where
PROJECT_ID: Name of the Google Cloud project
CLUSTER_NAME: Name of your cluster
LOCATION: The region to create the cluster in (in our case europe-west1)
SA_NAME: The Google service account to be used. It must have the following roles: roles/logging.logWriter, roles/monitoring.metricWriter, roles/monitoring.viewer, roles/cloudkms.cryptoKeyEncrypterDecrypter, roles/stackdriver.resourceMetadata.writer, roles/storage.objectViewer
KEYRING_NAME and KEY_NAME: The Cloud KMS keyring and key that should be used to encrypt and decrypt your secrets.
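For concreteness, the variables above could be set as follows before running the create command; every value here is a hypothetical placeholder, so substitute your own project, cluster, service-account, and key names.

```shell
# Hypothetical example values; substitute your own project, cluster, and key names.
export PROJECT_ID="my-gcp-project"
export CLUSTER_NAME="monitoring-test"
export LOCATION="europe-west1"
export SA_NAME="gke-node-sa"
export KEYRING_NAME="gke-keyring"
export KEY_NAME="gke-secrets-key"

# The service-account email and KMS key resource name the flags above expand to:
echo "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
echo "projects/${PROJECT_ID}/locations/${LOCATION}/keyRings/${KEYRING_NAME}/cryptoKeys/${KEY_NAME}"
```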
The logs of the stackdriver-metadata-agent-cluster-level-* pod were obtained with:
kubectl -n kube-system logs -f stackdriver-metadata-agent-cluster-level-*
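As a hedged sketch only (not a confirmed fix): Workload Identity normally requires an IAM binding between the pod's Kubernetes service account and a Google service account. The kube-system namespace and the "metadata-agent" KSA name below are assumptions, not taken from this repo.

```shell
# Hypothetical sketch: Workload Identity needs an IAM binding between the pod's
# Kubernetes service account (KSA) and a Google service account (GSA). The
# kube-system namespace and "metadata-agent" KSA name are assumptions.
PROJECT_ID="${PROJECT_ID:-my-gcp-project}"
MEMBER="serviceAccount:${PROJECT_ID}.svc.id.goog[kube-system/metadata-agent]"
echo "$MEMBER"
# With cluster access, the binding would be created with:
#   gcloud iam service-accounts add-iam-policy-binding \
#     "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
#     --role roles/iam.workloadIdentityUser \
#     --member "$MEMBER"
```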
I would like to report an issue with the stackdriver-metadata-agent on our production GKE 1.18.17-gke.700 cluster with Cloud Logging and Monitoring enabled. The machine type of the node is n1-standard-1 (1 vCPU, 3.75 GB memory).
A few days ago (2021-06-08 9:31:xx GMT+08:00), the CPU usage of the stackdriver-metadata-agent-cluster-level pod suddenly grew drastically, and my production services on the same node suffered from severe timeout issues. See the attached CPU chart for reference.
The containers within pod are:
metadata-agent: gcr.io/stackdriver-agents/metadata-agent-go:1.2.0
metadata-agent-nanny: gke.gcr.io/addon-resizer:1.8.11-gke.1
During that time, no suspicious logs were reported by the containers.
Since I could not find the corresponding repository for the metadata agent, I would like to know whether any issue regarding this CPU load has been raised, and whether there are known resolutions. Without a concrete root cause, I'm concerned it might happen again. If my report should instead be filed on a different repository, please let me know.
Thanks for your consideration!
Ran into an issue after a kube upgrade last week where our fluentd was using high CPU. We were able to resolve it today by changing our buffer path from
/var/log/fluentd-buffers/kubernetes.containers.buffer
to
/var/log/fluentd-buffers/kubernetes.containers.*.buffer
Apparently the wildcard helps with threading.
After we resolved it we noticed that stackdriver's fluentd-gcp-v3.1.1 in the kube-system namespace is also using a lot of CPU, double what it was before the kube upgrade.
The buffer path at https://github.com/Stackdriver/kubernetes-configs/blob/master/logging-agent.yaml#L600 is
/var/log/k8s-fluentd-buffers/kubernetes.system.buffer
so perhaps
/var/log/k8s-fluentd-buffers/kubernetes.system.*.buffer
could help. It's a different plugin, so the threading wildcard may not behave the same, but in our case the CPU-usage impact was dramatic.
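The proposed path change can be sketched as a one-line sed edit; this runs against a stand-in file so the edit is demonstrable, while the real target would be the buffer path in a local copy of logging-agent.yaml.

```shell
# Minimal sketch of the proposed change, run against a stand-in file so the sed
# edit is demonstrable; the real target is the buffer path in logging-agent.yaml.
printf 'buffer_path /var/log/k8s-fluentd-buffers/kubernetes.system.buffer\n' > /tmp/logging-agent-snippet.conf
sed -i 's|kubernetes\.system\.buffer|kubernetes.system.*.buffer|' /tmp/logging-agent-snippet.conf
cat /tmp/logging-agent-snippet.conf
```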
I'm not sure whether this is the role of the stackdriver-metadata-agent, but this information is required by the Java logging client and it's not available in the workloads.
The k8s cluster is configured with Stackdriver Kubernetes Engine Monitoring and Workload Identity enabled. All agents are running in the kube-system namespace:
stackdriver-metadata-agent-cluster-level-74785fffdd-79b6v 1/1 Running 0 3h46m
The logs show no errors but the information regarding container_name and namespace_id is not available inside the containers:
root@workload-identity-test:/# curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/" -H "Metadata-Flavor: Google"
cluster-name
root@workload-identity-test:/#
e.g. only cluster-name is available, but the google-cloud-logging libraries are also looking for namespace_id and container_name.
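The missing-key check above can be sketched as a small shell function; here it is fed the key list observed in this issue (only cluster-name), and the key names checked for are the ones the logging libraries reference, which is an assumption about the expected attribute naming.

```shell
# Sketch: check which attribute keys the metadata server exposes. check_attrs
# takes the newline-separated key list; here we feed it the output observed in
# this issue (only cluster-name). Inside a pod, the list would come from:
#   curl -s -H "Metadata-Flavor: Google" \
#     "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"
check_attrs() {
  for key in cluster-name namespace-id container-name; do
    if echo "$1" | grep -qx "$key"; then echo "$key present"; else echo "$key missing"; fi
  done
}
check_attrs "cluster-name"
```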
Hi,
I found this repo recently while trying to run Stackdriver outside GCP, in an on-prem k8s cluster. I've looked into https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud/ but it has issues.
Are configs provided in this repo adjusted so one can run Stackdriver logging/monitoring outside GCP?
Thanks
In 1.20 the exec probe timeout will start being enforced:
Before Kubernetes 1.20, the field timeoutSeconds was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.
So if this callback was not intended/tested to run in under 1 second, the agent may start being killed under heavy load or resource starvation, as the liveness probe will start failing:
kubernetes-configs/logging-agent.yaml
Lines 46 to 64 in f01ceca
I recommend bumping the value to some large number after testing it.
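The suggested bump could be sketched as a JSON patch; the DaemonSet name and container index below are assumptions, so the real names should be checked against logging-agent.yaml before applying anything.

```shell
# Sketch of a JSON patch raising the exec-probe timeoutSeconds; the DaemonSet
# name and container index are assumptions -- check logging-agent.yaml for the
# real names before applying.
PATCH='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":30}]'
echo "$PATCH"
# Apply (requires cluster access):
#   kubectl -n kube-system patch daemonset fluentd-gcp-v3.1.1 --type=json -p "$PATCH"
```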
I am getting these errors from the stackdriver-metadata-agent-cluster-level deployment pod.
W0202 08:37:09.943363 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:38:02.854145 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:38:10.252510 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:39:02.854485 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:39:10.466841 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:40:02.854799 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:40:10.711573 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:41:02.855385 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:41:10.936932 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:42:01.927927 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:42:02.855644 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:42:11.186921 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:43:02.855968 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:43:11.412187 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:44:02.856350 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:44:11.650193 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:45:02.856618 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:45:11.891921 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:46:02.856854 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:46:12.113489 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:47:01.928158 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:47:02.857201 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:47:12.308528 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:48:02.857501 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:48:12.450212 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:49:02.857788 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:49:13.512880 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:50:02.858095 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:50:13.827352 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:51:02.858392 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:51:14.062459 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:52:01.928374 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:52:02.858863 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:52:14.319897 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:53:02.859160 1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:53:14.570000 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
On the Google Cloud console I can also see that it makes 202,559 API calls to publish metadata every 24 hours, of which 92% fail. I am using a custom service account, and it has Stackdriver Resource Metadata Writer permissions.
Any idea why there are so many requests? How do I resolve it?
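The failure rate can be estimated directly from the agent logs; this sketch counts failures over a two-line sample taken from the output above, and in a cluster you would pipe `kubectl logs` into the same function.

```shell
# Sketch: count publish failures in agent logs. A two-line sample from the
# output above is used here; in a cluster you would pipe
#   kubectl -n kube-system logs deploy/stackdriver-metadata-agent-cluster-level
# into the same pipeline.
count_failures() { grep -c 'Failed to publish resource metadata'; }
printf '%s\n%s\n' \
  'W0202 08:37:09 1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted' \
  'I0202 08:38:02 1 binarylog.go:265] rpc: flushed binary log to ""' \
  | count_failures
```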
We're running a few GKE clusters which have Stackdriver Monitoring manually installed using the configs from this repo (reason for manual install is mainly to add a few custom log parsing rules to the config).
After upgrading to the latest version of the configs which seem to include
some big changes to the metadata agent, the metadata agent doesn't work anymore and metadata disappears from the Kubernetes Dashboard on Stackdriver monitoring.
The metadata agent prints the following errors:
obtained via: kubectl logs -n stackdriver-agents stackdriver-metadata-agent-cluster-level-78599b584-wkprj
W0315 01:48:28.766316 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.783876 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.934092 1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
The config was obtained from this url: https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml
The logging agent continues to work.
Issue seems to have been introduced by #20