
kubernetes-configs's People

Contributors

bmoyles0117, davidbtucker, farcaller, igorpeshansky, jkohen, jkschulz, qingling128, rbuskens, sophieyfang, stackdriver-instrumentation-release, stevenycchou

kubernetes-configs's Issues

Metadata agent fails to authenticate with Google when Workload Identity is enabled

After deploying a new cluster with workload identity enabled, the stackdriver-metadata-agent-cluster-level pod keeps failing with the following error:

Failed to publish resource metadata: rpc error: code = Unauthenticated desc = Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.

Given the Unauthenticated error code, it seems clear that the pod does not obtain the proper credentials for the remote service. Could it be that the stackdriver-metadata-agent still relies on Metadata Concealment and does not support Workload Identity yet?

Steps to reproduce

  1. Deploy a cluster using the following command:
gcloud beta container --project "${PROJECT_ID}" clusters create "${CLUSTER_NAME}" \
 --region "${LOCATION}" \
 --no-enable-basic-auth \
 --cluster-version "1.13.7-gke.8" \
 --machine-type "n1-standard-1" \
 --image-type "COS" \
 --disk-type "pd-ssd" \
 --disk-size "100" \
 --metadata disable-legacy-endpoints=true \
 --service-account "${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com" \
 --num-nodes "1" \
 --enable-stackdriver-kubernetes \
 --enable-ip-alias \
 --network "projects/${PROJECT_ID}/global/networks/default" \
 --subnetwork "projects/${PROJECT_ID}/regions/${LOCATION}/subnetworks/default" \
 --default-max-pods-per-node "110" \
 --enable-network-policy \
 --addons HorizontalPodAutoscaling,HttpLoadBalancing \
 --enable-autoupgrade \
 --enable-autorepair \
 --maintenance-window "22:00" \
 --database-encryption-key "projects/${PROJECT_ID}/locations/${LOCATION}/keyRings/${KEYRING_NAME}/cryptoKeys/${KEY_NAME}" \
 --no-enable-legacy-authorization \
 --identity-namespace "${PROJECT_ID}.svc.id.goog"

where
PROJECT_ID: name of the Google Cloud project
CLUSTER_NAME: name of your cluster
LOCATION: the region to create the cluster in (in our case europe-west1)
SA_NAME: the Google service account to use. It must have the following roles: roles/logging.logWriter, roles/monitoring.metricWriter, roles/monitoring.viewer, roles/cloudkms.cryptoKeyEncrypterDecrypter, roles/stackdriver.resourceMetadata.writer, roles/storage.objectViewer
KEYRING_NAME and KEY_NAME: the Cloud KMS keyring and key used to encrypt and decrypt your secrets

  2. Tail the logs for stackdriver-metadata-agent-cluster-level-*:
kubectl -n kube-system logs -f stackdriver-metadata-agent-cluster-level-*
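
For anyone debugging the same failure, a rough sketch of two checks that help narrow down whether Workload Identity is actually in play (the node pool name, the gcloud output fields, and the availability of curl in the agent image are assumptions, not taken from this issue):

# 1) Confirm the node pool is actually serving the GKE metadata server
#    (Workload Identity) rather than the legacy/concealed node metadata.
gcloud container node-pools describe default-pool \
 --cluster "${CLUSTER_NAME}" --region "${LOCATION}" \
 --format "yaml(config.workloadMetadataConfig)"

# 2) From inside the agent pod, ask the metadata server which identity it
#    serves; with Workload Identity this should be the Google service account
#    bound to the pod's Kubernetes service account.
kubectl -n kube-system exec deploy/stackdriver-metadata-agent-cluster-level -- \
 curl -s -H "Metadata-Flavor: Google" \
 "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"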

Sudden high CPU usage from stackdriver-metadata-agent-cluster-level pod

I would like to report an issue with the stackdriver-metadata-agent on our production GKE 1.18.17-gke.700 cluster with Cloud Logging and Monitoring enabled. The node machine type is n1-standard-1 (1 vCPU, 3.75 GB memory).

A few days ago (2021-06-08 09:31 GMT+08:00), the CPU usage of the stackdriver-metadata-agent-cluster-level pod suddenly grew drastically, and my production services on the same node suffered severe timeouts. See the attached CPU chart for reference.

(Screenshot: CPU usage chart, 2021-06-11 16:31)

The containers within the pod are:
metadata-agent: gcr.io/stackdriver-agents/metadata-agent-go:1.2.0
metadata-agent-nanny: gke.gcr.io/addon-resizer:1.8.11-gke.1
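
For anyone checking the same thing, a rough sketch of how the agent's CPU usage and the nanny-managed resource requests can be inspected (the label selector below is an assumption, not taken from the manifest):

kubectl -n kube-system top pods -l app=stackdriver-metadata-agent   # label selector assumed
kubectl -n kube-system get pod -l app=stackdriver-metadata-agent \
 -o jsonpath='{.items[0].spec.containers[*].resources}'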

During that time, no suspicious logs were reported by either container.

metadata-agent logs:
(Screenshot: metadata-agent log output, 2021-06-11 17:50)

Since I could not find the corresponding repository for the metadata agent, I would like to know whether an issue about this CPU load has already been raised and whether there are known resolutions. Without a concrete root cause, I am concerned it might happen again. If this report should be filed in a different repository, please let me know.

Thanks for your consideration!

High CPU usage with buffer path that does not have wildcard

Ran into an issue after a kube upgrade last week where our fluentd was using high CPU. We were able to resolve it today by changing our buffer path from
/var/log/fluentd-buffers/kubernetes.containers.buffer
to
/var/log/fluentd-buffers/kubernetes.containers.*.buffer
Apparently the wildcard helps with threading.

After we resolved it, we noticed that Stackdriver's fluentd-gcp-v3.1.1 in the kube-system namespace is also using a lot of CPU, double what it was before the kube upgrade.

https://github.com/Stackdriver/kubernetes-configs/blob/master/logging-agent.yaml#L600
currently uses
/var/log/k8s-fluentd-buffers/kubernetes.system.buffer
and perhaps
/var/log/k8s-fluentd-buffers/kubernetes.system.*.buffer
could help. It is a different plugin, so the threading wildcard may not behave the same way, but in our case the CPU impact was dramatic.
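
For anyone comparing the two, a rough sketch of how the fluentd-gcp CPU usage and the deployed buffer_path can be checked (the label selector is an assumption, and the ConfigMap name varies by release, so this greps across all of them):

# Check fluentd-gcp CPU before/after the change (label selector assumed).
kubectl -n kube-system top pods -l k8s-app=fluentd-gcp

# Confirm which buffer_path the deployed config actually uses.
kubectl -n kube-system get configmap -o yaml | grep -n "buffer_path"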

namespace-id and container_name not registered in the default installation

I'm not sure whether this is the role of the stackdriver-metadata-agent, but this information is required by the Java logging client and is not available in the workloads.

The k8s cluster is configured with Stackdriver Kubernetes Engine Monitoring and Workload Identity enabled. All agents are running in the kube-system namespace:

stackdriver-metadata-agent-cluster-level-74785fffdd-79b6v        1/1     Running   0          3h46m

The logs show no errors, but the information regarding container_name and namespace_id is not available inside the containers:

root@workload-identity-test:/# curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/"  -H "Metadata-Flavor: Google"
cluster-name
root@workload-identity-test:/#

i.e. only cluster-name is available, but the google-cloud-logging libraries also look for namespace-id and container_name.
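
For completeness, the full attribute list the workload can see can be confirmed with a recursive metadata query (this command is a suggestion, not something taken from the issue):

curl -s -H "Metadata-Flavor: Google" \
 "http://metadata.google.internal/computeMetadata/v1/instance/attributes/?recursive=true"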

Is the timeoutSeconds=1 intended?

In 1.20 the exec probe timeout will start being enforced:

Before Kubernetes 1.20, the field timeoutSeconds was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.

So if this probe was not intended or tested to complete within 1 second, the agent may start being killed under heavy load or resource starvation, because the liveness probe will start failing:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}; STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900}; if [ ! -e /var/run/google-fluentd/buffers ]; then
        exit 1;
      fi; touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-stuck -print -quit)" ]]; then
        rm -rf /var/run/google-fluentd/buffers;
        exit 1;
      fi; touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-liveness -print -quit)" ]]; then
        exit 1;
      fi;
  failureThreshold: 3
  initialDelaySeconds: 600
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 1

I recommend bumping the value to a much larger number after testing it.
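
As a stopgap, a hedged sketch of raising the timeout on the running DaemonSet; the DaemonSet name and container index below are assumptions, and on GKE the addon manager may revert manual edits to kube-system addons:

kubectl -n kube-system patch daemonset fluentd-gcp-v3.1.1 --type=json \
 -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 30}]'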

429 Resource has been exhausted

I am getting these errors in the stackdriver-metadata-agent-cluster-level deployment pod:

W0202 08:37:09.943363       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:38:02.854145       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:38:10.252510       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:39:02.854485       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:39:10.466841       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:40:02.854799       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:40:10.711573       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:41:02.855385       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:41:10.936932       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:42:01.927927       1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:42:02.855644       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:42:11.186921       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:43:02.855968       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:43:11.412187       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:44:02.856350       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:44:11.650193       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:45:02.856618       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:45:11.891921       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:46:02.856854       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:46:12.113489       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:47:01.928158       1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:47:02.857201       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:47:12.308528       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:48:02.857501       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:48:12.450212       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:49:02.857788       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:49:13.512880       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:50:02.858095       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:50:13.827352       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:51:02.858392       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:51:14.062459       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:52:01.928374       1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
I0202 08:52:02.858863       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:52:14.319897       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).
I0202 08:53:02.859160       1 binarylog.go:265] rpc: flushed binary log to ""
W0202 08:53:14.570000       1 kubernetes.go:118] Failed to publish resource metadata: rpc error: code = ResourceExhausted desc = Resource has been exhausted (e.g. check quota).

On the Google Cloud console I can also see the corresponding API errors. (Screenshot of the API metrics omitted.)

Every 24 hours it makes 202,559 API calls to publish metadata, of which 92% fail. I am using a custom service account that has the Stackdriver Resource Metadata Writer role.

Any idea why there are so many requests, and how I can resolve this?
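
For context, a rough sketch of how the failing publish calls can be counted over a day (the namespace and flags are assumptions; the deployment name is the one quoted above):

kubectl -n kube-system logs deploy/stackdriver-metadata-agent-cluster-level --since=24h \
 | grep -c "ResourceExhausted"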

Metadata agent not working: Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request

We're running a few GKE clusters that have Stackdriver Monitoring manually installed using the configs from this repo (the reason for the manual install is mainly to add a few custom log-parsing rules to the config).

After upgrading to the latest version of the configs, which seems to include some big changes to the metadata agent, the metadata agent no longer works and metadata disappears from the Kubernetes Dashboard in Stackdriver Monitoring.

The metadata agent prints the following errors (obtained via kubectl logs -n stackdriver-agents stackdriver-metadata-agent-cluster-level-78599b584-wkprj):

W0315 01:48:28.766316       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.783876       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request
W0315 01:48:28.934092       1 kubernetes.go:104] Failed to publish resource metadata: Unexpected response code from confluence: 400 Bad Request

The config was obtained from this url: https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/stable/agents.yaml

The logging agent continues to work.

The issue seems to have been introduced by #20.
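
As a temporary workaround, a hedged sketch of rolling back to an earlier version of the configs while this is investigated; <previous-ref> below is a placeholder for whatever ref was last known to work, not a specific version:

kubectl apply -f https://raw.githubusercontent.com/Stackdriver/kubernetes-configs/<previous-ref>/agents.yaml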
