Comments (6)
Any chance you can provide logs from kube-mgmt and OPA during one of the events where it drops to 0 and then recovers?
I'm primarily curious to see what kube-mgmt was doing when it dropped them to 0, and anything between then and pushing the data back. The OPA logs should help confirm when the API calls that cleared the data occurred.
from kube-mgmt.
OPA is spewing serious amounts of logging, so we generally do not keep more logs than what Kubernetes does (not much). But we do have the OPA internal metrics exposed. Looking at those, I see that calls to the webhook endpoint (POST on /) used by the API server stop at the same time the OPA data is cleared.
As you may know, the API server makes direct connections to OPA pods. For efficiency these connections are long-lived, and the API server tends to keep 1 or 2 such connections open. This is needed to interpret the following graph correctly, as there is no fair load balancing over all available OPA pods.
Interpreting the above graph:
- the orange line (pod "l5gj2") stops at 01:15
- the blue line (pod "b7w4g") stops at 03:45
- the orange line (pod "l5gj2") stops at 06:34
These times AND the pods correlate with the data drops:
So I am wondering:
- why does the API server stop using a specific pod? (just time to cycle, timeout or error?)
- what is the relationship between the API server closing a connection to a pod and the data being cleared from the same pod? (cause or effect)
- is there a race condition in OPA between a webhook call and a data update? (we saw this behaviour before our data metrics, so I am excluding that as a cause for now)
My current hypothesis: there is an issue in OPA (0.13.2) triggered by changes in kube-mgmt (0.8 vs 0.10).
I will dig deeper and try to find proof. Hopefully I have given you some breadcrumbs to zero in on a possible cause. Share any ideas and we'll find the root cause together.
from kube-mgmt.
Even though I have not caught errors in the OPA logs, I do see OPA containers being restarted within the pod (kube-mgmt is not restarted), so that indicates issues with OPA. And the fact that the API server picks another pod at the same time supports the theory that OPA containers panic.
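For anyone wanting to confirm the same thing on their own cluster, here is a minimal sketch that pulls restart counts and the last termination state out of a pod object, in the shape returned by `kubectl get pod <name> -o json`. The function name `restarted_containers` and the sample pod (names, timestamp, exit code) are hypothetical and only for illustration; the field names (`containerStatuses`, `restartCount`, `lastState.terminated`) are from the Kubernetes Pod status. An unrecovered Go panic normally exits with status 2, so an exit code of 2 would be consistent with the panic theory, though not proof of it.

```python
def restarted_containers(pod):
    """Return one record per container in the pod that restarted at least once."""
    records = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        if status.get("restartCount", 0) == 0:
            continue
        # lastState.terminated describes how the previous container instance ended.
        terminated = status.get("lastState", {}).get("terminated", {})
        records.append({
            "container": status["name"],
            "restarts": status["restartCount"],
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
            "finished_at": terminated.get("finishedAt"),
        })
    return records


# Hypothetical sample, shaped like `kubectl get pod -o json` output.
sample_pod = {
    "metadata": {"name": "opa-l5gj2"},
    "status": {
        "containerStatuses": [
            {
                "name": "opa",
                "restartCount": 3,
                "lastState": {
                    "terminated": {
                        "reason": "Error",
                        "exitCode": 2,
                        "finishedAt": "2019-10-01T01:15:00Z",
                    }
                },
            },
            {"name": "mgmt", "restartCount": 0, "lastState": {}},
        ]
    },
}

for record in restarted_containers(sample_pod):
    print(record)
```

Comparing `finished_at` against the times the lines in the graph stop would show whether the container restarts and the API server reconnects line up.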
So I am now testing with OPA 0.13.5 (which includes some panic fixes) and kube-mgmt 0.10.
🤞
from kube-mgmt.
Looking forward to hearing if 0.13.5 helps.
> I do see OPA containers being restarted within the pod (kube-mgmt is not restarted), so that indicates issues with OPA.

Another thing to check or rule out is that the liveness probe for OPA didn't time out. I have seen issues where OPA is under load and the liveness probe has too short a timeout configured. The end result is similar: the OPA pods restart for no immediately apparent reason.
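As a rough way to check the probe configuration, the sketch below reads the liveness probe settings for each container from a pod object as returned by `kubectl get pod <name> -o json`. The function name `liveness_probe_settings` and the sample pod are hypothetical; the fallback values are the Kubernetes defaults (`timeoutSeconds: 1`, `periodSeconds: 10`, `failureThreshold: 3`), which apply when a field is not set explicitly.

```python
def liveness_probe_settings(pod):
    """Map container name to its effective liveness probe timing settings."""
    settings = {}
    for container in pod.get("spec", {}).get("containers", []):
        probe = container.get("livenessProbe")
        if probe is None:
            continue  # this container has no liveness probe configured
        settings[container["name"]] = {
            # Fall back to the Kubernetes defaults when a field is omitted.
            "timeoutSeconds": probe.get("timeoutSeconds", 1),
            "periodSeconds": probe.get("periodSeconds", 10),
            "failureThreshold": probe.get("failureThreshold", 3),
        }
    return settings


# Hypothetical sample: an OPA container with a probe that only sets the period.
sample_pod = {
    "spec": {
        "containers": [
            {"name": "opa", "livenessProbe": {"periodSeconds": 5}},
            {"name": "mgmt"},
        ]
    }
}

print(liveness_probe_settings(sample_pod))
```

With the default one-second timeout, a container under load can fail the probe even though it is still working, which is the situation described above.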
from kube-mgmt.
OPA 0.13.5 with kube-mgmt 0.10 is still going strong after 10 hours. In this timespan we would have seen multiple API server reconnects (changes in which OPA pods handle the load) with OPA 0.13.2 and kube-mgmt 0.10, triggered by (we now know) OPA pod panics. No such thing today:
In our case the liveness probes can be ruled out. Failing probes are included in the `kubectl get event` stream and there are none.
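A minimal sketch of that check, assuming the events were exported with `kubectl get event -o json`: the kubelet reports failed probes as events with reason `Unhealthy`, so filtering on that reason for the OPA pods is enough. The function name `probe_failures`, the `opa` name prefix, and the sample events are hypothetical.

```python
def probe_failures(event_list, pod_prefix="opa"):
    """Return (timestamp, pod, message) for failed-probe events on matching pods."""
    failures = []
    for event in event_list.get("items", []):
        involved = event.get("involvedObject", {})
        if event.get("reason") != "Unhealthy":
            continue  # the kubelet uses reason "Unhealthy" for failed probes
        if involved.get("kind") != "Pod":
            continue
        if not involved.get("name", "").startswith(pod_prefix):
            continue
        failures.append(
            (event.get("lastTimestamp"), involved["name"], event.get("message", ""))
        )
    return failures


# Hypothetical sample, shaped like `kubectl get event -o json` output.
sample_events = {
    "items": [
        {
            "reason": "Unhealthy",
            "message": "Liveness probe failed: Get http://10.0.0.1:8181/: timeout",
            "lastTimestamp": "2019-10-01T01:15:00Z",
            "involvedObject": {"kind": "Pod", "name": "opa-l5gj2"},
        },
        {
            "reason": "Pulled",
            "message": "Container image already present on machine",
            "lastTimestamp": "2019-10-01T01:15:05Z",
            "involvedObject": {"kind": "Pod", "name": "opa-l5gj2"},
        },
    ]
}

print(probe_failures(sample_events))
```

One caveat: events are only retained for a limited time (the API server's `--event-ttl`, one hour by default), so an empty result only covers recent restarts.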
Still wondering why the panics were not part of the container logging. Maybe the logging is not flushed before process shutdown?
from kube-mgmt.
Closing this issue. Upgrading OPA from 0.13.2 to 0.13.5 fixed the panics.
from kube-mgmt.