
Comments (6)

patrick-east commented on May 24, 2024

Any chance you can provide logs from kube-mgmt and OPA during one of the events where it drops to 0 and then recovers?

I'm primarily curious to see what kube-mgmt was doing when the data dropped to 0, and anything between then and the point where it pushed the data back. The OPA logs should help confirm when the API calls that cleared the data occurred.

from kube-mgmt.

rtoma commented on May 24, 2024

OPA produces a large volume of logs, so we generally do not keep more than what Kubernetes retains (not much). But we do have the OPA internal metrics exposed. Looking at those, I see that calls to the webhook endpoint (POST on /) used by the API server stop at the same time the OPA data is cleared.

As you may know, the API server makes direct connections to OPA pods. For efficiency these connections are long-lived, and the API server tends to keep 1 or 2 such connections open. This is needed to interpret the following graph correctly, as there is no fair load balancing across all available OPA pods.
[image: graph of webhook calls per OPA pod]

Interpreting the above graph:

  • the orange line (pod "l5gj2") stops at 01:15
  • the blue line (pod "b7w4g") stops at 03:45
  • the orange line (pod "l5gj2") stops at 06:34

These times AND the pods correlate with the data drops:
[image: graph of the OPA data drops per pod]

So I am wondering:

  1. why does the API server stop using a specific pod? (just time to cycle, timeout or error?)
  2. what is the relationship between the API server closing a connection to a pod and the data being cleared from the same pod? (cause or effect)
  3. is there a race condition in OPA between a webhook call and a data update? (we saw this behaviour before we added our data metrics, so I am excluding those as a cause for now)

My current hypothesis: there is an issue in OPA (0.13.2) triggered by changes in kube-mgmt (0.8 vs 0.10).

I will dig deeper and try to find proof. Hopefully I have given you some breadcrumbs to zero in on a possible cause. Share any ideas and we'll find the root cause together.


rtoma commented on May 24, 2024

Even though I have not caught errors in the OPA logs, I do see OPA containers being restarted within the pod (kube-mgmt is not restarted), so that indicates issues with OPA. And the fact that the API server picks another pod at the same time supports the theory that OPA containers panic.

So I am now testing with OPA 0.13.5 (which includes some panic fixes) and kube-mgmt 0.10.
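The restart observation above can also be checked from the pod status rather than the logs. The following is a minimal sketch, assuming the input is the JSON printed by `kubectl get pods -o json`; the function name, the sample pod name, and the sample values are hypothetical and only illustrate the shape of the data. It reads the standard `restartCount` and `lastState.terminated` fields of each container status.

```python
def restarted_containers(pod_list: dict) -> list[dict]:
    """Return one record per container that has restarted at least once."""
    records = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            if status.get("restartCount", 0) == 0:
                continue
            # lastState.terminated describes the previous run of the container.
            terminated = (status.get("lastState") or {}).get("terminated") or {}
            records.append({
                "pod": pod_name,
                "container": status["name"],
                "restarts": status["restartCount"],
                "exit_code": terminated.get("exitCode"),
                "reason": terminated.get("reason"),
                "finished_at": terminated.get("finishedAt"),
            })
    return records


# Hypothetical sample data, not taken from the cluster discussed here.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {"name": "opa-example"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "opa",
                        "restartCount": 2,
                        "lastState": {
                            "terminated": {
                                "exitCode": 2,
                                "reason": "Error",
                                "finishedAt": "2000-01-01T00:00:00Z",
                            }
                        },
                    },
                    {"name": "kube-mgmt", "restartCount": 0, "lastState": {}},
                ]
            },
        }
    ]
}

if __name__ == "__main__":
    for record in restarted_containers(SAMPLE_POD_LIST):
        print(record)
```

With real data, the `finished_at` values could be compared against the times at which the graphs show a pod dropping out. An unrecovered panic in a Go program normally exits with status 2, so the exit code may help separate a panic from, for example, an out-of-memory kill, though it is not proof on its own.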

🤞


patrick-east commented on May 24, 2024

Looking forward to hearing if 0.13.5 helps.

"I do see OPA containers being restarted within the pod (kube-mgmt is not restarted), so that indicates issues with OPA."

Another thing to check or rule out is whether the liveness probe for OPA timed out. I have seen cases where OPA is under load and the liveness probe is configured with too short a timeout. The end result is similar: the OPA pods restart for no immediately apparent reason.


rtoma commented on May 24, 2024

OPA 0.13.5 with kube-mgmt 0.10 is still going strong after 10 hours. In this timespan we would have seen multiple API server reconnects (changes in which OPA pods handle the load) with OPA 0.13.2 and kube-mgmt 0.10, triggered by (we now know) OPA pod panics. No such thing today:
[image]

In our case the liveness probes can be ruled out. Failing probes show up in the kubectl get events stream, and there are none.
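The probe check described above can be scripted. This is a minimal sketch, assuming the input is the JSON printed by `kubectl get events -o json` and that probe failures are reported with reason `Unhealthy` and a message starting with "Liveness probe failed", which is how the kubelet usually records them. The function name and the sample events are hypothetical.

```python
def probe_failures(event_list: dict, probe: str = "Liveness") -> list[dict]:
    """Return the events that report a failed probe of the given kind."""
    prefix = f"{probe} probe failed"
    failures = []
    for event in event_list.get("items", []):
        if event.get("reason") != "Unhealthy":
            continue
        message = event.get("message", "")
        if not message.startswith(prefix):
            continue
        failures.append({
            "object": event.get("involvedObject", {}).get("name"),
            "count": event.get("count"),
            "last_seen": event.get("lastTimestamp"),
            "message": message,
        })
    return failures


# Hypothetical sample data, not taken from the cluster discussed here.
SAMPLE_EVENTS = {
    "items": [
        {
            "reason": "Unhealthy",
            "message": "Liveness probe failed: Get /health: context deadline exceeded",
            "involvedObject": {"name": "opa-example"},
            "count": 3,
            "lastTimestamp": "2000-01-01T00:00:00Z",
        },
        {
            "reason": "Pulled",
            "message": "Container image already present on machine",
            "involvedObject": {"name": "opa-example"},
            "count": 1,
            "lastTimestamp": "2000-01-01T00:00:00Z",
        },
    ]
}

if __name__ == "__main__":
    failures = probe_failures(SAMPLE_EVENTS)
    print(f"{len(failures)} liveness probe failure event(s)")
```

An empty result only rules out probe failures within the window for which the cluster still retains events, so the check is most useful shortly after a restart.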

Still wondering why the panics were not part of the container logging. Maybe the logging is not flushed before process shutdown?


rtoma commented on May 24, 2024

Closing this issue. Upgrading OPA from 0.13.2 to 0.13.5 fixed the panics.

