Unsure if this is an actual issue or if I should log separate issues for each of them.

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

kube-mgmt throwing errors when pod has been running for quite some time about kube-mgmt HOT 18 CLOSED

open-policy-agent commented on June 4, 2024 2

kube-mgmt throwing errors when pod has been running for quite some time

from kube-mgmt.

Comments (18)

tsandall commented on June 4, 2024 2

@muzcategui1106-gs great. There's a new openpolicyagent/kube-mgmt:0.11 image on Docker Hub that you can try out. Please let me know if this resolves the issue.

…

On Thu, Jan 16, 2020 at 9:00 AM muzcategui1106-gs ***@***.***> wrote: @tsandall <https://github.com/tsandall> yes when the release comes out, I will test out the changes in our dev environment — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#49?email_source=notifications&email_token=AAB2KJP3BTHEESWSFCIG7JDQ6BSBFA5CNFSM4ITROD52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJEE5OA#issuecomment-575164088>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAB2KJKDFUUVYGJO32UZ6NLQ6BSBFANCNFSM4ITROD5Q> .

-- -Torin

from kube-mgmt.

tsandall commented on June 4, 2024 2

We've confirmed that v0.11 fixes the logging issue. Since the logging problem and the resource sync issue have been resolved, I'm going to close this for now.

from kube-mgmt.

tsandall commented on June 4, 2024

@muzcategui1106-gs can you show an example of a configmap that's annotated with policy-status that should not be? From looking at the configmap replication code in kube-mgmt, it's not obvious how this could happen.

RE: The sync channel closure, that is to be expected with the Kubernetes client. However, the delay on upserting into OPA is NOT expected. I'll try to reproduce the problem.

One thing that might help would be to update the data replicator to use the PATCH method on OPA instead of DELETE followed by a series of PUTs. Here is the code in question: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147

from kube-mgmt.

muzcategui1106-gs commented on June 4, 2024

Sure I not allowed to show an entire configmcannot show the entire configmap but only the error that I get ( i changed the filename in the error message)

Version Info:
OPA: 0.14.0
Kube-mgmt: 0.9

Flags on kube-mgmt:
args: - "--replicate-cluster=v1/namespaces" - "--replicate-cluster=v1/persistentvolumes" - "--replicate=v1/persistentvolumeclaims" - "--replicate=v1/resourcequotas" - "--require-policy-label"
Error on configmap
openpolicyagent.org/policy-status: '{"status":"error","error":{"code":"invalid_parameter","message":"error(s) occurred while compiling module(s)","errors":[{"code":"rego_parse_error","message":"no match found, expected: [ \\t\\r\\n] or [A-Za-z_]","location":{"file":"opa/example.conf","row":7,"col":11}}]}}'

Regarding resource utilization I dont have any specification on my deployment about that. I should have bvut from what i see they pods dont seem to be starving for resources

NAME CPU(cores) MEMORY(bytes)
opa-5985f8cf78-kv45t 1571m 132Mi
opa-5985f8cf78-rbq4p 1m 151Mi
opa-5985f8cf78-rcrds 1369m 156Mi
policy-engine-6b6ffbb5f9-5kznf 914m 78Mi
policy-engine-6b6ffbb5f9-64l8d 503m 76Mi
policy-engine-6b6ffbb5f9-bhlht 329m 81Mi

What should be the recommended values in your opinion?

Regarding the code you mention https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147 I observed the same when I was looking from a performance point of view it does seem like a good idea to do PATCH as opposed to DELETE and PUTS

from kube-mgmt.

tsandall commented on June 4, 2024

@muzcategui1106-gs can you confirm whether the openpolicyagent.org/policy label is set on that configmap? I have a test deployment that is emitting the "Unable to decode..." error log but the configmaps are not being loaded incorrectly:

$ kubectl -n torin-test get configmap -o yaml
apiVersion: v1
items:
- apiVersion: v1
  data:
    x.txt: |
      hello
  kind: ConfigMap
  metadata:
    creationTimestamp: "2019-09-19T15:26:12Z"
    name: notpolicy
    namespace: torin-test
    resourceVersion: "7431454"
    selfLink: /api/v1/namespaces/torin-test/configmaps/notpolicy
    uid: d01934e1-daf1-11e9-ae65-024400571c2a
- apiVersion: v1
  data:
    x.rego: |-
      package foo

      p = 1
  kind: ConfigMap
  metadata:
    annotations:
      openpolicyagent.org/policy-status: '{"status":"ok"}'
    creationTimestamp: "2019-09-19T15:26:06Z"
    labels:
      openpolicyagent.org/policy: rego
    name: policy
    namespace: torin-test
    resourceVersion: "7431761"
    selfLink: /api/v1/namespaces/torin-test/configmaps/policy
    uid: cc69ae52-daf1-11e9-ae65-024400571c2a
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs:

E0923 13:10:34.000910       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:32:19.022065       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:49:55.032696       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
time="2019-09-23T13:51:44Z" level=info msg="Sync channel for v1/pods closed. Restarting immediately." 
time="2019-09-23T13:51:44Z" level=info msg="Syncing v1/pods." 
time="2019-09-23T13:51:44Z" level=info msg="Listed v1/pods and got 633 resources with resourceVersion 9361168. Took 294.467604ms." 
time="2019-09-23T13:51:45Z" level=info msg="Loaded v1/pods resources into OPA. Took 524.414201ms. Starting watch at resourceVersion 9361168."

from kube-mgmt.

muzcategui1106-gs commented on June 4, 2024

the openpolicyagent.org/policy is not set only the annotation

from kube-mgmt.

tsandall commented on June 4, 2024

@muzcategui1106-gs we've cut v0.10 which contains #50. This should reduce CPU usage during resync. As of yet, the configmap annotation problem is still unclear and I've not been able to reproduce it. Any hints you can provide would be helpful. Finally, @patrick-east has looked into the decode errors and has some thoughts.

from kube-mgmt.

muzcategui1106-gs commented on June 4, 2024

@tsandall this is awesome so hopefully this should reduce CPU uses and increase resource sync. Will give it a try.

Regarding the configmap sync let me see what I can dig in the next couple of days :)

on the meantime I will start running 0.10 in my dev environment

from kube-mgmt.

patrick-east commented on June 4, 2024

@patrick-east has looked into the decode errors and has some thoughts.

Right, so looking into it these type of errors occur if the scheme wasn't registered with our client (which happens over here: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/configmap/configmap.go#L104-L112 ). What isn't clear is what object came over the wire that it was unable to decode.

The current theory is that some object version/schema changed while kube-mgmt was running, or the client code we're using is out of date or has a bug in it and is missing part of the schema for some object.

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.

from kube-mgmt.

anderseknert commented on June 4, 2024

Having just installed OPA and kube-mgmt in our 3 dev clusters I am also seeing this in the logs for each environment (131 "Unable to decode.." messages last hour).

Also seem to be having some (possibly related?) issues syncing kubernetes resources, leading to data being seemingly unavailable to some of the OPA instances, which in turn leads to authorization decisions being allowed or denied based on which of the instances is consulted (the default of course set to deny).

Let me know if I can provide any data to help you look into this.

from kube-mgmt.

rtoma commented on June 4, 2024

Hi @anderseknert, please add more details. Like: versions of OPA/kube-mgmt, the startup arguments and the full error.

from kube-mgmt.

anderseknert commented on June 4, 2024

Hi @rtoma, and thanks for reaching out. Here are some of the details. Let me know if you need more.

OPA image: docker.io/openpolicyagent/opa:0.14.2
Startup arguments run --server --authentication=token --authorization=basic --ignore=.* /policies/authz.rego

kube-mgmt image: openpolicyagent/kube-mgmt:0.10
Startup arguments: --opa-auth-token-file=/policies/token --require-policy-label=true --replicate=v1/services --replicate=v1/pods --replicate=v1/configmaps --replicate=v1/persistentvolumeclaims --replicate=apps/v1/deployments --replicate=apps/v1/statefulsets --replicate=autoscaling/v1/horizontalpodautoscalers --replicate=batch/v1/jobs --replicate=batch/v1beta1/cronjobs --replicate=extensions/v1beta1/ingresses --replicate=extensions/v1beta1/replicasets --replicate=networking.k8s.io/v1/networkpolicies

All in all the payload returned when hitting /v1/data/ is about 25MB of JSON so it's definitely syncing - it just seems to have intermittent problems on some of the instances.

Full error:
E1030 18:02:45.864525 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"

from kube-mgmt.

rtoma commented on June 4, 2024

Hi, our kube-mgmt logging is full of these errors without functional impact. Which is confirmed by:

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.

So maybe focus on the 'issues syncing kubernetes resources' which clearly has a functional impact. Maybe it is an idea to create a script that makes /v1/data calls on all pods every minute and compare the results? We've created a metrics collector (which I can not share) for this purpose. Another suggestion because it seems you want to seriously use OPA: develop an OPA regression tester that periodically POSTs artificial AdmissionReview payloads (extract from OPA's decision log) to the OPA webhook endpoint and match the result against the expected results. That way you can verify OPA and its policies are behaving as you expect.

from kube-mgmt.

anderseknert commented on June 4, 2024

Thanks @rtoma. Yeah we are definitely looking at using OPA at a larger scale. So far I have some tests in place, and more definitely to come - though currently they all point to an ingress controller just like the actual authorization webhook that we're looking at using OPA for as our first proof of concenpt. It's from these tests I've noticed different results on re-running tests though nothing in the kubernetes data actually changed during that time frame. Will see about running tests targeting the pods directly, thanks for the pointer. Even if I can verify the sync issue I'm not sure how to proceed with that from there though.

from kube-mgmt.

anderseknert commented on June 4, 2024

So my tests work something like this:

In the setup, create and deploy a kubernetes resource.
Verify that I can run policy queries targeting OP with aforementioned resource as data.
Teardown, delete kubernetes resource.

Sleeping 60 seconds between 1 and 2 seems to solve the problem I had with data getting out of sync, so this seems like a problem with my tests rather than with kube-mgmt. Will extend the test suite to also target individual pods, but for now this seems solved for me.

The error message originally reported is still very much present and annoying, but it did not cause sync issues.

from kube-mgmt.

patoarvizu commented on June 4, 2024

Will anything be done about the no kind "Status" is registered for version "v1" error? It's kind of confusing and can make debugging harder.

Is it an issue with the Kubernetes version? I'm seeing it on both a 1.12 and a 1.15 cluster.

from kube-mgmt.

tsandall commented on June 4, 2024

I took another look into the no kind "Status" is registered error message and found (what I think is) a solution. I've posted a PR w/ the one-line fix #58. I'll merge this fix later today and cut a release.

@muzcategui1106-gs if you can test this out, that would be great.

from kube-mgmt.

muzcategui1106-gs commented on June 4, 2024

@tsandall yes when the release comes out, I will test out the changes in our dev environment

from kube-mgmt.

kube-mgmt throwing errors when pod has been running for quite some time about kube-mgmt HOT 18 CLOSED

Comments (18)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent