Giter VIP home page Giter VIP logo

Comments (16)

wallrj avatar wallrj commented on June 5, 2024 1

I am more in favor of using a non-caching client when reconciling certificaterequests in approver-policy.

Yes, I agree that would solve the problem.
You'd GET the CertificateRequest using a non-caching client, here:

cr := new(cmapi.CertificateRequest)
if err := c.lister.Get(ctx, req.NamespacedName, cr); err != nil {
return ctrl.Result{}, nil, client.IgnoreNotFound(err)
}

Ignoring a custom error message will never be clean, and quite fragile. If anyone changes the literal error message in the cert-manager webhook, this will break again. The HTTP status code is just Forbidden - which could mean many things.

The full error message is:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "admission webhook \"webhook.cert-manager.io\" denied the request: status.conditions: Forbidden: 'Approved' condition may not be modified once set",
  "reason": "NotAcceptable",
  "code": 406
}

Given the following patch.json:

{
    "apiVersion": "cert-manager.io/v1",
    "kind": "CertificateRequest",
    "status": {
        "conditions": [
            {
                "message": "new message",
                "type": "Approved"
            }
        ]
    }
}
curl 'https://127.0.0.1:44121/apis/cert-manager.io/v1/namespaces/default/certificaterequests/foo/status?fieldManager=cmctl&force=true' \
     --silent \
     --insecure \
     --header "Accept: application/json" \
     --header "Authorization: Bearer $(kubectl create token -n cert-manager cert-manager)" \
     --header "Content-Type: application/apply-patch+yaml" \
     --request PATCH \
     --data @patch.json

So that error response might be unique and stable enough to reliably discern it from other errors.
I'd probably choose this over adding a non-caching client because the code solving the problem would be easy to understand and close to the site of the problem.

The configuration of the non-caching client would be here, I suppose, which is a further away from where the site of the problem:

mgr, err := ctrl.NewManager(opts.RestConfig, ctrl.Options{
Scheme: policyapi.GlobalScheme,
LeaderElection: true,
LeaderElectionID: "policy.cert-manager.io",
LeaderElectionReleaseOnCancel: true,
LeaderElectionResourceLock: "leases",
LeaderElectionNamespace: opts.LeaderElectionNamespace,
ReadinessEndpointName: "/readyz",
HealthProbeBindAddress: opts.ReadyzAddress,
Metrics: server.Options{
BindAddress: opts.MetricsAddress,
},
WebhookServer: ctrlwebhook.NewServer(ctrlwebhook.Options{
Port: opts.Webhook.Port,
Host: opts.Webhook.Host,
TLSOpts: []func(*tls.Config){
func(cfg *tls.Config) {
cfg.GetCertificate = certificateSource.GetCertificate
},
},
}),
Logger: mlog,
})

Another option might be to use the cached CR resourceVersion in the patch request, which would cause a HTTP conflict error if the cached CR had since been modified by cert-manager or by an external issuer,
approver-policy could handle the conflict by requeuing the CR, by returning an error and relying on the error backoff mechanism or by explicitly requeuing the CR.
Some future reconcile attempt would eventually succeed once the approver-policy cache had the latest version of the CR.

E.g. the following json patch applied using the previous curl command:

{
    "apiVersion": "cert-manager.io/v1",
    "kind": "CertificateRequest",
    "metadata": {
        "resourceVersion": "999"
    },
    "status": {
        "conditions": [
            {
                "message": "new message",
                "type": "Approved"
            }
        ]
    }
}

Gives:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Operation cannot be fulfilled on certificaterequests.cert-manager.io \"foo\": the object has been modified; please apply your changes to the latest version and try again",
  "reason": "Conflict",
  "details": {
    "name": "foo",
    "group": "cert-manager.io",
    "kind": "certificaterequests"
  },
  "code": 409
}

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024 1

To all following this issue, I can report that it seems to work a lot better now! 🎉 I had a feeling the fix would not matter, but it has an effect....

from approver-policy.

SgtCoDFish avatar SgtCoDFish commented on June 5, 2024

See also this slack thread

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Based on our certificate renewal policy, the frequency of this error should have decreased if this was a transient error. This is not the case, so I suspect this to be a real bug.

It seems like cert-manager/cert-manager#3735 introduced this strict validation in cert-manager: https://github.com/cert-manager/cert-manager/blob/5141dddf2c0c5e10c5d4452c99a2e260c6eb2983/internal/apis/certmanager/validation/certificaterequest.go#L202-L214. I wonder if the validation was intentionally made extremely strict (DeepEqual), or if the validation could be loosened up a bit? I am pretty sure it's not vital to validate that conditions do not change at all, or?

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Looking closer at the managedFields in the example request, I wonder if this might be a bug in cert-manager:

  managedFields:
    - apiVersion: cert-manager.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            'k:{"type":"Approved"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
      manager: approver-policy
      operation: Apply
      subresource: status
      time: '2023-10-30T13:45:09Z'
    - apiVersion: cert-manager.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:ca': {}
          'f:certificate': {}
          'f:conditions':
            'k:{"type":"Approved"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
      manager: cert-manager-certificaterequests-issuer-vault
      operation: Apply
      subresource: status
      time: '2023-10-30T13:45:09Z'

I do not expect the cert-manager Vault issuer to approve certificate requests, but it seems like it (cert-manager-certificaterequests-issuer-vault) did from the managed fields.

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Checking if cert-manager is configured correctly, since I am not expecting cert-manager to approve any requests:

I1103 11:05:17.741662 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificatesigningrequests-issuer-vault"
I1103 11:05:17.742502 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificatesigningrequests-issuer-acme"
I1103 11:05:17.742515 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="ingress-shim"
I1103 11:05:17.742521 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificaterequests-approver"
I1103 11:05:17.743122 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-metrics"
I1103 11:05:17.743150 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-trigger"
I1103 11:05:17.743166 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-key-manager"
I1103 11:05:17.743181 1 controller.go:215] "cert-manager/controller: starting controller" controller="issuers"
I1103 11:05:17.743197 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-revision-manager"
I1103 11:05:17.743214 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-issuing"
I1103 11:05:17.743319 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-request-manager"
I1103 11:05:17.746431 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificatesigningrequests-issuer-ca"
I1103 11:05:17.746998 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificatesigningrequests-issuer-selfsigned"
I1103 11:05:17.747009 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="certificatesigningrequests-issuer-venafi"
I1103 11:05:17.747549 1 controller.go:192] "cert-manager/controller: not starting controller as it's disabled" controller="gateway-shim"
I1103 11:05:17.748065 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificaterequests-issuer-selfsigned"
I1103 11:05:17.748113 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificaterequests-issuer-vault"
I1103 11:05:17.748131 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificaterequests-issuer-venafi"
I1103 11:05:17.748147 1 controller.go:215] "cert-manager/controller: starting controller" controller="clusterissuers"
I1103 11:05:17.748159 1 controller.go:215] "cert-manager/controller: starting controller" controller="challenges"
I1103 11:05:17.748174 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificates-readiness"
I1103 11:05:17.748190 1 controller.go:215] "cert-manager/controller: starting controller" controller="orders"
I1103 11:05:17.748632 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificaterequests-issuer-acme"
I1103 11:05:17.748656 1 controller.go:215] "cert-manager/controller: starting controller" controller="certificaterequests-issuer-ca"

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Here is a newer request from today, which I think proves that there is a bug in cert-manager: the cert-manager Vault issuer sets/owns the Approved condition - with a timestamp after approver-policy approval. No idea why the error ends up in approver-policy tho. I cannot find any related errors/warnings in cert-manager logs.

apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
  annotations:
    cert-manager.io/certificate-name: prefixverification-forecast-selector
    cert-manager.io/certificate-revision: '7'
    cert-manager.io/private-key-secret-name: prefixverification-forecast-selector-9qvxg
  resourceVersion: '3086765553'
  name: prefixverification-forecast-selector-7
  uid: 05ae7f2b-8672-4185-87d5-bc5e6ba1cf4e
  creationTimestamp: '2023-11-05T09:57:44Z'
  generation: 1
  managedFields:
    - apiVersion: cert-manager.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions':
            'k:{"type":"Approved"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
      manager: approver-policy
      operation: Apply
      subresource: status
      time: '2023-11-05T09:57:44Z'
    - apiVersion: cert-manager.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:ca': {}
          'f:certificate': {}
          'f:conditions':
            'k:{"type":"Approved"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
            'k:{"type":"Ready"}':
              .: {}
              'f:lastTransitionTime': {}
              'f:message': {}
              'f:reason': {}
              'f:status': {}
              'f:type': {}
      manager: cert-manager-certificaterequests-issuer-vault
      operation: Apply
      subresource: status
      time: '2023-11-05T09:57:45Z'
    - apiVersion: cert-manager.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:annotations':
            .: {}
            'f:cert-manager.io/certificate-name': {}
            'f:cert-manager.io/certificate-revision': {}
            'f:cert-manager.io/private-key-secret-name': {}
          'f:labels':
            .: {}
            'f:app.kubernetes.io/managed-by': {}
            'f:app.kubernetes.io/name': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"fc04c6eb-d67e-466c-93f2-0b0fa6a12f9f"}': {}
        'f:spec':
          .: {}
          'f:duration': {}
          'f:issuerRef':
            .: {}
            'f:group': {}
            'f:kind': {}
            'f:name': {}
          'f:request': {}
          'f:usages': {}
      manager: cert-manager-certificates-request-manager
      operation: Update
      time: '2023-11-05T09:57:44Z'
  namespace: preprod-fifty-for-imbalance-forecast-postprocessing
  ownerReferences:
    - apiVersion: cert-manager.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: Certificate
      name: prefixverification-forecast-selector
      uid: fc04c6eb-d67e-466c-93f2-0b0fa6a12f9f
  labels:
    app.kubernetes.io/managed-by: identity-provider
    app.kubernetes.io/name: prefixverification-forecast-selector
spec:
  duration: 144h0m0s
  extra:
    authentication.kubernetes.io/pod-name:
      - cert-manager-5f86f8b489-6k548
    authentication.kubernetes.io/pod-uid:
      - f5d36458-dfb1-458e-8c79-31c78c87a0ea
  groups:
    - 'system:serviceaccounts'
    - 'system:serviceaccounts:cert-manager'
    - 'system:authenticated'
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: vault-spiffe-issuer
  request: >-
    <REDACTED>
  uid: 256a8935-1197-4f9a-915d-5dfd38d5f0c6
  usages:
    - client auth
  username: 'system:serviceaccount:cert-manager:cert-manager'
status:
  ca: >-
    <REDACTED>
  certificate: >-
    <REDACTED>
  conditions:
    - lastTransitionTime: '2023-11-05T09:57:44Z'
      message: 'Approved by CertificateRequestPolicy: "vault-spiffe"'
      reason: policy.cert-manager.io
      status: 'True'
      type: Approved
    - lastTransitionTime: '2023-11-05T09:57:45Z'
      message: Certificate fetched from issuer successfully
      reason: Issued
      status: 'True'
      type: Ready

from approver-policy.

wallrj avatar wallrj commented on June 5, 2024

@erikgb I looked at the code and my theory is that there are two problems:

  1. approver-policy sometimes attempts to approve an already approved CertificateRequest

    I'm hopeless at describing these sorts of interactions, but bear with me:

    a. cert-manager creates CR v0
    b. cert-manager and approver-policy both reconcile CR v0 in parallel:

    1. approver-policy begins reconciling CR v0 and performs policy checks , meanwhile
    2. cert-manager reconciles CR v0 and adds the Ready condition with status False,
    3. approver-policy cache receives the updated CR v1 with Ready=False and re-queues CR v1 to be re-reconciled
    4. approver-policy completes policy checks and sets Approved condition.

    c. approver-policy reconciles CR v1

    1. approver-policy completes policy checks and again attempts to set Approved condition
    2. cert-manager webhook denies the request and approver-policy logs the error

    d. approver-policy cache receives the updated CR v2 with Approved=True, Ready=False and re-queues CR v2 to be re-reconciled
    e. approver-policy ignores CR v2 because it already has an Approved condition.

One solution would be to modify approver-policy to expect this situation and ignore the error, let's call it ApprovedConditionAlreadySet, in much the same way that we might IgnoreNotFound or IgnoreAlreadyExists errors.

  1. cert-manager server side apply is setting (but not changing) the Approved condition and in so doing, taking joint ownership of that field.

    I haven't looked into this, but I would guess that cert-manager is creating a server-side apply patch which contains the existing conditions from the CR, rather than only the Ready condition.

from approver-policy.

inteon avatar inteon commented on June 5, 2024

@erikgb I looked at the code and my theory is that there are two problems:

  1. approver-policy sometimes attempts to approve an already approved CertificateRequest

    I'm hopeless at describing these sorts of interactions, but bear with me:

    a. cert-manager creates CR v0
    b. cert-manager and approver-policy both reconcile CR v0 in parallel:

    1. approver-policy begins reconciling CR v0 and performs policy checks , meanwhile
    2. cert-manager reconciles CR v0 and adds the Ready condition with status False,
    3. approver-policy cache receives the updated CR v1 with Ready=False and re-queues CR v1 to be re-reconciled
    4. approver-policy completes policy checks and sets Approved condition.

    c. approver-policy reconciles CR v1

    1. approver-policy completes policy checks and again attempts to set Approved condition
    2. cert-manager webhook denies the request and approver-policy logs the error

    d. approver-policy cache receives the updated CR v2 with Approved=True, Ready=False and re-queues CR v2 to be re-reconciled
    e. approver-policy ignores CR v2 because it already has an Approved condition.

One solution would be to modify approver-policy to expect this situation and ignore the error, let's call it ApprovedConditionAlreadySet, in much the same way that we might IgnoreNotFound or IgnoreAlreadyExists errors.

  1. cert-manager server side apply is setting (but not changing) the Approved condition and in so doing, taking joint ownership of that field.

    I haven't looked into this, but I would guess that cert-manager is creating a server-side apply patch which contains the existing conditions from the CR, rather than only the Ready condition.

I don't think the scenario you are describing is possible because each resource can only be reconciled one at a time

from approver-policy.

wallrj avatar wallrj commented on June 5, 2024

I don't think the scenario you are describing is possible because each resource can only be reconciled one at a time

Yes, and that is how it is in the scenario I described.

approver-policy is re-reconciling the CR at v1 before its cache has been updated, with the result of its previous reconcile.
I.e. approver-policy cache contains CR with the condition Ready=False, but not yet Approved=true/false.

from approver-policy.

inteon avatar inteon commented on June 5, 2024

I don't think the scenario you are describing is possible because each resource can only be reconciled one at a time

Yes, and that is how it is in the scenario I described.

approver-policy is re-reconciling the CR at v1 before its cache has been updated, with the result of its previous reconcile.
I.e. approver-policy cache contains CR with the condition Ready=False, but not yet Approved=true/false.

Thanks for explaining. I would expect the patch to be identical, so the webhook shouldn't complain about the condition changing, right?

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Thanks for explaining. I would expect the patch to be identical, so the webhook shouldn't complain about the condition changing, right?

The cache is only updated from watch push events, ref. kubernetes-sigs/controller-runtime#1622, so if the cache is not updated from the previous round of reconciliation, the patch will not be identical. The lastTransitionTime field is the culprit.

Thinking a bit more about the suggested solution from @wallrj, I am more in favor of using a non-caching client when reconciling certificaterequests in approver-policy. Ignoring a custom error message will never be clean, and quite fragile. If anyone changes the literal error message in the cert-manager webhook, this will break again. The HTTP status code is just Forbidden - which could mean many things.

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Wow! 👏 Thanks for your deep analysis and suggestions, @wallrj! I didn't know you could use resourceVersion with SSA as with CSA. The difference is probably that it's required with CSA, but optional with SSA. I am more in favor of your last suggestion now. It will address the root cause of this issue. Handling the error from cert-manager is just a workaround IMO. WDYT? @inteon

from approver-policy.

inteon avatar inteon commented on June 5, 2024

I'm confused, does #303 not solve this issue? Do we still get these errors after merging that PR?

from approver-policy.

erikgb avatar erikgb commented on June 5, 2024

Confused? Why? I think it might solve this issue, but we won't know for sure. I do not have the intention to build and provision a snapshot version in our busiest cluster.

from approver-policy.

wallrj avatar wallrj commented on June 5, 2024

@inteon Yeah, I realise that I was over complicating...I hadn't looked carefully at the code. I still think the situation I described can happen. It will be interesting to see whether @erikgb observes these errors after he deploys this fix (albeit fewer of them).

And still think that there is a bug in the cert-manager server-side apply code which causes it to take joint field ownership of the approval conditions.

from approver-policy.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.