
Comments (4)

mlb5000 commented on August 16, 2024

I should note that

  1. This affects ALL nodes in the cluster
  2. I have restarted microk8s on all nodes using sudo snap restart microk8s, but it did not fix anything

from microk8s.

mlb5000 commented on August 16, 2024

Ok, so I managed to isolate the corrupted deployment configuration. Somehow there is a corrupted protocol buffer in the dqlite database.

Isolate the corrupted service

On any of the nodes, run

sudo /snap/microk8s/current/bin/dqlite \
  --cert /var/snap/microk8s/current/var/kubernetes/backend/cluster.crt \
  --key /var/snap/microk8s/current/var/kubernetes/backend/cluster.key \
  --servers file:///var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml \
  k8s

Then in dqlite run

dqlite> select name from kine where name like '%deployments/default%';

I then copied the deployment names, dropped them into Sublime Text, and created a script with one line per deployment, each looking like this:

echo "search" && microk8s kubectl get deployments/search-worker -o yaml | grep "apiVersion:"

This will error on the specific deployment that is causing the problem, and print apiVersion: apps/v1 for everything else.
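The same check can be written as a loop rather than a hand-built script. This is only a sketch: the function name find_bad_deployments and the KUBECTL override variable are made up for illustration, and it assumes the deployment names from the dqlite query are supplied one per line on stdin.

```shell
# KUBECTL is a hypothetical override, handy for dry runs with a stub command.
KUBECTL="${KUBECTL:-microk8s kubectl}"

# Reads one deployment name per line on stdin and prints the names
# whose YAML cannot be fetched (no "apiVersion:" line comes back).
find_bad_deployments() {
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # </dev/null keeps the command from swallowing the list of names on stdin.
    if ! $KUBECTL get "deployments/$name" -o yaml </dev/null 2>/dev/null \
        | grep -q "apiVersion:"; then
      echo "BAD: $name"
    fi
  done
}

# Usage (names are placeholders):
#   printf '%s\n' search-worker another-deployment | find_bad_deployments
```

Only the failing deployments are printed, so a clean cluster produces no output.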

View the configuration

Back in dqlite, grab the BLOB data for the bad registry entry, and the BLOB data for a good record while you're at it. The data shows up as a sequence of decimal byte values, most of which are ASCII character codes.

The bad record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 16 118 49, the latter part of which reads as apps\x10v1.

The good record's data starts with 107 56 115 0 10 21 10 7 97 112 112 115 47 118 49, the latter part of which reads as apps/v1, which is what we want.
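To make the difference between the two prefixes easier to see, the decimal bytes can be turned back into text. The helper below is a small sketch (decode_bytes is a made-up name, not part of dqlite or microk8s): it prints printable ASCII as-is and everything else as a \xNN escape.

```shell
# Prints each decimal byte as its ASCII character when printable (32-126),
# otherwise as a \xNN hex escape.
decode_bytes() {
  out=""
  for b in "$@"; do
    if [ "$b" -ge 32 ] && [ "$b" -le 126 ]; then
      # Convert the decimal value to octal, then let printf emit that character.
      ch=$(printf "\\$(printf '%03o' "$b")")
      out="$out$ch"
    else
      out="$out$(printf '\\x%02x' "$b")"
    fi
  done
  printf '%s\n' "$out"
}

# Bad record prefix: prints k8s\x00\x0a\x15\x0a\x07apps\x10v1
decode_bytes 107 56 115 0 10 21 10 7 97 112 112 115 16 118 49
# Good record prefix: prints k8s\x00\x0a\x15\x0a\x07apps/v1
decode_bytes 107 56 115 0 10 21 10 7 97 112 112 115 47 118 49
```

The only difference is the byte between apps and v1: 16 (\x10) in the bad record, 47 (/) in the good one.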

There doesn't appear to be any other corruption in here, but even if there is, it's this first part of this protocol buffer that I need to fix. Then I can just delete and recreate the deployment through the API as expected.

Basically, I either need to patch that 16 with a 47 in the dqlite database, or find a way to remove that Registry entry. However, I'm not sure how to do this in a way where the change will propagate to the other nodes like it's supposed to.


mlb5000 commented on August 16, 2024

Explicitly deleting that record in the dqlite database unstuck the deployment lifecycle across the entire cluster, and things are now back in working order.
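For anyone hitting the same problem, the statements involved would look roughly like the output of the sketch below. Everything here is an assumption rather than the exact commands that were run: the helper name kine_statements is invented, the value column name is assumed from the kine schema, and the key passed in should be copied from the output of the select name query rather than guessed. The helper only prints SQL for pasting into the dqlite shell; it does not execute anything.

```shell
# Prints (does not run) the SQL to inspect and then delete one kine key.
# Assumptions: the kine table stores the key in "name" and the BLOB in "value".
kine_statements() {
  key="$1"
  # Look at the record first to confirm it is the corrupted one.
  printf "SELECT name, value FROM kine WHERE name = '%s';\n" "$key"
  # Then remove every row stored under that key.
  printf "DELETE FROM kine WHERE name = '%s';\n" "$key"
}

# Usage (the key is a placeholder; use the exact name returned by dqlite):
#   kine_statements '<name copied from the select output>'
```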

However, someone from the microk8s team should look into this, since it feels very wrong to me that a corrupted protocol buffer could ever find its way into the dqlite database, especially when the corruption completely knocks out basic reliability/recovery functionality.


mlb5000 commented on August 16, 2024

Basically, the root cause here seems to be the dqlite record being persisted with a resource type + version combination that does not exist in kubectl api-resources.

Feels like the solution here is two-fold

  1. the resource KIND + APIVERSION combination should be validated prior to persistence
  2. apiserver should be updated to be more resilient to record corruption like this. A single unreadable deployment record should not prevent things like list commands from succeeding.
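As a rough illustration of point 1, a purely syntactic check would already have rejected the corrupted value. This is a sketch, not how apiserver validates anything: is_valid_api_version is a made-up helper and the pattern is a simplified guess at the group/version shape (an optional DNS-style group, a slash, then a version such as v1 or v1beta1). It does not check that the combination actually exists in kubectl api-resources.

```shell
# Returns 0 when the argument looks like "version" or "group/version".
# Simplified pattern for illustration only.
is_valid_api_version() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq \
    '^([a-z0-9]([-a-z0-9.]*[a-z0-9])?/)?v[0-9]+((alpha|beta)[0-9]+)?$'
}

# The corrupted value has byte 16 (octal 020) where the slash should be.
corrupted=$(printf 'apps\020v1')

is_valid_api_version "apps/v1"    && echo "apps/v1: ok"
is_valid_api_version "$corrupted" || echo "corrupted value: rejected"
```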

I don't know if microk8s has its own apiserver implementation, or if this issue really belongs in the Kubernetes mainline, but a single corrupted byte in a single record in the dqlite database shouldn't have such an outsized effect on the platform.
