
Comments (10)

aaronlevy commented on May 18, 2024

@dghubble Can you try increasing the log verbosity on the api-server (--v=10)?

Based on that retry message, the apiserver should not be exiting (just in a loop which tries to bind on that interface):

https://github.com/kubernetes/kubernetes/blob/v1.2.4/pkg/genericapiserver/genericapiserver.go#L716

If it does exit, and there's only one api-server (which is true for now), this could lead to a race... but the kubelet should still be able to recover locally from that (loss of the api-server after the pod has been scheduled).
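
If it helps while debugging, one way to capture those logs on the master (assuming the self-hosted apiserver runs under docker and there is a single matching container; the name filter is illustrative):

```sh
# Find the apiserver container and follow its logs; with --v=10 set in the
# apiserver pod manifest, the bind/retry messages should show up here.
docker ps -a | grep apiserver
docker logs -f $(docker ps -q --filter name=apiserver) 2>&1 | tee apiserver.log
```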


aaronlevy commented on May 18, 2024

Also, are you giving it any time to recover? The sync loop may take a bit (but again ... it shouldn't be exiting based on that log message anyway).


dghubble commented on May 18, 2024

It ran for a few hours without recovering. I'll add the verbosity flag in the future when I have time to provision a bunch of clusters.


dghubble commented on May 18, 2024

These cluster crashes still happen with dee39f8 in regular use, typically within the first few minutes after the generated pod manifests start. Validating by spinning up a Vagrant VM and checking that the desired containers are present will mask how unreliable it is to keep the cluster running.

This is especially hard to debug because docker logs are cleared within a few minutes of a cluster crash. As for causes, it's hard to separate the noise of the "regular" failures from what matters without more knowledge of Kubernetes. I've seen the api port binding error, I've seen TLS handshakes work for a time and then seemingly start failing, etc. In one unrelated misconfiguration, kube-proxy could not find /etc/kubernetes/kubeconfig (I was experimenting with mounting it from other locations, bad idea). Surprisingly, the inability to start the proxy seemed to precipitate the cluster crash, even though the apiserver was running.

I've tested various bootkube executables and generated files, but there is too much going on here for me to make sense of or spend more time on right now. It may be advisable for bootkube start to wait until the generated pod manifests have been running successfully for a time before exiting.
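
A minimal sketch of what such a wait could look like, polling from the node where bootkube start runs (purely illustrative: bootkube has no such behavior today, and the kubeconfig path, timeout, and status check are all assumptions):

```sh
#!/bin/bash
set -o pipefail
# Poll for up to ~5 minutes until every kube-system pod reports Running
# (STATUS is column 3 of `kubectl get pods`); only then exit successfully.
for _ in $(seq 1 60); do
  if kubectl --kubeconfig=/etc/kubernetes/kubeconfig get pods \
       --namespace=kube-system \
     | awk 'NR > 1 && $3 != "Running" {bad=1} END {exit (NR < 2 || bad) ? 1 : 0}'; then
    echo "all kube-system pods are Running"
    exit 0
  fi
  sleep 5
done
echo "timed out waiting for kube-system pods" >&2
exit 1
```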


aaronlevy commented on May 18, 2024

> Validating by spinning up a Vagrant VM and checking that the desired containers are present will mask how unreliable it is to keep the cluster running.

Can you provide more details? I'm not really sure what you mean here.

> This is especially hard to debug because docker logs are cleared within a few minutes of a cluster crash

The container GC should be configurable if you want it to wait longer:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L190
https://github.com/coreos/bootkube/blob/master/pkg/asset/templates/kubelet.yaml#L34
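
For example, kubelet flags along these lines would keep dead containers (and their logs) around longer; the flag names come from the options.go linked above, while the values are illustrative rather than recommendations:

```sh
# Slow down container GC so logs from crashed kube-* containers survive
# long enough to inspect (other kubelet flags omitted).
kubelet \
  --minimum-container-ttl-duration=10m \
  --maximum-dead-containers-per-container=5 \
  --maximum-dead-containers=240
```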

> I've seen TLS handshakes work for a time and then seemingly start failing.
> [...] kube-proxy could not find /etc/kubernetes/kubeconfig

These both sound like misconfiguration of the templates. Do you have something that is reproducible? If so, we can go over the configs for potential problems.

> It may be advisable for bootkube start to wait until the generated pod manifests have been running successfully for a time before exiting.

Possibly, but I'm not sure "wait longer" is the right answer here. If components are unable to start due to misconfiguration, they're never going to start no matter how long we wait. And if it's a race, we should identify it; waiting longer likely won't help the issue.

Unfortunately, I've still not run into this issue (in any recent iteration of this tool).


dghubble commented on May 18, 2024

> Can you provide more details? I'm not really sure what you mean here.

I mean that typically I'll see kubectl commands succeed for a time: they return nodes, show containers starting, and otherwise seem OK. Over the next few minutes, the cluster can become inaccessible. Spinning up a Vagrant VM, checking once that the kube-* pods are running, and then destroying it may miss what happens on real clusters. I'd leave it running for a bit, launch some pods, etc.

OK, increasing the GC period is a change I'll make on our deploys to improve debuggability when this happens.

> Do you have something that is reproducible?

No, it's random. It still occurs with the updates in poseidon/matchbox#237 and through an internal workflow (I'll link you offline). The hack examples use Vagrant/VirtualBox, so at some point I may need to set that up and reproduce using your flow, or have one of you on Linux verify with the libvirt or bare-metal flow.

> These both sound like misconfiguration of the templates.

Yes, /etc/kubernetes/kubeconfig was an unrelated mounting experiment I was doing. The surprising bit is that the repeated (in that case, expected) failures to launch the proxy seemed, in the logs, to precede the apiserver crash. Perhaps it's a red herring.


aaronlevy commented on May 18, 2024

We are testing more than just spinning up / immediately shutting down. See conformance tests:
https://github.com/coreos/bootkube#conformance-tests

Also, I haven't run into this issue against GCE clusters: #55

Now, somewhat related: if you do lose all api-servers, there is currently no recovery method. The current options are multiple api-servers (which must be load-balanced or behind DNS), or checkpointing the api-server pod locally, which is what @derekparker has been working on as an interim solution: #50

But this still isn't really about bootstrapping per se... your apiservers shouldn't just immediately fail. Reboots, docker failing, etc. could, however, cause the self-hosted api-server to not recover (assuming a single copy). Along those lines: could it be that your nodes are rebooting for a CoreOS update right after first boot?
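
One quick way to rule that out on a Container Linux node (update-engine and locksmithd are the standard CoreOS update services):

```sh
# Check whether update-engine fetched an update or triggered a reboot:
journalctl -u update-engine --no-pager | tail -n 50
# Temporarily hold updates while debugging:
sudo systemctl stop update-engine locksmithd
```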


dghubble commented on May 18, 2024

Auto-updates would cause the apiserver to fail, but so far that hasn't been the cause. As soon as I see kubectl commands failing, I SSH in and check docker ps -a, which shows a bunch of exited kube-* containers. If the clusters lasted more than a few minutes, the master node would eventually find and download a CoreOS update (now that there is a newer alpha than 1053.0.0), but we don't get that far.
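
When this happens, it may be worth saving the logs immediately, before the kubelet garbage-collects the exited containers. A rough sketch:

```sh
# Save the logs of every exited container before they are GC'd, and note
# which kube-* pieces died.
mkdir -p /tmp/crash-logs
for id in $(docker ps -aq --filter status=exited); do
  docker logs "$id" > "/tmp/crash-logs/$id.log" 2>&1
done
docker ps -a --filter status=exited | grep kube-
```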

I'll finish updating our uses of bootkube from 1.2.2 to 1.3.0 alpha 5 today so everyone is on the same page.


philips commented on May 18, 2024

@dghubble Can you kill -SIGQUIT the process so we know what goroutine it is hung on?
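
For reference, SIGQUIT makes the Go runtime dump all goroutine stacks to stderr before the process exits, so something like this should work (the process pattern is illustrative; adjust it to whichever process is hung, and read the dump from the container's logs if it runs under docker):

```sh
# Dump goroutine stacks from the hung process, then read them back:
kill -QUIT $(pgrep -f kube-apiserver)
docker logs $(docker ps -aq --filter name=apiserver) 2>&1 | less
```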


dghubble commented on May 18, 2024

As of a few weeks ago, we're able to pin both the rendering and on-host bootkube to matching tagged releases every time. I haven't seen this flake since doing so. I'll re-open if it crops up.

