Comments (10)
@dghubble Can you try increasing the log verbosity on the api-server (--v=10)?
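For reference, a sketch of bumping the verbosity in a rendered apiserver manifest. The path, the hyperkube command form, and the sed edit are all assumptions for illustration; adapt to however the assets were actually generated. This demo works on a scratch copy rather than a live manifest:

```shell
# Hypothetical: write a stub of an apiserver manifest, then append
# --v=10 to the apiserver command list the way you would on a real node.
manifest=/tmp/kube-apiserver.yaml
cat > "$manifest" <<'EOF'
    command:
    - /hyperkube
    - apiserver
EOF
# Insert the verbosity flag right after the apiserver subcommand.
sed -i 's#- apiserver#- apiserver\n    - --v=10#' "$manifest"
cat "$manifest"
```

On a real master, the kubelet restarts the static pod once the manifest on disk changes.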
Based on that retry message, the apiserver should not be exiting (just in a loop which tries to bind on that interface):
https://github.com/kubernetes/kubernetes/blob/v1.2.4/pkg/genericapiserver/genericapiserver.go#L716
If it does exit, and there's only one api-server (which is true for now), this could lead to a race... but kubelet should even be able to recover locally from that (loss of api-server after pod has been scheduled).
from bootkube.
Also, are you giving it any time to recover? The sync loop may take a bit (but again ... it shouldn't be exiting based on that log message anyway).
It ran for a few hours without recovering. I'll add the verbosity flag in the future when I have time to provision a bunch of clusters.
These cluster crashes still happen with dee39f8 in regular use, typically within the first few minutes of starting the generated pod manifests. Validating by running a vagrant VM and checking the desired containers will mask the unreliability of keeping the cluster running.
This is especially hard to investigate because docker logs are cleared within a few minutes after cluster crash. As for causes, it is hard to separate the noise of "regular" failures from what matters without more knowledge of Kubernetes. I've seen the api port binding error, and I've seen TLS handshakes work for a time and then seemingly start failing. In one unrelated misconfiguration, a kube-proxy could not find the /etc/kubernetes/kubeconfig (I was playing with mounts from other locations, a bad idea); surprisingly, the inability to start the proxy seemed to precipitate the cluster crash even though the apiserver was running.
I've tested various bootkube executables and generated files, but there is too much going on here for me to make sense of or spend more time on right now. It may be advisable for bootkube start to wait until generated pod manifests are running successfully for a time before exiting.
Validating by running a vagrant VM and checking the desired containers will mask the unreliability of keeping the cluster running.
Can you provide more details? Not really sure what you mean here.
This is especially hard to check because docker logs are cleared within a few minutes after cluster crash
The container GC should be configurable if you want it to wait longer:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L190
https://github.com/coreos/bootkube/blob/master/pkg/asset/templates/kubelet.yaml#L34
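For example, a sketch of kubelet flags that slow the container GC down (the values here are illustrative, not defaults; pick whatever keeps dead containers around long enough to read their logs):

```shell
# Illustrative kubelet container-GC settings. Dead containers survive
# at least the TTL, up to the per-container and global caps.
kubelet \
  --minimum-container-ttl-duration=2h \
  --maximum-dead-containers-per-container=5 \
  --maximum-dead-containers=200
```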
I've seen TLS handshakes work for a time and then seemingly start failing.
[...] a kube-proxy could not find the /etc/kubernetes/kubeconfig
These both sound like mis-configuration of the templates. Do you have something that is reproducible? If so we can go over the configs for potential problems.
It may be advisable for bootkube start to wait until generated pod manifests are running successfully for a time before exiting.
Possibly, but I'm not sure "wait longer" is the right answer here. If components are unable to start due to misconfiguration, they're never going to start no matter how long we wait. And if it's a race, we should identify it; waiting longer likely won't help the issue.
Unfortunately, I've still not run into this issue (in any recent iteration of this tool).
Can you provide more details? Not really sure what you mean here.
I mean that typically I'll see kubectl commands succeed for a time, return nodes, show containers starting, and otherwise seem ok. Over the next few minutes, the cluster can become inaccessible. Spinning up a vagrant VM, checking once that kube-* pods are running, and then destroying it may miss something happening on real clusters. I'd leave it running for a bit, launch some pods, etc.
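A longer-lived smoke test along those lines might look like the following sketch. The check command and intervals are parameters I'm inventing for illustration; the check defaults to a no-op so the sketch runs standalone, but on a real cluster you would point it at something like kubectl get nodes:

```shell
# Poll the cluster repeatedly instead of checking once and tearing down.
# CHECK_CMD, CHECKS, and INTERVAL are hypothetical knobs for this sketch.
check=${CHECK_CMD:-true}
checks=${CHECKS:-30}
interval=${INTERVAL:-0}
for i in $(seq 1 "$checks"); do
  $check || { echo "cluster check failed at attempt $i"; exit 1; }
  sleep "$interval"
done
echo "cluster stayed healthy across $checks checks"
```

With a real probe and a non-zero interval, this catches the "works for a few minutes, then dies" pattern that a one-shot check misses.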
Ok, increasing the GC period is a change to the deploys that will improve debuggability for when this happens.
Do you have something that is reproducible?
No, it's random. It still occurs with the updates in poseidon/matchbox#237 and through an internal workflow (I'll link you offline). The hack examples use vagrant/virtualbox, so at some point I may need to set that up and reproduce using your flow, or have one of you using Linux verify in the libvirt or bare-metal flow.
These both sound like mis-configuration of the templates.
Yes, /etc/kubernetes/kubeconfig was an unrelated mounting experiment I was doing. The surprising bit is that the repeated failures to launch the proxy (in that case, expected) seemed to precede the apiserver crash in the logs. Perhaps just a red herring.
We are testing more than just spinning up / immediately shutting down. See conformance tests:
https://github.com/coreos/bootkube#conformance-tests
Also, haven't run into this issue against GCE clusters: #55
Now, somewhat related: if you do lose all api-servers, there isn't currently a recovery method for this. The options right now are multiple apiservers (which must be load-balanced or behind DNS), or checkpointing the api-server pod locally, which is what @derekparker has been working on as an interim solution: #50
But this still doesn't have to do with bootstrapping per se... your apiservers shouldn't just immediately fail. Reboots, docker failing, etc. could, however, cause the self-hosted api-server to not recover (assuming a single copy). Along those lines: could it be that your nodes are rebooting for a CoreOS update right after first boot?
Auto-updates would cause the apiserver to fail, but so far that hasn't been the cause. As soon as I see kubectl commands failing, I SSH in and check docker ps -a to find a bunch of exited kube-* containers. If the clusters lasted more than a few minutes, the master node would eventually find and download a CoreOS update (now that there is a newer alpha than 1053.0.0), but we don't get that far.
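One way to race the container GC in that situation is to dump logs from every exited kube container the moment kubectl starts failing. A sketch (the name filter and output path are assumptions), meant to be run on the master over SSH:

```shell
# Save logs from exited containers whose names mention "kube" before
# the kubelet's container GC deletes them.
outdir=/tmp/crash-logs
mkdir -p "$outdir"
docker ps -a --filter 'status=exited' --format '{{.ID}} {{.Names}}' 2>/dev/null |
  awk '/kube/ {print $1}' |
  while read -r id; do
    docker logs "$id" > "$outdir/$id.log" 2>&1
  done
```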
I'll finish updating our uses of bootkube from 1.2.2 to 1.3.0 alpha 5 today so everyone is on the same page.
@dghubble Can you kill -SIGQUIT the process so we know what goroutine it is hung on?
As of a few weeks ago, we're able to pin both the rendering and on-host bootkube to matching tagged releases every time. I haven't seen this flake since doing so. I'll re-open if it crops up.