Comments (8)
This is a recurring issue I've noticed over the last couple of weeks, and I'm still investigating. It's most likely related to the combination of our custom networking, MicroOS, and Hetzner. For the time being, disable autoupgrades.
from terraform-hcloud-kube-hetzner.
I took what is documented here:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/README.md#examples
Sadly, given the Hetzner IP blacklist bs, using an egress gateway is the only way to ensure the cluster works with autoscaling, because many images are stored in ghcr and Hetzner IPs are randomly blocked there. It's also needed for things like SMTP, since many SMTP providers also block Hetzner IPs.
I will do some digging and see if I can find out why this happens. I have now disabled autoupgrades for both the nodes and k3s, yet some nodes still showed the same behavior.
So I did some digging, and it looks like a node still tried to run an upgrade, leading to the NotReady nodes situation again. When I turn autoupgrades off, does the change not apply to already provisioned nodes? Do I need to remove kured?
EDIT: okay I figured out that I do need to edit the nodes myself by running:
systemctl --now disable transactional-update.timer
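For a cluster with several already provisioned nodes, the command above could be applied in a loop. This is only a minimal sketch: the function name `disable_os_autoupgrade`, the `DRY_RUN` switch, and the `root` SSH login are assumptions for illustration and are not part of kube-hetzner.

```shell
#!/bin/sh
# Hypothetical helper (not part of kube-hetzner): runs the command from this
# thread on each node passed as an argument.
# Assumptions: nodes are reachable over SSH as root; DRY_RUN=1 only prints
# the commands instead of executing them.
disable_os_autoupgrade() {
  for node in "$@"; do
    cmd="systemctl --now disable transactional-update.timer"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Show what would be executed without touching the node.
      echo "ssh root@${node} ${cmd}"
    else
      ssh "root@${node}" "${cmd}"
    fi
  done
}
```

Running `DRY_RUN=1 disable_os_autoupgrade <node-ip-1> <node-ip-2>` first prints the commands, so the node list can be checked before anything is changed.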
Checking the logs, I see the upgrade run, then a CPU locks up and NetworkManager gets stuck, which kills networking on the node entirely because it never recovers. After that I see all these CPU stuck errors:
Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]
Mar 28 04:38:08 infra-large-btl kernel: Modules linked in: algif_hash af_alg ext4 mbcache jbd2 udp_diag inet_diag ip_set xt_CT cls_bpf sch_ingre>
Mar 28 04:38:08 infra-large-btl kernel: xhci_pci xhci_pci_renesas libata aesni_intel xhci_hcd virtio_scsi crypto_simd sd_mod cryptd t10_pi sg u>
Mar 28 04:38:08 infra-large-btl kernel: CPU: 3 PID: 2151 Comm: NetworkManager Not tainted 6.7.2-1-default #1 openSUSE Tumbleweed e152b88f51363d1>
Mar 28 04:38:08 infra-large-btl kernel: Hardware name: Hetzner vServer/Standard PC (Q35 + ICH9, 2009), BIOS 20171111 11/11/2017
Mar 28 04:38:08 infra-large-btl kernel: RIP: 0010:virtnet_send_command+0x106/0x170 [virtio_net]
Mar 28 04:38:08 infra-large-btl kernel: Code: 74 24 48 e8 fc 6b b8 c6 85 c0 78 60 48 8b 7b 08 e8 0f 4c b8 c6 84 c0 75 11 eb 22 48 8b 7b 08 e8 20>
Mar 28 04:38:08 infra-large-btl kernel: RSP: 0018:ffffbf9c40853a08 EFLAGS: 00000246
Mar 28 04:38:08 infra-large-btl kernel: RAX: 0000000000000000 RBX: ffff999ec1f229c0 RCX: 0000000000000001
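To check whether other nodes show the same pattern, the soft lockup lines can be tallied per process. A minimal sketch, assuming the log lines follow the format of the excerpt above; the function name `count_soft_lockups` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and counts soft lockup
# reports per process name, based on the "[name:pid]" suffix seen in the
# excerpt above.
count_soft_lockups() {
  grep 'soft lockup' \
    | sed -n 's/.*stuck for [0-9]*s! \[\([^]:]*\):[0-9]*].*/\1/p' \
    | sort | uniq -c
}
```

On a node this could be fed from the kernel journal, for example `journalctl -k | count_soft_lockups`, or from a saved log file.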
I am attaching my full log file from when this happened, in case someone here can shed some light on it.
I wonder if maybe the networking can't handle autoupgrades or updates given my settings? No idea tbh, but it'd be nice if this were solved or someone knew. I will continue researching on my end, but I think more heads are better than one.
@mysticaltech I haven't moved back to the default cilium settings yet because I am working on a new way of handling image pull failures without the egress gateway. I am currently evaluating a squid proxy as a replacement. Since disabling autoupgrades, though, I have had no further issues.
@sharkymcdongles No idea what could be happening, but I would suggest using our default cilium config instead. So remove cilium_values and try again.
And getting cilium to work well on Hetzner is super tricky, hence my above suggestion.
It's possible. So try switching to default networking settings, remove cilium_values, and see if it works better.
@sharkymcdongles Any updates, did the suggestion work?
Ok, great! The proxy solution sounds awesome. Don't hesitate to share in due time if you see fit.
We are narrowing down the automated upgrade issues in other threads, so I will close this one for now.