Comments (8)

aleksasiriski commented on June 20, 2024

This is a recurring issue I've noticed over the last couple of weeks, and I'm still investigating. It's most likely related to the combination of our custom networking, MicroOS, and Hetzner. For the time being, disable autoupgrades.

from terraform-hcloud-kube-hetzner.

sharkymcdongles commented on June 20, 2024

I followed what is documented here:

https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/blob/master/README.md#examples

Sadly, given the Hetzner IP blacklist BS, using an egress gateway is the only way to ensure the cluster works with autoscaling, because many images are stored in GHCR and Hetzner IPs are randomly blocked there. It's also needed for things like SMTP, since many SMTP providers also block Hetzner IPs.

I'll do some digging to see if I can find out why this happens. I have now disabled autoupgrades for both the nodes and k3s, yet some nodes still showed the same behavior.

So I did some digging, and it looks like it still attempted an upgrade, which led to the NotReady nodes again. When I turn autoupgrades off, is the change not applied to already-provisioned nodes? Do I need to remove Kured?
EDIT: Okay, I figured out that I do need to change the nodes myself by running:
systemctl --now disable transactional-update.timer
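
Since the setting change apparently does not reach nodes that already exist, that command has to be run on every node. Below is a minimal Python sketch that builds one ssh invocation per node for the same systemctl command. The node addresses and the root login are hypothetical assumptions about the setup, and by default the script only prints the commands (a dry run).

```python
import shlex
import subprocess

# Hypothetical node addresses; replace with your own.
NODES = ["10.0.0.2", "10.0.0.3"]

# The command from the thread: stop the timer now and keep it from starting on boot.
DISABLE_CMD = ["systemctl", "--now", "disable", "transactional-update.timer"]


def build_ssh_commands(nodes, user="root"):
    """Return one ssh argv list per node; root login is an assumption."""
    return [["ssh", f"{user}@{node}", *DISABLE_CMD] for node in nodes]


def disable_timer_on_nodes(nodes, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for argv in build_ssh_commands(nodes):
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    disable_timer_on_nodes(NODES, dry_run=True)
```

Passing a plain argv list to subprocess.run avoids shell quoting issues. Setting dry_run=False runs the commands for real.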

Checking the logs, I see that the upgrade runs, then the CPU locks up and NetworkManager gets stuck. Because it never recovers, networking on the node is completely dead. After that, I see all of these CPU stuck errors:

Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]
Mar 28 04:38:08 infra-large-btl kernel: Modules linked in: algif_hash af_alg ext4 mbcache jbd2 udp_diag inet_diag ip_set xt_CT cls_bpf sch_ingre>
Mar 28 04:38:08 infra-large-btl kernel:  xhci_pci xhci_pci_renesas libata aesni_intel xhci_hcd virtio_scsi crypto_simd sd_mod cryptd t10_pi sg u>
Mar 28 04:38:08 infra-large-btl kernel: CPU: 3 PID: 2151 Comm: NetworkManager Not tainted 6.7.2-1-default #1 openSUSE Tumbleweed e152b88f51363d1>
Mar 28 04:38:08 infra-large-btl kernel: Hardware name: Hetzner vServer/Standard PC (Q35 + ICH9, 2009), BIOS 20171111 11/11/2017
Mar 28 04:38:08 infra-large-btl kernel: RIP: 0010:virtnet_send_command+0x106/0x170 [virtio_net]
Mar 28 04:38:08 infra-large-btl kernel: Code: 74 24 48 e8 fc 6b b8 c6 85 c0 78 60 48 8b 7b 08 e8 0f 4c b8 c6 84 c0 75 11 eb 22 48 8b 7b 08 e8 20>
Mar 28 04:38:08 infra-large-btl kernel: RSP: 0018:ffffbf9c40853a08 EFLAGS: 00000246
Mar 28 04:38:08 infra-large-btl kernel: RAX: 0000000000000000 RBX: ffff999ec1f229c0 RCX: 0000000000000001
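
Because the same stall keeps repeating, it can help to tally the watchdog reports before reading through the full log. Below is a minimal Python sketch of a hypothetical helper (not part of kube-hetzner). It counts soft-lockup lines per process and records which CPUs stalled and for how long. The regex assumes the kernel message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches kernel watchdog reports such as
# "watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [NetworkManager:2151]".
# The non-greedy name plus the trailing "]" also handles names like "kworker/3:1H".
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+?):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Return (reports per (comm, pid), sorted CPU numbers, longest stall in seconds)."""
    per_task = Counter()
    cpus = set()
    longest = 0
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft-lockup line (RIP, register dumps, etc.)
        per_task[(match["comm"], int(match["pid"]))] += 1
        cpus.add(int(match["cpu"]))
        longest = max(longest, int(match["secs"]))
    return per_task, sorted(cpus), longest


if __name__ == "__main__":
    # Two lines taken from the excerpt above; pass an open log file to scan a whole log.
    sample = [
        "Mar 28 04:38:08 infra-large-btl kernel: watchdog: BUG: soft lockup - "
        "CPU#3 stuck for 26s! [NetworkManager:2151]",
        "Mar 28 04:38:08 infra-large-btl kernel: RIP: "
        "0010:virtnet_send_command+0x106/0x170 [virtio_net]",
    ]
    per_task, cpus, longest = summarize_soft_lockups(sample)
    for (comm, pid), n in per_task.most_common():
        print(f"{comm}[{pid}]: {n} report(s)")
    print(f"CPUs: {cpus}, longest stall: {longest}s")
```

In the excerpt, the RIP line points into virtnet_send_command in the virtio_net driver. A summary dominated by NetworkManager would therefore be consistent with the hang sitting in the virtio network path, though that is an inference from the log rather than a diagnosis.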

I'm attaching my full log file from when this happened, in case someone here can shed some light on it.

logs.txt

I wonder whether the networking can't handle autoupgrades or updates given my settings. No idea, to be honest, but it would be nice if this were solved or if someone knew the cause. I'll keep researching on my end, but more heads are better than one.

sharkymcdongles commented on June 20, 2024

@mysticaltech I haven't moved back to the default Cilium settings yet because I'm working on a new way to handle images failing to pull without the egress gateway. I'm currently evaluating a Squid proxy as a replacement. Since disabling autoupgrades, though, I have had no further issues.
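
One possible way to route only image pulls through such a proxy is sketched below, under assumptions. The k3s HTTP proxy documentation describes CONTAINERD_HTTP_PROXY, CONTAINERD_HTTPS_PROXY and CONTAINERD_NO_PROXY as variants of the usual proxy variables that apply to the embedded containerd only. k3s reads them from the service's environment file. The proxy hostname and the NO_PROXY ranges here are hypothetical placeholders.

```python
# Minimal sketch: render env-file lines that send only containerd (image pulls)
# through an HTTP proxy such as Squid. Hostname and ranges are hypothetical.

PROXY_URL = "http://squid.example.internal:3128"  # 3128 is Squid's default port
NO_PROXY = ["127.0.0.1", "localhost", "10.0.0.0/8"]  # hypothetical private ranges


def containerd_proxy_env(proxy_url, no_proxy):
    """Return env-file text using the CONTAINERD_-prefixed proxy variables."""
    lines = [
        f"CONTAINERD_HTTP_PROXY={proxy_url}",
        f"CONTAINERD_HTTPS_PROXY={proxy_url}",
        f"CONTAINERD_NO_PROXY={','.join(no_proxy)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print for review instead of writing; on a default k3s install the target would be
    # /etc/systemd/system/k3s.service.env (servers) or k3s-agent.service.env (agents).
    print(containerd_proxy_env(PROXY_URL, NO_PROXY), end="")
```

Scoping the proxy to containerd keeps kubelet and cluster traffic off it, so only registry pulls pass through Squid. The k3s or k3s-agent service has to be restarted to pick up the change. Whether this actually avoids the GHCR blocks depends on the proxy's own egress IP not being blocklisted too.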

mysticaltech commented on June 20, 2024

@sharkymcdongles No idea what could be happening, but I would suggest using our default Cilium config instead: remove cilium_values and try again.

mysticaltech commented on June 20, 2024

Also, getting Cilium to work well on Hetzner is super tricky, hence my suggestion above.

mysticaltech commented on June 20, 2024

It's possible. Try switching to the default networking settings, remove cilium_values, and see if it works better.

mysticaltech commented on June 20, 2024

@sharkymcdongles Any updates? Did the suggestion work?

mysticaltech commented on June 20, 2024

OK, great! The proxy solution sounds awesome. Don't hesitate to share it in due time if you see fit.

We are narrowing down the automated upgrade issues in other threads, so I will close this one for now.
