
Comments (48)

ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Make sure your nodes are using eth0 as the first network interface name. Ubuntu typically does not use eth0; it might be ens18, enp0s1, or something else like that. To check, SSH into a node and run ip a to view all interfaces, then look for the one with your LAN IP and verify its name.
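
A quick way to run that check from a shell (a sketch; assumes iproute2 and GNU grep, which Ubuntu and Debian ship by default):

# compact list of every interface and its addresses
ip -br addr show
# print just the interface that carries the default route, e.g. "eth0"
ip route get 1.1.1.1 | grep -oP 'dev \K\S+'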


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Is this eth0 or enp0s18?
I tried with the latter and that caused all masters to fail and the script to exit.

server@k3s-master-02:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether ce:fc:22:6b:12:1c brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 192.168.4.102/24 metric 100 brd 192.168.4.255 scope global dynamic eth0
       valid_lft 75942sec preferred_lft 75942sec
    inet6 fe80::ccfc:22ff:fe6b:121c/64 scope link 
       valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default 
    link/ether 3e:34:33:84:b9:16 brd ff:ff:ff:ff:ff:ff
    inet 10.42.2.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::3c34:33ff:fe84:b916/64 scope link 
       valid_lft forever preferred_lft forever
4: cni0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether 0a:2f:dc:ba:3d:2a brd ff:ff:ff:ff:ff:ff
    inet 10.42.1.1/24 brd 10.42.1.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::82f:dcff:feba:3d2a/64 scope link 
       valid_lft forever preferred_lft forever


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I did an ip route get 8.8.8.8 and it looks like all nodes are on eth0, unless it's being tricky or something.

server@k3s-worker-03:~$ ip route get 8.8.8.8
8.8.8.8 via 192.168.4.1 dev eth0 src 192.168.4.113 uid 1000 
    cache


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Okay, so it is eth0; it must be something else then. I haven't re-tested with Ubuntu as I don't personally use it anymore; I run everything on Debian. I can fire up my test server, see if I get the same result, and let you know.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Cool, that would be appreciated. Maybe it's something to do with Ubuntu 24.04 being so new...
It seems to always fail on the TASK [k3s_server_post : Wait for MetalLB resources], if that helps.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Yeah, it could also be a MetalLB version conflict with the k3s version.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I haven't changed the defaults from the repo for the k3s version or the MetalLB version, unless some update changed them at some point.
I will start setting up Debian VMs instead.
Do you also prefer/recommend LXC containers over VMs?


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

I use VMs to test with; my prod cluster is bare metal. I haven't tried using LXC other than to test functionality. I usually run custom-built Debian 12 cloud-init images, but a vanilla Debian 12 install would work as well. I don't customize it too much; one of my other repos has the Packer files to build it from the net-install ISO.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Gotcha, I can try setting up a Debian cloud-init image.
I just tried using the latest versions of k3s/kube-vip/MetalLB with no luck; I can't seem to get this working on Ubuntu 24.04 :/


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Welp, I tried setting up 24.04 cloud-init VMs to test, and it won't even get past downloading the k3s binary: something about an HTTPS connection missing a parameter. Doesn't make sense; that task hasn't been changed. I tried another release version to no avail as well. It must be something in the Canonical-provided Ubuntu cloud-init images.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Fixed that. Needed to update Ansible, haha.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

(screenshot attached)
I did change it to 1.28.8 when trying to get the k3s binary to download. I ran against 3 masters using the Ubuntu 22.04 LTS cloud-init image from Canonical within Proxmox, and the cluster fully installed, getting past the MetalLB error you were hitting. I wonder what else could be causing the problem for you. Maybe it's because you have some agent nodes too; I can try to check that, but MetalLB wouldn't go on the agents anyway since it's a critical addon.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

So I set up 6 Debian 12 containers and got a bit farther, but still hit an error on the cert-manager install:

TASK [cert-manager : apply cert-manager CRDs] *******************************************************************************
FAILED - RETRYING: [192.168.5.101]: apply cert-manager CRDs (2 retries left).
FAILED - RETRYING: [192.168.5.101]: apply cert-manager CRDs (1 retries left).
fatal: [192.168.5.101]: FAILED! => {"attempts": 2, "changed": true, "cmd": ["kubectl", "apply", "-f", "https://github.com/jetstack/cert-manager/releases/download/v1.13.2/cert-manager.crds.yaml"], "delta": "0:00:00.081072", "end": "2024-05-14 23:10:15.135764", "msg": "non-zero return code", "rc": 1, "start": "2024-05-14 23:10:15.054692", "stderr": "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

PLAY RECAP ******************************************************************************************************************
192.168.5.101              : ok=55   changed=28   unreachable=0    failed=1    skipped=25   rescued=0    ignored=0   
192.168.5.102              : ok=35   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.103              : ok=35   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.111              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.112              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.113              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0  

Is this just due to a version conflict? I'll give it more attempts today.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Okay, it looks like it's not getting the kubeconfig from the environment there, although it should. I might need to revisit the command being run and specify the path to the config as an argument. It's reverting to the default and looking for a k3s API at 127.0.0.1 rather than using either a master server IP or the VIP (preferably the VIP), which would be in the kubeconfig from prior scripted actions. There is a part that copies that file from /etc into the ansible user's home dir.
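
Roughly the idea (a sketch, not the repo's exact task; the home-dir path below is an assumption based on the description above): make kubectl read the copied kubeconfig so it targets the VIP instead of falling back to 127.0.0.1:6443.

# k3s writes its admin kubeconfig to /etc/rancher/k3s/k3s.yaml by default.
# If the playbook copies it into the ansible user's home dir (path assumed), point kubectl at it:
export KUBECONFIG=~/.kube/config
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.2/cert-manager.crds.yaml
# ...or pass it explicitly on each call:
kubectl --kubeconfig ~/.kube/config apply -f https://github.com/jetstack/cert-manager/releases/download/v1.13.2/cert-manager.crds.yaml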


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Gotcha, I will wait for those updates then, if that's the case. I appreciate it!


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

You're welcome. I will take a look at it, probably this afternoon.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Hey, sorry it took me until today to look at this. I just made a bunch of updates/fixes and tested them running smoothly on the Ubuntu cloud-init test VMs I set up. It was a mix of the kubectl command and Traefik changing things in their values files that I had missed even in my own main documentation... but it should be good now. Next I need to adjust the default k3s and MetalLB versions in the repo, so if you change them, YMMV; other than that, let me know if more issues arise.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

No worries, that was fast as is!
I brought in the new code and made my updates to the needed files, but this time I ran into an error in the "Wait for MetalLB resources" section with this. It's past midnight here, so I'll have to take a deeper look at these errors tomorrow (running on Debian 12):

TASK [k3s_server_post : Wait for MetalLB resources] **********************************************************************************************************************************************************failed: [192.168.5.101] (item=controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "deployment", "--namespace=metallb-system", "controller", "--for", "condition=Available=True", "--timeout=120s"], "delta": "0:02:10.206022", "end": "2024-05-17 00:06:50.899498", "item": {"condition": "--for condition=Available=True", "description": "controller", "name": "controller", "resource": "deployment"}, "msg": "non-zero return code", "rc": 1, "start": "2024-05-17 00:04:40.693476", "stderr": "E0517 00:05:41.908297   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server is currently unable to handle the request\nW0517 00:05:42.962668   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:05:42.962692   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:05:46.149432   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:05:46.149453   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:05:50.371827   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:05:50.371848   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:05:58.570020   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:05:58.570051   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nerror: timed out waiting for the condition on deployments/controller", "stderr_lines": ["E0517 00:05:41.908297   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server is currently unable to handle the request", "W0517 00:05:42.962668   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:05:42.962692   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:05:46.149432   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:05:46.149453   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:05:50.371827   32755 reflector.go:424] 
k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:05:50.371848   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:05:58.570020   32755 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:05:58.570051   32755 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "error: timed out waiting for the condition on deployments/controller"], "stdout": "", "stdout_lines": []}
ok: [192.168.5.101] => (item=webhook service)
failed: [192.168.5.101] (item=pods in replica sets) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "pod", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for", "condition=Ready", "--timeout=120s"], "delta": "0:02:06.534153", "end": "2024-05-17 00:08:59.172505", "item": {"condition": "--for condition=Ready", "description": "pods in replica sets", "resource": "pod", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2024-05-17 00:06:52.638352", "stderr": "error: timed out waiting for the condition on pods/controller-586bfc6b59-s8sfj", "stderr_lines": ["error: timed out waiting for the condition on pods/controller-586bfc6b59-s8sfj"], "stdout": "", "stdout_lines": []}
failed: [192.168.5.101] (item=ready replicas of controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "replicaset", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for=jsonpath={.status.readyReplicas}=1", "--timeout=120s"], "delta": "0:01:00.088430", "end": "2024-05-17 00:10:12.166949", "item": {"condition": "--for=jsonpath='{.status.readyReplicas}'=1", "description": "ready replicas of controller", "resource": "replicaset", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2024-05-17 00:09:12.078519", "stderr": "Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)", "stderr_lines": ["Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)"], "stdout": "", "stdout_lines": []}
ok: [192.168.5.101] => (item=fully labeled replicas of controller)
failed: [192.168.5.101] (item=available replicas of controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "replicaset", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for=jsonpath={.status.availableReplicas}=1", "--timeout=120s"], "delta": "0:02:00.647835", "end": "2024-05-17 00:12:26.566429", "item": {"condition": "--for=jsonpath='{.status.availableReplicas}'=1", "description": "available replicas of controller", "resource": "replicaset", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2024-05-17 00:10:25.918594", "stderr": "E0517 00:11:38.158727   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server is currently unable to handle the request\nW0517 00:11:39.017024   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:11:39.017145   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:11:41.637903   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:11:41.637928   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:11:45.512648   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:11:45.512670   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nW0517 00:11:57.735361   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready\nE0517 00:11:57.735381   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready\nerror: timed out waiting for the condition on replicasets/controller-586bfc6b59", "stderr_lines": ["E0517 00:11:38.158727   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: the server is currently unable to handle the request", "W0517 00:11:39.017024   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:11:39.017145   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:11:41.637903   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:11:41.637928   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:11:45.512648   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to 
list *unstructured.Unstructured: apiserver not ready", "E0517 00:11:45.512670   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "W0517 00:11:57.735361   33785 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *unstructured.Unstructured: apiserver not ready", "E0517 00:11:57.735381   33785 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *unstructured.Unstructured: failed to list *unstructured.Unstructured: apiserver not ready", "error: timed out waiting for the condition on replicasets/controller-586bfc6b59"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *******************************************************************************************************************************************************************************************

PLAY RECAP ***************************************************************************************************************************************************************************************************
192.168.5.101              : ok=43   changed=16   unreachable=0    failed=1    skipped=20   rescued=0    ignored=0   
192.168.5.102              : ok=34   changed=10   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.103              : ok=34   changed=10   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.111              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.112              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.113              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0
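
Given the repeated "apiserver not ready" messages above, a few manual checks run on the first master can show whether MetalLB itself is stuck or the control plane is flapping while the wait runs (a sketch; nothing here is specific to the playbook):

# are the MetalLB pods actually scheduling and starting?
k3s kubectl -n metallb-system get pods -o wide
k3s kubectl -n metallb-system describe deployment controller
# if even basic calls fail intermittently, the API server itself is unstable
k3s kubectl get nodes
# recent k3s server logs on this node
sudo journalctl -u k3s --no-pager -n 100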


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Is that due to the NIC device name in the vars file this time around?


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Nah, I'm still on eth0.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Weird. What MetalLB version?


treverehrfurth avatar treverehrfurth commented on July 17, 2024

13.12


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

k3s version?


treverehrfurth avatar treverehrfurth commented on July 17, 2024

v1.26.10+k3s1


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Okay, I forgot to bump that in the sample vars file. Try it with v1.28.8+k3s1; that's what I tested it with.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

We are getting closer!
It made it past that, but the 1st master is still failing, now at the Apply MetalLB CRs step:

TASK [k3s_server_post : Apply metallb CRs] *******************************************************************************************************************************************************************
FAILED - RETRYING: [192.168.5.101]: Apply metallb CRs (5 retries left).
FAILED - RETRYING: [192.168.5.101]: Apply metallb CRs (4 retries left).
FAILED - RETRYING: [192.168.5.101]: Apply metallb CRs (3 retries left).
FAILED - RETRYING: [192.168.5.101]: Apply metallb CRs (2 retries left).
FAILED - RETRYING: [192.168.5.101]: Apply metallb CRs (1 retries left).
fatal: [192.168.5.101]: FAILED! => {"attempts": 5, "changed": false, "cmd": ["k3s", "kubectl", "apply", "-f", "/tmp/k3s/metallb-crs.yaml", "--timeout=120s"], "delta": "0:00:00.090814", "end": "2024-05-19 20:05:39.450825", "msg": "non-zero return code", "rc": 1, "start": "2024-05-19 20:05:39.360011", "stderr": "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *******************************************************************************************************************************************************************************************

PLAY RECAP ***************************************************************************************************************************************************************************************************
192.168.5.101              : ok=45   changed=16   unreachable=0    failed=1    skipped=20   rescued=0    ignored=0   
192.168.5.102              : ok=34   changed=10   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.103              : ok=34   changed=10   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.111              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.112              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.113              : ok=11   changed=3    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0
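
"The connection to the server 127.0.0.1:6443 was refused" usually means the local k3s service is down or mid-restart at that moment, rather than a MetalLB problem as such. A quick way to confirm that on the failing master (a sketch):

# is the k3s server actually running right now?
sudo systemctl is-active k3s
sudo systemctl status k3s --no-pager
# watch the service logs for crash loops or etcd/kube-vip errors while the playbook runs
sudo journalctl -u k3s -f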


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

(screenshot attached)
Zero issues with that on my end.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Just ran another test on Ubuntu 24.04 LTS and also Debian 12; both were successful with no errors.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Make sure you run the reset playbook and then try to run k3s-uninstall.sh on each node (if it's not found, that's good) to make sure it's cleaned out, reboot, and then try fresh. In one of my tests, a reset somehow left some k3s files behind; k3s-uninstall.sh was still there, so I ran it to clean the rest out.
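
A sketch of that cleanup check across all six nodes (IPs taken from the play recap above; the SSH user is assumed to be the same "server" account shown earlier, with passwordless sudo):

for node in 192.168.5.101 192.168.5.102 192.168.5.103 192.168.5.111 192.168.5.112 192.168.5.113; do
  echo "== $node =="
  ssh server@"$node" 'if command -v k3s-uninstall.sh >/dev/null 2>&1; then sudo k3s-uninstall.sh; else echo "clean: no k3s-uninstall.sh found"; fi'
done
# then reboot each node before re-running the playbook, e.g.: ssh server@<node> 'sudo reboot'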


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I have run the kill-all.sh, but I don't see a k3s-uninstall.sh?


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

"I have run the kill-all.sh, but I don't see a k3s-uninstall.sh?"

When you SSH to each node: if it's not there in the tab auto-complete, that's good; it's all cleaned out.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I didn't find any k3s-uninstall.sh on any of the VMs.
It couldn't be because I'm running the deploy.sh from Ubuntu while all my VMs are now Debian 12, could it?


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Nah. The only things I would make sure of are that you installed Ansible from pip and not from the apt repo, so it's the newest version; that you can SSH to each node as the ansible user using SSH key auth; and that that user has sudo rights without needing a password to be re-entered, since that is typically how a cloud-init image is built.
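
A sketch of those three pre-flight checks (username and IP reused from earlier in this thread; adjust to your inventory):

# 1) Ansible installed from pip (per-user), not the distro package
python3 -m pip install --user --upgrade ansible    # on newer distros you may need pipx or a venv instead
ansible --version                                  # "executable location" should now point into ~/.local/bin
# 2) key-based SSH to each node as the ansible user (must not prompt for a password)
ssh -o BatchMode=yes server@192.168.5.101 'echo ssh ok'
# 3) passwordless sudo on the node
ssh server@192.168.5.101 'sudo -n true && echo "passwordless sudo ok"'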


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Damn, it made it farther than ever this time:

TASK [cert-manager : deploy cert-manager using helm without internal CA] **************************************************************************************************************************fatal: [192.168.5.101]: FAILED! => {"changed": true, "cmd": ["helm", "install", "cert-manager", "jetstack/cert-manager", "--namespace", "cert-manager", "--version", "v1.13.2", "--wait"], "delta": "0:00:15.008485", "end": "2024-05-21 21:54:28.767310", "msg": "non-zero return code", "rc": 1, "start": "2024-05-21 21:54:13.758825", "stderr": "Error: INSTALLATION FAILED: Unable to continue with install: could not get information about the resource ServiceAccount \"cert-manager-webhook\" in namespace \"cert-manager\": Get \"https://192.168.5.50:6443/api/v1/namespaces/cert-manager/serviceaccounts/cert-manager-webhook\": dial tcp 192.168.5.50:6443: connect: connection refused - error from a previous attempt: unexpected EOF", "stderr_lines": ["Error: INSTALLATION FAILED: Unable to continue with install: could not get information about the resource ServiceAccount \"cert-manager-webhook\" in namespace \"cert-manager\": Get \"https://192.168.5.50:6443/api/v1/namespaces/cert-manager/serviceaccounts/cert-manager-webhook\": dial tcp 192.168.5.50:6443: connect: connection refused - error from a previous attempt: unexpected EOF"], "stdout": "", "stdout_lines": []}

PLAY RECAP ****************************************************************************************************************************************************************
192.168.5.101              : ok=57   changed=30   unreachable=0    failed=1    skipped=27   rescued=0    ignored=0   
192.168.5.102              : ok=35   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.103              : ok=35   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0   
192.168.5.111              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.112              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   
192.168.5.113              : ok=11   changed=7    unreachable=0    failed=0    skipped=19   rescued=0    ignored=0   

server@k3s-admin:~/k3s-ansible-traefik-rancher$ ansible --version
ansible [core 2.16.6]
  config file = /home/server/k3s-ansible-traefik-rancher/ansible.cfg
  configured module search path = ['/home/server/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /home/server/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Maybe it's because of my Ansible version? It looks like I'm on 2.16.6.
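
For what it's worth, that helm error is a "connection refused" from the VIP (192.168.5.50) mid-install, so another thing worth checking is whether the API server stays reachable on the VIP for the whole run. A rough sketch, from the Ansible host or a master:

# the API should answer on the VIP (any TLS/JSON response is fine; "connection refused" is the problem)
curl -ks https://192.168.5.50:6443/version
# poll it while the playbook runs to see whether it flaps
while true; do date; curl -ksm 2 https://192.168.5.50:6443/livez || echo "VIP not answering"; echo; sleep 2; done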


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

I'm on 2.15.11.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Is this a slower system you are trying to deploy to? Or maybe it's all on spinning drives for storage? It seems like resources aren't becoming ready within the time they should, and that's causing errors. That, or while things are still in progress, stuff is going up and down and causing issues. k3s, and Kubernetes in general, is very sensitive to latency.
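
If storage latency is the suspect, a common check is an fsync-heavy fio run against the k3s data directory, since embedded etcd wants 99th-percentile fdatasync latency under roughly 10 ms. A sketch (fio must be installed; the path is the default k3s data dir; best run on a test node, since it writes a temporary file there):

sudo fio --name=etcd-fsync --directory=/var/lib/rancher/k3s --size=64m \
  --bs=2300 --rw=write --ioengine=sync --fdatasync=1 \
  --runtime=30s --time_based
# check the fsync/fdatasync latency percentiles in the output; a 99th percentile well above
# 10 ms tends to make the cluster flaky under load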


treverehrfurth avatar treverehrfurth commented on July 17, 2024

It's on an NVMe drive that all my other Proxmox VMs also run on; not sure if that would be the issue or not.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Could be, if there is a lot of IO overhead.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Hmm, could be. I will be installing 4 x 2TB NVMe drives in either 2 TrueNAS pools (both mirrored) or 1 pool with 2 mirrored vdevs.
Then I can test it on there with no other VMs running.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Technically, if it's all on the same host you don't need to run worker nodes; just allocate more resources to the server nodes. My bare-metal cluster of 5 are all masters; each node has all roles. I can tolerate 2 of them going down and everything will still keep running.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

The VMs are running on 3 separate servers but from the same NFS share.
But I shall try just running 3 masters across them instead and see.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Well, that could be the issue right there: storing VM disks over NFS.


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

My test server is a Dell R620 with 96GB of RAM and both SAS SSDs and some 10k 2.5" SAS HDDs; the HDDs are in a ZFS pool. I have put all 3 test VMs on the same SAS SSD with no issue there in testing.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

My Proxmox cluster is in HA, so it needs shared persistent storage outside of the servers, residing on my TrueNAS 45Drives Q30.
It runs multiple Docker VMs pretty flawlessly, but maybe between that and this the IOPS aren't there.
I will get the mirrored NVMe pool up and see how that goes when I get the time.
While installing that, I'll be replacing a backplane that had a defective slot for an HDD and swapping the mobo/CPU/RAM to a Supermicro board, so lots of updates in the queue!


ChrisThePCGeek avatar ChrisThePCGeek commented on July 17, 2024

Nice. I haven't done much research on using NFS for VM storage; I'd be inclined to try it with iSCSI targets, though. In Proxmox each target would be assigned to the VM (I think, from my testing), at least unless you go into the Proxmox CLI to configure the iSCSI share, mount it somewhere under /mnt, and add it as a directory type under Datacenter -> Storage. Not sure whether iSCSI is faster or not, but you do typically need fast storage for the etcd database in k3s. Since k3s is HA in itself, another option is to exclude those nodes from any live migrations on the Proxmox servers and just use local storage on each Proxmox node for those VMs.


treverehrfurth avatar treverehrfurth commented on July 17, 2024

Hey @ChrisThePCGeek!
It's been a busy time with work, travel, and life, BUT I finally got time to set my containers up on local storage (not as HA containers in Proxmox) and I finally have your script working without failure!

So, a note to others: if you run your Proxmox VMs in HA from a shared pool (even with mine being an NVMe pool), this might not work no matter what.
Once the VMs were running on local storage on each of my servers, this worked flawlessly.

Thanks again for all your support and explanations here!


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I haven't been able to get to Rancher yet by going to 192.168.4.60, though; I just get a 404 page not found. Am I missing something?

server@k3s-master-01:~$ kubectl get svc --all-namespaces -o wide
NAMESPACE                         NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE   SELECTOR
cattle-fleet-system               gitjob                 ClusterIP      10.43.45.81     <none>         80/TCP                       70m   app=gitjob
cattle-provisioning-capi-system   capi-webhook-service   ClusterIP      10.43.194.147   <none>         443/TCP                      69m   cluster.x-k8s.io/provider=cluster-api
cattle-system                     rancher                ClusterIP      10.43.225.226   <none>         80/TCP,443/TCP               71m   app=rancher
cattle-system                     rancher-webhook        ClusterIP      10.43.22.84     <none>         443/TCP                      69m   app=rancher-webhook
cert-manager                      cert-manager           ClusterIP      10.43.37.110    <none>         9402/TCP                     72m   app.kubernetes.io/component=controller,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=cert-manager
cert-manager                      cert-manager-webhook   ClusterIP      10.43.191.82    <none>         443/TCP                      72m   app.kubernetes.io/component=webhook,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook
default                           kubernetes             ClusterIP      10.43.0.1       <none>         443/TCP                      74m   <none>
kube-system                       kube-dns               ClusterIP      10.43.0.10      <none>         53/UDP,53/TCP,9153/TCP       74m   k8s-app=kube-dns
kube-system                       metrics-server         ClusterIP      10.43.218.3     <none>         443/TCP                      74m   k8s-app=metrics-server
kube-system                       traefik                LoadBalancer   10.43.236.73    192.168.4.60   80:31371/TCP,443:31919/TCP   72m   app.kubernetes.io/instance=traefik-kube-system,app.kubernetes.io/name=traefik
kube-system                       traefik-external       LoadBalancer   10.43.245.211   192.168.4.61   80:31377/TCP,443:31794/TCP   71m   app.kubernetes.io/instance=traefik-external-kube-system,app.kubernetes.io/name=traefik
metallb-system                    webhook-service        ClusterIP      10.43.232.32    <none>         443/TCP                      74m   component=controller


treverehrfurth avatar treverehrfurth commented on July 17, 2024

I was able to get to it by adding 192.168.4.60 to my Windows hosts file, but I still can't from just the IP.
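
That behavior is expected with host-based routing: Traefik routes on the Host header, so a request to the bare IP matches no router and returns 404, while the hosts-file entry supplies the hostname Rancher was installed with. A sketch of how to check the expected hostname and point a DNS record (or hosts entry) at the Traefik LoadBalancer IP; the hostname below is a placeholder:

# find the host rule Rancher's ingress expects
kubectl -n cattle-system get ingress
# then resolve that name to the Traefik LB IP, e.g. in DNS or a hosts file:
#   192.168.4.60   rancher.example.com
# Linux: /etc/hosts    Windows: C:\Windows\System32\drivers\etc\hosts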

