lablabs / ansible-role-rke2
Ansible Role to install RKE2 Kubernetes.
Home Page: https://galaxy.ansible.com/ui/standalone/roles/lablabs/rke2/
License: MIT License
I am having this error while running the playbook.
{"msg": "The conditional check 'inventory_hostname is in groups.masters' failed. The error was: template error while templating string: expected token 'end of statement block', got '.'. String: {% if inventory_hostname is in groups.masters %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/main.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Install Keepalived when HA mode is enabled\n ^ here\n"}
I am using the latest Ansible version, and I have tried an older version as well. Any idea what's going on here?
I just made the part for taints.
I could do something similar for nodelabels to set specific labels per node.
Feature Idea
Hi,
Thank you for a great role, made my life easier when migrating from RKE1 to RKE2.
I was first looking into rancherd but then moved back to RKE2, and I was happy when I found this role, so thank you for that.
I have a question regarding node labels.
I found that it was possible to add node labels using the k8s_node_label var.
My question is regarding the documentation for the labels (or rather the k8s_node_label var).
I haven't seen any documentation for it. I found it when looking through the source files (to see if it was possible to add node labels).
The missing documentation makes me a bit worried that it's not ready for use.
Is there a reason for me not to use it? Or is it just that it hasn't been documented yet (or did I miss it somewhere)?
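For reference, this is roughly how I ended up setting it, assuming k8s_node_label takes a list of key=value label strings and can be set per host (the host name and labels below are just examples of mine, not defaults of the role):

# host_vars/worker-01.yml (hypothetical host)
k8s_node_label:
  - node-role.kubernetes.io/storage=true
  - topology.kubernetes.io/zone=dc1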
During the initial installation of a cluster using RKE2 version 1.27.1+rke2r1 (kube-vip, Cilium as CNI, kube-proxy disabled), the first node is stuck in the NotReady state, preventing the successful completion of the cluster installation.
The workaround I found:
ip a a 192.0.2.20 dev ens224
systemctl restart rke2-server.service
Not sure why this is happening so far, possibly due to the disabling of kube proxy.
Bug Report
Ansible 2.14.8
Deploy RKE2 with the following variables:
rke2_version: v1.27.1+rke2r1
rke2_cluster_group_name: kubernetes_cluster
rke2_servers_group_name: kubernetes_masters
rke2_agents_group_name: kubernetes_workers
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_additional_sans:
- kubernetes-api.example.net
rke2_api_ip: "192.0.2.20"
rke2_kubevip_svc_enable: false
rke2_interface: "ens224"
rke2_kubevip_cloud_provider_enable: false
rke2_cni: cilium
rke2_disable:
- rke2-canal
- rke2-ingress-nginx
rke2_custom_manifests:
- rke2-cilium-proxy.yaml
disable_kube_proxy: true
rke2_drain_node_during_upgrade: true
rke2_wait_for_all_pods_to_be_ready: true
Here is the content of rke2-cilium-proxy.yaml:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict
    k8sServiceHost: {{ rke2_api_ip }}
    k8sServicePort: {{ rke2_apiserver_dest_port }}
    ipv4NativeRoutingCIDR: 10.43.0.0/15
    hubble:
      enabled: true
      metrics:
        enabled:
          - dns:query;ignoreAAAA
          - drop
          - tcp
          - flow
          - icmp
          - http
      relay:
        enabled: true
        replicas: 3
      ui:
        enabled: true
        replicas: 3
        ingress:
          enabled: false
The first server should at some point reach the Ready state, so that the installation of the cluster succeeds.
[…]
FAILED - RETRYING: [k8s01.example.net]: Wait for the first server be ready (1 retries left).
fatal: [k8s01.example.net]: FAILED! => changed=false
attempts: 40
cmd: |-
set -o pipefail
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep "k8s01.example.net"
delta: '0:00:00.096538'
end: '2023-08-23 09:31:26.649490'
msg: ''
rc: 0
start: '2023-08-23 09:31:26.552952'
stderr: ''
stderr_lines: <omitted>
stdout: k8s01.example.net NotReady control-plane,etcd,master 10m v1.27.1+rke2r1
stdout_lines: <omitted>
Environmental Info:
RKE2 Version: v1.23.7+rke2r2
Node(s) CPU architecture, OS, and Version:
Linux master-01 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 master
4 workers
airgap environment
CNI: multus,calico
Describe the bug:
When bootstrapping an RKE2 airgap cluster with CNI plugins other than the default (in my case, multus and calico), the CNI images are copied into /var/lib/rancher/rke2/artifacts/. But in order to deploy the CNIs, the tarballs (compressed zst or tar.gz) also have to be copied into /var/lib/rancher/rke2/agent/images.
Please note that I mentioned the default settings for artifacts and images.
I already opened an issue for RKE2: rancher/rke2#3147
And the response was that it is the expected behaviour, so the images of the CNIs have to be manually copied to this path.
Bug Report
ansible [core 2.12.6]
config file = /home/test/ANA/Offline-RKE2-ANA/ansible.cfg
configured module search path = ['/home/test/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/test/.local/lib/python3.8/site-packages/ansible
ansible collection location = /home/test/ANA/Offline-RKE2/collections
executable location = /home/test/.local/bin/ansible
python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = True
---
- name: RKE2 K8S BOOTSTRAPPING
  hosts: all
  gather_facts: true
  become: true
  vars_files:
    - vars/global.yaml
  vars:
    rke2_ha_mode: false
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: false
    rke2_additional_sans:
      - ana.pt
      - k8s.ana.pt
      - k8s-vmware.ana.pt
    rke2_apiserver_dest_port: 6443
    rke_server_taint: true
    rke2_token: GIzY5kxm9WRGxBekiifQ
    rke2_version: v1.23.7+rke2r2
    rke2_channel: stable
    rke2_artifact_path: /var/lib/rancher/rke2/artifacts
    rke2_airgap_copy_sourcepath: local_artifacts/local_artifacts_rke2
    rke2_cni:
      - multus
      - calico
    rke2_download_kubeconf: true
    rke2_download_kubeconf_file_name: rke2.yaml
    rke2_download_kubeconf_path: /tmp
    nexus_container_registry: "{{ nexus_ingress_cr_host }}"
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: quay.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: docker.elastic.co
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: cr.fluentbit.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: registry.gitlab.com
        endpoint:
          - "https://{{ nexus_container_registry }}"
    rke2_custom_registry_configs:
      - endpoint: "\"{{ nexus_container_registry }}\""
        config:
          tls:
            insecure_skip_verify: true
    rke2_custom_manifests:
      - roles/lablabs.rke2/files/rke2-ingress-nginx-config.yml
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst
  roles:
    - role: lablabs.rke2
RKE2 nodes should be in the Ready state.
The first server can't find the multus or calico images. The artifact files are copied to `{{ rke2_artifact_path }}` but not to `agent/images`.
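As a workaround I copy the CNI tarballs into the agent images directory myself before starting the service. A minimal sketch, not part of the role; the source path and filenames are assumptions based on my setup above:

- name: Copy CNI image tarballs to the RKE2 agent images dir (manual workaround)
  ansible.builtin.copy:
    src: "local_artifacts/local_artifacts_rke2/{{ item }}"
    dest: "/var/lib/rancher/rke2/agent/images/{{ item }}"
    mode: "0644"
  loop:
    - rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst
    - rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst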
Hey, I've tried out this role and am already a fan of it over Rancher's role as it is published on ansible-galaxy, which doesn't seem to be on the roadmap for that repo.
One thing I'd love is support for running CIS hardening as part of RKE2 Security Hardening guide. It's included in the rancherfederal repo here.
I can add the following as a var:
rke2_server_options:
- "profile: cis-1.6"
But I get the following error in the logs:
Jun 28 04:12:21 [HOSTNAME] sh[90630]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 28 04:12:22 [HOSTNAME] rke2[90634]: time="2022-06-28T04:12:22Z" level=fatal msg="missing required: user: unknown user etcd\nmissing required: group: unknown group etcd\ninvalid kernel parameter value vm.overcommit_memory=0 - expected 1\ninvalid kernel parameter value kernel.panic=0 - expected 10\ninvalid kernel parameter value kernel.panic_on_oops=0 - expected 1\n"
Jun 28 04:12:22 [HOSTNAME] systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Jun 28 04:12:22 [HOSTNAME] systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Jun 28 04:12:22 [HOSTNAME] systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
which is because it's failing in the CIS checks, as noted from the docs:
Checks that host-level requirements have been met. If they haven't, RKE2 will exit with a fatal error describing the unmet requirements.
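For anyone hitting the same thing: the unmet requirements from the log above (the etcd user/group and the kernel parameters) can be prepared before the role runs. A rough sketch of pre_tasks I used, not part of this role, with the values taken from the fatal message:

pre_tasks:
  - name: Create etcd group required by the CIS profile
    ansible.builtin.group:
      name: etcd
      system: true

  - name: Create etcd user required by the CIS profile
    ansible.builtin.user:
      name: etcd
      group: etcd
      system: true
      shell: /sbin/nologin
      create_home: false

  - name: Set kernel parameters expected by the CIS checks
    ansible.posix.sysctl:
      name: "{{ item.name }}"
      value: "{{ item.value }}"
      state: present
    loop:
      - { name: vm.overcommit_memory, value: "1" }
      - { name: kernel.panic, value: "10" }
      - { name: kernel.panic_on_oops, value: "1" }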
Side note: what is the background of this role? Was it created in parallel with / independently of the rancherfederal repo?
Feature Idea
While creating a new server ansible was stuck in "Restore etcd" block execution for a long time.
Bug Report
ansible [core 2.12.7]
config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
When I try to use the playbook below to deploy RKE2, I am facing a syntax error.
$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
    - role: lablabs.rke2
Inventory file used
[masters]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server
[workers]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent
[k8s_cluster:children]
server
agent
The entire cluster should be deployed with the required manifests and config information.
During the deployment of the server, the "Restore etcd" block of the first_server.yml task got stuck for a long time. I commented out the block and proceeded to complete the deployment.
I've installed RKE2 with Ansible and then installed the prometheus-stack chart; in the targets, etcd, Scheduler, and Controller-manager can't be reached.
Bug Report
ansible 2.10.17
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
### Steps to Reproduce
Install rke2.
Install prometheus-stack chart
Check targets for monitoring in prometheus
### Expected Results
All services are reachable
### Actual Results
can't get targets Etcd, Scheduler, Controller-manager
I have a Vagrantfile that provisions three boxes running AlmaLinux 8 via libvirt, which use the Ansible provisioner to include this role.
I have no agent nodes, thus I'm not tainting the server nodes. I ran into no issues when provisioning a single node cluster, but I run into issues when specifying multiple server nodes in my Ansible inventory.
In High Availability mode:
I run into the following error on the task: Create keepalived config file
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'.
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'
I was able to get past this issue by changing the below line to {{ hostvars[host].ansible_host }}
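For illustration only (the exact original line is not quoted above, so the "before" part is an assumption on my side based on the error message), the change in templates/keepalived.conf.j2 was along these lines:

-    {{ hostvars[host].ansible_default_ipv4.address }}
+    {{ hostvars[host].ansible_host }}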
Bug Report
ansible [core 2.14.1]
config file = None
configured module search path = ['/home/austin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/austin/.local/lib/python3.11/site-packages/ansible
ansible collection location = /home/austin/.ansible/collections:/usr/share/ansible/collections
executable location = /home/austin/.local/bin/ansible
python version = 3.11.1 (main, Dec 11 2022, 15:18:51) [GCC 10.2.1 20201203] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True
Have libvirt setup and the vagrant-libvirt plugin installed along with Vagrant, Ansible, and this role.
Below are the three files necessary when running vagrant up:
Vagrantfile
NODES = [
  { hostname: "controller1", ip: "192.168.111.2", ram: 4096, cpu: 2 },
  { hostname: "controller2", ip: "192.168.111.3", ram: 4096, cpu: 2 },
  { hostname: "controller3", ip: "192.168.111.4", ram: 4096, cpu: 2 }
]

Vagrant.configure(2) do |config|
  NODES.each do |node|
    config.vm.define node[:hostname] do |config|
      config.vm.hostname = node[:hostname]
      config.vm.box = "almalinux/8"
      config.vm.network :private_network, ip: node[:ip]
      config.vm.provider :libvirt do |domain|
        domain.memory = node[:ram]
        domain.cpus = node[:cpu]
      end
      config.vm.provision :ansible do |ansible|
        ansible.playbook = "playbooks/provision.yml"
        ansible.inventory_path = "inventory/hosts.ini"
      end
    end
  end
end
playbooks/provision.yml
- hosts: all
  become: true
  vars:
    rke2_channel: stable
    rke2_servers_group_name: rke2_servers
    rke2_agents_group_name: rke2_agents
    rke2_ha_mode: true
  roles:
    - lablabs.rke2
inventory/hosts.ini
[rke2_servers]
controller1 ansible_host=192.168.111.2 rke2_type=server
controller2 ansible_host=192.168.111.3 rke2_type=server
controller3 ansible_host=192.168.111.4 rke2_type=server
[rke2_agents]
[k8s_cluster:children]
rke2_servers
rke2_agents
Three server nodes should be provisioned after running vagrant up.
All servers fail to provision with the rke2 Ansible role.
Hi,
this happens on rke2_version values other than the default (v1.21.2+rke2r1).
Tested on:
rke2_version: v1.21.5+rke2r2
rke2_version: v1.21.6+rke2r1
It is failing on waiting for the first server to be ready.
TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************************************************************
FAILED - RETRYING: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: Wait for the first server be ready (39 retries left).
Log from rke2-server:
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Active TLS secret rke2-serving (ver=296) (count 9): map[listener.cattle.io/cn-10.149.100.141:10.149.100.141 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-rke-test-1-dc1-mgmt:rke-test-1-dc1-mgmt listener.cattle.io/fingerprint:SHA1=9110B4A56B425CE1ADF5FD2C7E4536E2E0097175]"
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-cilium\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-kube-proxy\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-kube-proxy.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-multus\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-multus.yaml\""
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=rke-test-1-dc1-mgmt --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Stopped tunnel to 127.0.0.1:9345"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Connecting to proxy" url="wss://10.149.100.141:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Handling backend connection request [rke-test-1-dc1-mgmt]"
I upgraded rke2 from v1.22.9 to v1.23.9 which actually worked fine, but I noticed that some worker nodes were upgraded in between the master nodes which goes against RKE2 recommendations:
Note: Upgrade the server nodes first, one at a time. Once all servers have been upgraded, you may then upgrade agent nodes.
see https://docs.rke2.io/upgrade/basic_upgrade/
Ansible Output:
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-0] ***
changed: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-1] ***
changed: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-1] ***
changed: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-2] ***
changed: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-0] ***
skipping: [platform-rancher-master-k8s-worker-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-0] ***
Bug Report
ansible [core 2.12.7]
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.10/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/local/bin/ansible
python version = 3.10.5 (main, Jul 13 2022, 05:45:22) [GCC 10.2.1 20210110]
jinja version = 3.1.2
libyaml = True
trigger a RKE2 upgrade, i.e. from 1.22.9 to 1.23.9
Master nodes should be upgraded first, then the worker nodes
Nodes are upgraded seemingly randomly
When using an HA setup with Keepalived, the server certificates provisioned for the Kubelet do not include the Keepalived VIP. This causes TLS verification issues when performing various operations like viewing logs or port forwarding on the current leader.
Bug Report
ansible [core 2.14.6]
config file = /Users/moray/.ansible.cfg
configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/7.6.0/libexec/lib/python3.11/site-packages/ansible
ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible
python version = 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] (/opt/homebrew/Cellar/ansible/7.6.0/libexec/bin/python3.11)
jinja version = 3.1.2
libyaml = True
- hosts: rke
  become: true
  roles:
    - role: lablabs.rke2
      vars:
        rke2_ha_mode: true
        rke2_ha_mode_keepalived: true
        rke2_version: v1.26.7+rke2r1
        rke2_install_bash_url: https://get.rke2.io
        rke2_api_ip: 10.64.0.9
        rke2_disable:
          - rke2-ingress-nginx
        rke2_cni: canal
        rke2_cluster_group_name: rke
        rke2_servers_group_name: rke_master
        # Ansible group including worker nodes
        rke2_agents_group_name: rke_worker
        rke2_server_options:
          - "disable-cloud-controller: true"
The TLS certificate generated for the Kubelet should include the Keepalived VIP (10.64.0.9 in the example above), so that issuing kubectl logs and kubectl port-forward commands on pods on the current leader works without problem.
The TLS certificate for the Kubelet does not include the Keepalived VIP (10.64.0.9 in the example above). Issuing kubectl logs or kubectl port-forward commands on pods on the current leader results in the following error:
Error from server: Get "https://10.64.0.9:10250/containerLogs/kube-system/kube-proxy-master-0/kube-proxy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, 10.64.0.10, not 10.64.0.9
Additional information: I tried setting node-ip and advertise-address to the non-virtual IP, but to no avail.
We are currently running kubespray in HA with haproxy. Is it possible to do the same here?
I saw that keepalived is mandatory in ha-mode.
In the troubleshooting section here: https://github.com/lablabs/ansible-role-rke2#troubleshooting, it mentions that it might be a network limitation.
The problem is that the RKE2 script is never executed on the agent, because the task has a condition on the variable installed_rke2_version, while that variable depends on the condition '"rke2-server.service" in ansible_facts.services'.
Below are the changes I made to fix the issue.
Before the 'Run AirGap RKE2 script' task (ansible-role-rke2/tasks/rke2.yml, around lines 89-91 at dc6d426), add:
- name: Check rke2 bin exists
  ansible.builtin.stat:
    path: "{{ rke2_bin_path }}"
  register: rke2_exists

- name: Check RKE2 version
  ansible.builtin.shell: |
    set -o pipefail
    {{ rke2_bin_path }} --version | grep -E "rke2 version" | awk '{print $3}'
  args:
    executable: /bin/bash
  changed_when: false
  register: installed_rke2_version
  when: rke2_exists.stat.exists
Bug Report
ansible [core 2.14.2]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (/usr/bin/python3)
jinja version = 3.0.3
libyaml = True
- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_version: v1.26.0+rke2r2
    rke2_api_ip: 192.168.1.10
    rke2_download_kubeconf: true
    rke2_server_node_taints:
      - 'CriticalAddonsOnly=true:NoExecute'
    rke2_cni:
      - cilium
  roles:
    - role: lablabs.rke2
[masters]
master-01 ansible_host=192.168.1.10 rke2_type=server
master-02 ansible_host=192.168.1.11 rke2_type=server
master-03 ansible_host=192.168.1.12 rke2_type=server
[workers]
worker-01 ansible_host=192.168.1.20 rke2_type=agent
worker-02 ansible_host=192.168.1.21 rke2_type=agent
[k8s_cluster:children]
masters
workers
Worker nodes should be provisioned once the rke2.sh script has been executed by the following task (ansible-role-rke2/tasks/rke2.yml, line 100 at dc6d426).
It's just hanging until the timeout.
I'm using a user with passwordless sudo.
TASK [lablabs.rke2 : Replace loopback IP by master server IP] **********************************************************************************************************
fatal: [rke-test-3-dc3-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
fatal: [rke-test-1-dc1-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
While trying to add a new worker node to an existing HA cluster, the role restarts the RKE2 services on all existing master and worker nodes. It also takes a long time to complete the Ansible run, which could be improved.
While adding new nodes I commented out "- name: Wait for remaining nodes to be ready" in remaining_nodes.yml, and also commented out the "Rolling restart" task in main.yml.
This improved the start of services on the newly added worker node.
Could we have parameters to support adding new workers to an existing cluster?
Question: even during a new server deployment, do we need "- name: Wait for remaining nodes to be ready" in remaining_nodes.yml?
Feature Idea
While trying to deploy RKE2 in HA mode with kube-vip, it's failing.
I would like to confirm that the playbook.yml parameters are correct for a kube-vip deployment on bare-metal servers.
Bug Report
ansible [core 2.12.7]
config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
When I try to use the playbook below to deploy RKE2, I am facing a syntax error.
$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
    - role: lablabs.rke2
Inventory file used
[server]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server
[agent]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent
[All:children]
server
agent
RKE2 servers should be deployed with HA.
After deploying the mentioned playbook, I am facing a proxy error and eventually the services go down.
Setting server taint leads to a broken config.yaml. At least for me the task
- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints}} + [ 'CriticalAddonsOnly=true:NoExecute' ] "
  when: rke2_server_taint and rke2_type == 'server'
leads to a broken list in config.yaml, as the combined_node_taints variable is treated as a string instead of a list.
The template line in question:
{% for taint in combined_node_taints %}
- {{ taint }}
The correct syntax should be:
- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints + [ 'CriticalAddonsOnly=true:NoExecute' ] }}"
  when: rke2_server_taint and rke2_type == 'server'
This applies both to tasks/first_server.yml and to tasks/remaining_nodes.yml.
I do not understand why this seems to be a problem only for me?
Bug Report
ansible [core 2.13.1]
config file = None
configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/user/.local/lib/python3.9/site-packages/ansible
ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
executable location = /home/user/.local/bin/ansible
python version = 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]
jinja version = 3.0.3
libyaml = True
Run the playbook with server taint enabled
in config.yaml:
node-taint:
- CriticalAddonsOnly=true:NoExecute
in config.yaml:
node-taint:
- [
- ]
-
- [
- C
- r
...
When trying to restore an etcd snapshot, the "Restore etcd" block in the first_server.yml file is skipped due to the following condition: 'and ( "rke2-server.service" is not in ansible_facts.services )'.
Trying without this second condition, it works, with only 'when: rke2_etcd_snapshot_file'.
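In other words, for my case the block only needs to be gated on the snapshot file variable; a sketch of the condition I ended up with (the tasks inside the block stay as they are in first_server.yml):

- name: Restore etcd
  when: rke2_etcd_snapshot_file
  block:
    # ... unchanged tasks from first_server.yml ...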
Bug Report
ansible [core 2.13.3]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/exploit/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/exploit/.local/lib/python3.8/site-packages/ansible
ansible collection location = /home/exploit/.ansible/collections:/usr/share/ansible/collections
executable location = /home/exploit/.local/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = True
The snapshot should be restored.
The snapshot is not restored.
Update: not a bug. Solved by running the Ansible controller on a Linux VM (instead of the problematic WSL2) and disabling firewalld in every VM, as indicated in the documentation:
Firewalld conflicts with RKE2's default Canal (Calico + Flannel) networking stack. To avoid unexpected behavior, firewalld should be disabled on systems running RKE2.
I tried to install lablabs.rke2 (1.12.0) on 5 VMs running Rocky Linux 8.6 (3 masters and 2 workers) with HA + kube-vip. However, the 'Wait for the first server be ready' task failed after 40 retries. The error message is:
/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory
I checked the first server but didn't find the /var/lib/rancher directory. I only found a weird /etc/rancher/rke2/config.yaml on the first server.
Bug Report
ansible [core 2.13.1]
config file = None
configured module search path = ['/home/<redacted>/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/<redacted>/.local/lib/python3.10/site-packages/ansible
ansible collection location = /home/<redacted>/.ansible/collections:/usr/share/ansible/collections
executable location = /home/<redacted>/.local/bin/ansible
python version = 3.10.5 (main, Jun 11 2022, 16:53:24) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = False
Commands:
$ ansible-galaxy install lablabs.rke2
$ ansible-playbook -i hosts playbook.yaml
$ ansible-playbook -i hosts playbook.yaml
P.S. The first run failed on the "Start RKE2 service on the first server" task because rke2-server.service wasn't ready yet.
playbook.yaml:
---
- name: Deploy RKE2
  hosts: k8s_cluster
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_interface: ens192
    rke2_loadbalancer_ip_range: 192.168.3.191-192.168.3.199
    rke2_server_taint: true
    rke2_api_ip: 192.168.3.190
    rke2_download_kubeconf: true
  roles:
    - role: lablabs.rke2
hosts:
k8s_cluster:
  children:
    masters:
      hosts:
        master01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master01.mydomain.com
          rke2_type: server
        master02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master02.mydomain.com
          rke2_type: server
        master03.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master03.mydomain.com
          rke2_type: server
    workers:
      hosts:
        worker01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker01.mydomain.com
          rke2_type: agent
        worker02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker02.mydomain.com
          rke2_type: agent
The cluster is deployed.
TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (39 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (38 retries left).
...
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (3 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (2 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (1 retries left).
fatal: [master01.mydomain.com]: FAILED! => {"attempts": 40, "changed": false, "cmd": "set -o pipefail\n/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep \"master01.mydomain.com\"\n", "delta": "0:00:00.005873", "end": "2022-07-12 11:50:39.491521", "msg": "non-zero return code", "rc": 1, "start": "2022-07-12 11:50:39.485648", "stderr": "/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory", "stderr_lines": ["/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory"], "stdout": "", "stdout_lines": []}
...
TASK [lablabs.rke2 : Download RKE2 kubeconfig to localhost] ************************************************************
fatal: [master02.mydomain.com -> master01.mydomain.com]: FAILED! => {"changed": false, "msg": "the remote file does not exist, not transferring, ignored"}
...
PLAY RECAP *************************************************************************************************************
master01.mydomain.com : ok=15 changed=1 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
master02.mydomain.com : ok=10 changed=0 unreachable=0 failed=1 skipped=15 rescued=0 ignored=0
master03.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
worker01.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
worker02.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
The best way to install Cilium is to install RKE2 with no CNI and then use the Cilium CLI to install it, but then the node shows a status of NotReady, and the role checks for a "Ready" status on the pod network. I have tried commenting this check out, and also changing the expected stdout to "NotReady", but it still does not work. Am I missing something?
Feature Idea
The recently contributed rolling restart task is simply restarting the rke2 service on each node in order, but in my experience the service status being ready in systemd doesn't necessarily mean the node is actually ready and good to go, especially when upgrading master nodes.
I think we should add health checks and only restart the next node once the previous one is confirmed to be up and running again to avoid any potential cluster breakage, at least for the master nodes. This is less serious for worker nodes.
Currently we're using an extra playbook to run these tasks serially, because there isn't a straightforward way to integrate this into the Ansible role.
Posting the playbook below for inspiration, maybe somebody has a better idea how to get this into the role.
Edit: The one minute pause is there, because I noticed that sometimes health checks will work directly after restarting the rke2 service, but then potentially fail for the next few minutes until being ready again
---
- name: Restart RKE2 service and check health
  hosts: masters
  become: yes
  serial: 1
  tasks:
    - name: Restart RKE2 server on master nodes
      ansible.builtin.service:
        name: "rke2-server.service"
        state: restarted

    - name: Pause for 1 minute
      pause:
        minutes: 1

    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/server/tls/client-ca.crt --key /var/lib/rancher/rke2/server/tls/client-ca.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5

    - name: Healthcheck RKE2
      ansible.builtin.command: curl -k https://127.0.0.1:9345/v1-rke2/readyz
      register: healthcheck_rke_result
      until: healthcheck_rke_result.rc == 0
      retries: 100
      delay: 5

- name: Restart RKE2 service and check health
  hosts: workers
  become: yes
  serial: "30%"
  tasks:
    - name: Restart RKE2 server on worker nodes
      ansible.builtin.service:
        name: "rke2-agent.service"
        state: restarted

    - name: Pause for 1 minute
      pause:
        minutes: 1

    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5
Imagine your inventory comes from e.g. NetBox.
That means you don't control the order in which the servers appear.
So if you add a new server and run Ansible to add it, it can happen that the two existing servers are listed after the new first one.
It would therefore be better to always look for an active server, and only if no active server is found, use the first server.
Feature Idea
Hi, Thank you for this role.
I have a question about the generation of the config file /etc/rancher/rke2/config.yaml. I'm not sure how, but only the first server node receives the right config with tls-san. The other server nodes contain just:
server: https://<keepalived_IP>:9345
token:
snapshotter: overlayfs
So with the rke2.yaml there might be a TLS issue due to the missing Keepalived IP on the other server nodes.
Do you encounter this issue?
By default RKE2 sets:
In a manual installation, those values can be altered in rke2/config.yaml:
cluster-cidr: 10.1.0.0/16
service-cidr: 10.2.0.0/16
The goal is to add support for the cluster-cidr and service-cidr options in this role.
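Until dedicated variables exist, a possible interim sketch is to pass them through rke2_server_options, assuming the role renders each entry verbatim into config.yaml as it does for the other server options:

rke2_server_options:
  - "cluster-cidr: 10.1.0.0/16"
  - "service-cidr: 10.2.0.0/16"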
Feature Idea
Jul 01 13:23:15 server04 rke2[997]: time="2022-07-01T13:23:15+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
We get this warning when the token is just a plain token for the agents, but the agents should actually get the full token from /var/lib/rancher/rke2/server/node-token on one of the master nodes, I guess.
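A rough sketch of what I mean, fetching the full token from the first server and reusing it for the agents (the variable names here are my own, not the role's):

- name: Read the full node token from the first server
  ansible.builtin.slurp:
    src: /var/lib/rancher/rke2/server/node-token
  register: rke2_node_token_file
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  run_once: true

- name: Use the full token for the agents
  ansible.builtin.set_fact:
    rke2_token: "{{ rke2_node_token_file.content | b64decode | trim }}"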
Bug Report
ansible [core 2.11.12]
config file = /home/benji/code/ansible-infrastructure/ansible.cfg
configured module search path = ['/home/benji/code/ansible-infrastructure/library']
ansible python module location = /home/benji/.pyenv/versions/3.9.7/envs/ansible-infrastructure-3.9.7/lib/python3.9/site-packages/ansible
ansible collection location = /home/benji/.ansible/collections:/usr/share/ansible/collections
executable location = /home/benji/.pyenv/versions/ansible-infrastructure-3.9.7/bin/ansible
python version = 3.9.7 (default, Apr 7 2022, 12:58:08) [GCC 9.4.0]
jinja version = 3.0.1
libyaml = True
Expect not to get the warning.
...
Inside /templates/keepalived.conf.j2 the Ansible fact "ansible_default_ipv4" is used three times.
If you want to run an IPv6 / hybrid cluster whose main address is IPv6, you will run into one of these two problems:
When no IPv4 interface is configured:
The execution fails with a fatal error and Ansible will stop the provisioning.
When both IPv4 and IPv6 interfaces are configured:
It will setup the keepalived VRRP on the wrong interface.
I myself have just changed all occurrences of "ansible_default_ipv4" to "ansible_default_ipv6" as a workaround.
If possible, it would be nice if you could integrate a check which evaluates whether the provided node IP(s) are IPv4 or IPv6 and then, based on that check, use the corresponding Ansible fact. An even easier option would be to use a role variable to enable IPv6, as sketched below.
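A sketch of that second idea in templates/keepalived.conf.j2 (rke2_keepalived_ipv6 is a variable name I made up; the role does not have it today):

{# pick the fact family based on a role variable, then use keepalived_src_ip where the template previously used ansible_default_ipv4.address #}
{% if rke2_keepalived_ipv6 | default(false) %}
{% set keepalived_src_ip = ansible_default_ipv6.address %}
{% else %}
{% set keepalived_src_ip = ansible_default_ipv4.address %}
{% endif %}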
Bug Report
ansible [core 2.14.2]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/anon/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.10/site-packages/ansible
ansible collection location = /home/anon/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0] (/usr/bin/python)
jinja version = 3.1.2
libyaml = True
Inventory:
[masters]
rancher-server01 ansible_host=[IPv6-Address] rke2_type=server
rancher-server02 ansible_host=[IPv6-Address] rke2_type=server
rancher-server03 ansible_host=[IPv6-Address] rke2_type=server
[workers]
rancher-node01 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node02 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node03 ansible_host=[IPv6-Address] rke2_type=agent
[k8s_cluster:children]
masters
workers
Provisioning should run through all steps as usual and bind the keepalived VRRP to the IPv6 interface.
Provisioning fails at keepalived VRRP if no IPv4 is set up, or uses the wrong interface in an IPv4/IPv6 hybrid setup.
Currently, on upgrades, this playbook only installs the new version and restarts the rke2 service without draining beforehand.
I think a cordon, then a drain could be a good feature before these tasks.
I could implement this feature.
Do you think it is a good idea? If it is, do you have any suggestions?
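If it helps, the shape I had in mind is roughly the following, run per node before the existing restart task (a sketch only; the kubectl and kubeconfig paths follow the ones already used elsewhere in the role):

- name: Cordon and drain the node before upgrading
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  changed_when: true

- name: Uncordon the node after the restart
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    uncordon {{ inventory_hostname }}
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  changed_when: true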
Feature Idea
As described in the server configuration documentation, it is possible to define a separate token for agent nodes to use, one that does not expose all the etcd secrets the way the server token does.
Please allow setting an agent token in addition to the server token on server nodes.
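In the meantime this can probably be passed through rke2_server_options on the server nodes, since agent-token is a plain RKE2 server config key (vault_rke2_agent_token below is just an example variable name):

rke2_server_options:
  - "agent-token: {{ vault_rke2_agent_token }}"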
Feature Idea
When trying to install an RKE2 cluster, I receive an error.
I'm installing a single server:
[masters]
team-edge1-k8s
fatal: [team-edge1-k8s]: FAILED! => {"msg": "The conditional check 'rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services )' failed. The error was: template error while templating string: expected token ')', got '.'. String: {% if rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services ) %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/first_server.yml': line 40, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Create the RKE2 etcd snapshot dir\n ^ here\n"}
Bug Report
ansible [core 2.12.10]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_download_kubeconf: true
    rke2_interface: ens160
    rke2_version: v1.24.7+rke2r1
    rke2_disable: rke2-ingress-nginx
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
          - 'https://harbor.intent.ai'
        rewrite: '"^rancher/(.*)": "harbor.int.ai/rancher/$1"'
  roles:
    - role: ansible-role-rke2
I expect a clean installation but receive a conditional check error.
-edge1-k8s : ok=14 changed=4 unreachable=0 failed=1 skipped=14 rescued=0 ignored=0
If multiple master node servers try to join the cluster concurrently, the rke2-server service fails on one or more master nodes, with the following error:
"Failed to start Rancher Kubernetes Engine v2 (server)."
When the rke2-server service fails on a host, Ansible considers the failure fatal and stops executing the following tasks on the respective host.
The result is that the node will eventually join the cluster, because systemd keeps restarting the service until activation succeeds, but the playbook stops executing the tasks on that particular host.
I resolved this issue by adding a retry on the task "Start RKE2 service on the rest of the nodes" in the file remaining_nodes.yml.
Here's the commit on my forked project: GabriFedi97@70abe0d
There is an issue on rke2 that explains the problem: rancher/rke2#349
Custom registry configs don't work without defining a mirror and manually restarting rke2-agent on all nodes.
Bug Report
ansible [core 2.14.4]
config file = /Users/moray/.ansible.cfg
configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/7.4.0/libexec/lib/python3.11/site-packages/ansible
ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible
python version = 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] (/opt/homebrew/Cellar/ansible/7.4.0/libexec/bin/python3.11)
jinja version = 3.1.2
libyaml = True
rke2_custom_registry_configs:
  - endpoint: registry.example.com
    config:
      auth:
        username: "REDACTED"
        password: "REDACTED"
rke2_custom_registry_mirrors:
  - name: dummy.example.com
    endpoint:
      - https://dummy.example.com
rke2_custom_registry_configs:
  - endpoint: registry.example.com
    config:
      auth:
        username: "REDACTED"
        password: "REDACTED"
Then restart rke2-agent on all nodes.
Stopping at step 2 should be enough to update the registry configuration.
- Stopping at step 2 doesn't update the registry configs
- Stopping at step 3 produces 401 errors from the registry on image pull
Changes made in /etc/rancher/rke2/config.yaml need a restart,
i.e. an added tls-san, a changed network plugin, or a changed cluster-cidr.
Feature Idea
The role currently uses the Ansible copy module to copy custom manifests. With this module we cannot use variables in the files/manifests.
If we need variable interpolation in the copied files, we probably need to use the ansible.builtin.template module.
The main idea is to replace the copy module with template, as sketched below.
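A minimal sketch of the idea (the task shape and destination path are assumptions on my side, not the role's current implementation):

- name: Deploy custom manifests (rendered with Jinja)
  ansible.builtin.template:
    src: "{{ item }}"
    dest: "/var/lib/rancher/rke2/server/manifests/{{ item | basename | regex_replace('\\.j2$', '') }}"
    mode: "0644"
  loop: "{{ rke2_custom_manifests }}"
  when: rke2_custom_manifests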
Feature Idea
As of the latest release, disable_kube_proxy defaults to true, thus when creating an RKE2 cluster without specifying a CNI (default is Canal) you get a broken cluster.
I assume we'd want disable_kube_proxy to default to false.
Bug Report
-
Leaving disable_kube_proxy and rke2_cni to their defaults.
rke2_cni: canal
disable_kube_proxy: true
Working cluster.
Broken cluster.
Hello,
How can I enable dual-stack networking when initializing the k8s cluster (with IPv4/IPv6 addresses supplied via main.yaml)?
Thank you.
Feature Idea
It seems the keepalived configuration is not working quite right. I have a cluster made up of 3 masters and switched off the first one; failover to the second one worked as expected. But when I turn the first master back on, it gets priority again, despite failing the script check for port 6443. journald logs:
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Assigned address 192.168.1.221 for interface ens160
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Registering gratuitous ARP shared channel
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) removing VIPs.
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Entering BACKUP STATE (init)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP sockpool: [ifindex( 2), family(IPv4), proto(112), fd(11,14), unicast, address(192.168.1.221)]
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_rke2server` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_rke2server) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 150 to 145
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_apiserver` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_apiserver) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 145 to 140
Apr 06 08:42:24 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) received lower priority (139) advert from 192.168.1.222 - discarding
Even with both scripts failing priority stays at 140, while the first backup server only has a priority of 139 (not quite sure why that is actually, the scripts should be working there), so it will never switch to the backup server (except when the first one is completely offline)
logs of second master while the first one was shut down:
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Receive advertisement timeout
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering MASTER STATE
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) setting VIPs.
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Master received advert from 192.168.1.221 with higher priority 140, ours 139
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering BACKUP STATE
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) removing VIPs.
I suspect something is off with the weights in the keepalived config
Can you implement etcd snapshots on MinIO or S3?
In rke2 there are:
etcd-expose-metrics: false
etcd-snapshot-name: "prefix_name"
etcd-snapshot-schedule-cron: "0 */1 * * *"
etcd-snapshot-retention: 360
etcd-s3: true
etcd-s3-region: "eu-west-1"
etcd-s3-endpoint: "s3-eu-west-1.amazonaws.com"
etcd-s3-bucket: "my-bucket"
etcd-s3-folder: "rke2-test"
etcd-s3-access-key: "AKIA2Q..."
etcd-s3-secret-key: "jkN0xL...."
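Assuming the role's generic rke2_server_options variable (mentioned in the next report below) accepts arbitrary config.yaml keys as a list of "key: value" strings, the snapshot settings above could probably be passed along these lines; a sketch only, with placeholder credentials:

rke2_server_options:
  - "etcd-snapshot-name: prefix_name"
  - "etcd-snapshot-schedule-cron: '0 */1 * * *'"
  - "etcd-snapshot-retention: 360"
  - "etcd-s3: true"
  - "etcd-s3-region: eu-west-1"
  - "etcd-s3-endpoint: s3-eu-west-1.amazonaws.com"
  - "etcd-s3-bucket: my-bucket"
  - "etcd-s3-folder: rke2-test"
  - "etcd-s3-access-key: AKIA2Q..."
  - "etcd-s3-secret-key: jkN0xL...."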
Feature Idea
Hello,
Thanks for the role, it's working great!
I have a full config file for RKE2 server with multiple parameters (etcd-s3)
I tried to use it directly with
rke2_server_options: "{{ lookup('file', '../config.yml') | from_yaml }}"
but unfortunately it only writes the keys, not the values, into the file generated from the template.
It would be marvellous to be able to use my own config file.
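One workaround that should produce the expected "key: value" lines, assuming rke2_server_options is a list of strings (a sketch, not part of the role):

- name: Build rke2_server_options from an existing config file
  ansible.builtin.set_fact:
    rke2_server_options: "{{ (rke2_server_options | default([])) + [item.key ~ ': ' ~ (item.value | to_json)] }}"
  loop: "{{ lookup('file', '../config.yml') | from_yaml | dict2items }}"

The to_json filter is there so that list and boolean values are not flattened into bare strings.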
Hello,
First of all, thank you for this great work! I love it!
May I suggest a new feature ?
I would be very interested in an option to restore an etcd snapshot at installation time.
I mean:
Do you think it could be possible?
Best regards,
Olivier
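For context, RKE2 itself exposes a snapshot restore via rke2 server --cluster-reset --cluster-reset-restore-path=<snapshot>, so such a feature would mostly need a task wrapping that command on the first server. A rough sketch; the rke2_etcd_snapshot_restore_path variable is hypothetical and not part of the role:

- name: Restore etcd from a snapshot before configuring the cluster
  ansible.builtin.command: >-
    rke2 server --cluster-reset
    --cluster-reset-restore-path={{ rke2_etcd_snapshot_restore_path }}
  when: rke2_etcd_snapshot_restore_path is defined
  run_once: true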
In the default values in defaults/main.yaml, disable_kube_proxy: true is commented out. This breaks every stage concerning the first node, because the variable is referenced while undefined.
Bug Report
ansible [core 2.14.4]
config file = None
configured module search path = ['/home/cloudadm/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/cloudadm/.local/lib/python3.9/site-packages/ansible
ansible collection location = /home/cloudadm/.ansible/collections:/usr/share/ansible/collections
executable location = /home/cloudadm/.local/bin/ansible
python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True
Run the role without explicitly assigning a value to disable_kube_proxy.
The role should run smoothly with disable_kube_proxy set to true
The role breaks when it calls the variable during a check.
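The smallest fix is probably to give the variable a defined default instead of a commented-out line (a sketch):

# defaults/main.yaml
disable_kube_proxy: false

Alternatively, the tasks that reference it could guard the check with disable_kube_proxy | default(false).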
Hello,
Currently rke2_loadbalancer_ip_range only sets range-global, so we can't give kube-vip a subnet scoped to a specific namespace.
rke2_loadbalancer_ip_range should be a dict like this:
rke2_loadbalancer_ip_range:
  range-global: 192.168.1.50-192.168.1.100
  range-namespace: 192.168.2.50-192.168.2.100
If you agree with the idea I can open a PR
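For reference, kube-vip-cloud-provider reads these ranges from a ConfigMap named kubevip in kube-system, so a dict like the one above would roughly render to the following (the namespace key name is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.1.50-192.168.1.100
  range-team-a: 192.168.2.50-192.168.2.100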
Feature Idea
The CI job with molecule test is failing.
The pipeline originally started to fail when Molecule 5.0.0 became available on PyPI.
One issue is the Docker plugin (it now needs to be installed as the molecule-plugins package instead of molecule[docker]).
But even after that change the pipeline fails with different errors; this needs to be investigated and fixed.
Some hints: ansible/molecule#3883
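A sketch of what the dependency install step in the workflow could look like (version pins are illustrative):

# step in the GitHub Actions workflow
- name: Install test dependencies
  run: pip install "molecule>=5" "molecule-plugins[docker]"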
Bug Report
Run github action CI job with molecule test
Example: https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181
The Molecule test should not fail with a Python error.
https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181
When I tried running the playbook again on the cluster to increase the number of nodes, I hit an issue where the playbook looks for /tmp/rke2.sh on the old nodes; but my cluster nodes have rebooted since they were initialized, so this is what I am getting:
{"changed": false, "msg": "file (/tmp/rke2.sh) is absent, cannot continue", "path": "/tmp/rke2.sh", "state": "absent"}
Any idea how to bypass this?
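A workaround, until the role handles a missing installer gracefully, might be to re-fetch the script before re-running the role (a sketch using the same download location; not part of the role):

- name: Ensure the RKE2 install script is present again
  ansible.builtin.get_url:
    url: https://get.rke2.io
    dest: /tmp/rke2.sh
    mode: "0755"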
When changing the rke2_version variable, the new version is installed but doesn't take effect until the rke2-server service is restarted.
We should look at adding an Ansible handler that checks whether the version has changed and, if it has, restarts each host one by one.
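A minimal sketch of the suggested handler (names are illustrative; agent nodes would need the equivalent for rke2-agent, and a real implementation would also want to serialize or drain the restarts):

# handlers/main.yaml
- name: Restart rke2-server
  ansible.builtin.service:
    name: rke2-server
    state: restarted

# and in the install/config tasks, something like:
#   notify: Restart rke2-server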
The RKE2 server process listens on a fixed IP on port 9345. If it is down, new nodes will not be able to provision.
The Keepalived script is only configured to check the apiserver. The proposal is to add a curl against port 9345 to the check_apiserver script, e.g.:
check_apiserver.sh.j2
#!/bin/sh
errorExit() {
    echo "*** $*" 1>&2
    exit 1
}
curl --silent --max-time 2 --insecure https://localhost:9345/ -o /dev/null || errorExit "Error GET https://localhost:9345/"
curl --silent --max-time 2 --insecure https://localhost:{{ rke2_apiserver_dest_port }}/ -o /dev/null || errorExit "Error GET https://localhost:{{ rke2_apiserver_dest_port }}/"
if ip addr | grep -q {{ rke2_api_ip }}; then
    curl --silent --max-time 2 --insecure https://{{ rke2_api_ip }}:{{ rke2_apiserver_dest_port }}/ -o /dev/null || errorExit "Error GET https://{{ rke2_api_ip }}:{{ rke2_apiserver_dest_port }}/"
fi
Currently every node in the RKE2 cluster must share the same primary interface for the kube-vip DaemonSet to function. If a node has a different network card, etc., we need to be able to specify that node's unique interface.
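Sketching what the requested knob might look like as a role variable (purely hypothetical; nothing like this exists in the role today):

# hypothetical variable: map node names to their kube-vip interface
rke2_kubevip_node_interfaces:
  worker-5: ens192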
Feature Idea
At the moment, when checking for an existing version, the task rke2.yaml uses the hardcoded path /usr/local/bin/rke2.
If you look at the install script from get.rke2.io, it changes the target when /usr/local is on a separate partition or read-only:
# --- install tarball to /usr/local by default, except if /usr/local is on a separate partition or is read-only
# --- in which case we go into /opt/rke2.
So the rke2 task should be able to handle the other location as well.
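A sketch of how the version check could handle both install prefixes (paths taken from the get.rke2.io comment quoted above; task names are illustrative):

- name: Find the installed rke2 binary
  ansible.builtin.stat:
    path: "{{ item }}"
  loop:
    - /usr/local/bin/rke2
    - /opt/rke2/bin/rke2
  register: rke2_binary_candidates

- name: Remember where the existing rke2 binary lives
  ansible.builtin.set_fact:
    rke2_bin_path: "{{ rke2_binary_candidates.results | selectattr('stat.exists') | map(attribute='item') | first | default('/usr/local/bin/rke2') }}"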
Hi There,
I'm trying to expand an existing cluster that currently consists of 3 nodes acting as combined master/worker nodes.
I would like to expand the cluster by adding worker-only nodes, and for that I have the following definitions:
Ansible runs fine, but all nodes are reported as control-plane nodes:
# kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready control-plane,etcd,master 90s v1.22.10+rke2r2
Am I doing something wrong, or could this be a bug?
Bug Report
ansible [core 2.13.2]
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /root/.local/bin/ansible
python version = 3.9.13 (main, Jun 10 2022, 09:50:06) [GCC]
jinja version = 3.1.2
libyaml = True
inventory.yaml
[masters]
master-1 ansible_host=master-1 rke2_type=server
master-2 ansible_host=master-2 rke2_type=server
master-3 ansible_host=master-3 rke2_type=server
[workers]
worker-1 ansible_host=worker-1 rke2_type=agent
[k8s_cluster:children]
masters
workers
main.yaml
# Install RKE2
- name: RKE2 Setup on Cluster-wide
hosts: k8s_cluster
roles:
- role: RKE2Cluster
vars.yaml
---
# RKE2 Settings
os_privileged_group: gok8sadm
rke2_type: server
rke2_airgap_mode: false
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_api_ip: 192.168.86.100
rke2_interface: eth0
rke2_loadbalancer_ip_range: 192.168.86.101-192.168.86.105
rke2_kubevip_cloud_provider_enable: true
rke2_kubevip_svc_enable: true
rke2_additional_sans: [ my-k8s-dev.hutger.xyz ]
rke2_apiserver_dest_port: 6443
rke2_disable:
- rke2-ingress-nginx
rke2_server_taint: false
rke2_token: my-token
rke2_version: v1.22.10+rke2r2
rke2_data_path: /var/lib/rancher/rke2
rke2_channel: stable
rke2_cni: canal
rke2_download_kubeconf: true
rke2_download_kubeconf_file_name: rke2.yaml
rke2_download_kubeconf_path: /tmp
rke2_servers_group_name: masters
rke2_agents_group_name: workers
No roles associated with worker-1:
kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready <none> 90s v1.22.10+rke2r2
worker-1 is being set as a master:
kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready control-plane,etcd,master 90s v1.22.10+rke2r2
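One possible explanation, offered only as a sketch to check: play-level vars (the vars.yaml above) take precedence over inventory host vars in Ansible, so the cluster-wide rke2_type: server may be overriding the rke2_type=agent set on worker-1. Dropping it from vars.yaml and keeping it per group would look roughly like this:

# group_vars/masters.yml
rke2_type: server

# group_vars/workers.yml
rke2_type: agent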