lablabs / ansible-role-rke2
Ansible Role to install RKE2 Kubernetes.
Home Page: https://galaxy.ansible.com/ui/standalone/roles/lablabs/rke2/
License: MIT License
I am having this error while running the playbook.
{"msg": "The conditional check 'inventory_hostname is in groups.masters' failed. The error was: template error while templating string: expected token 'end of statement block', got '.'. String: {% if inventory_hostname is in groups.masters %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/main.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Install Keepalived when HA mode is enabled\n ^ here\n"}
I am using the latest Ansible version, and I have tried an older version as well. Any idea what's going on here?
I just made the part for taints.
I could do something similar for nodelabels to set specific labels per node.
Feature Idea
Hi,
Thank you for a great role, made my life easier when migrating from RKE1 to RKE2.
I was first looking into rancherd but then moved back to RKE2, and I was happy when I found this role, so thank you for that.
I have a question regarding node labels.
I found that it was possible to add node labels using the k8s_node_label var.
My question is regarding the documentation for the labels (or rather the k8s_node_label var).
I haven't seen any documentation for it. I found it when looking through the source files (to see if it was possible to add node labels).
The missing documentation makes me a bit worried that it's not ready for use.
Is there a reason for me not to use it? Or is it just that it hasn't been documented yet (or did I miss it somewhere)?
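For reference, this is roughly how I ended up setting it, assuming k8s_node_label takes a list of key=value label strings and can be set per host (the host name and labels below are just examples of mine, not defaults of the role):

# host_vars/worker-01.yml (hypothetical host)
k8s_node_label:
  - node-role.kubernetes.io/storage=true
  - topology.kubernetes.io/zone=dc1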
During the initial installation of a cluster using RKE2 version 1.27.1+rke2r1 (kube-vip, Cilium as CNI, kube-proxy disabled), the first node is stuck in the NotReady state, preventing the successful completion of the cluster installation.
The workaround I found:
ip a a 192.0.2.20 dev ens224
systemctl restart rke2-server.service
Not sure why this is happening so far, possibly due to the disabling of kube proxy.
Bug Report
Ansible 2.14.8
Deploy RKE2 with the following variables:
rke2_version: v1.27.1+rke2r1
rke2_cluster_group_name: kubernetes_cluster
rke2_servers_group_name: kubernetes_masters
rke2_agents_group_name: kubernetes_workers
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_additional_sans:
- kubernetes-api.example.net
rke2_api_ip: "192.0.2.20"
rke2_kubevip_svc_enable: false
rke2_interface: "ens224"
rke2_kubevip_cloud_provider_enable: false
rke2_cni: cilium
rke2_disable:
- rke2-canal
- rke2-ingress-nginx
rke2_custom_manifests:
- rke2-cilium-proxy.yaml
disable_kube_proxy: true
rke2_drain_node_during_upgrade: true
rke2_wait_for_all_pods_to_be_ready: true
Here is the content of rke2-cilium-proxy.yaml:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict
    k8sServiceHost: {{ rke2_api_ip }}
    k8sServicePort: {{ rke2_apiserver_dest_port }}
    ipv4NativeRoutingCIDR: 10.43.0.0/15
    hubble:
      enabled: true
      metrics:
        enabled:
          - dns:query;ignoreAAAA
          - drop
          - tcp
          - flow
          - icmp
          - http
      relay:
        enabled: true
        replicas: 3
      ui:
        enabled: true
        replicas: 3
        ingress:
          enabled: false
The first server should at some point reach the Ready state, so that the installation of the cluster succeeds.
[…]
FAILED - RETRYING: [k8s01.example.net]: Wait for the first server be ready (1 retries left).
fatal: [k8s01.example.net]: FAILED! => changed=false
attempts: 40
cmd: |-
set -o pipefail
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep "k8s01.example.net"
delta: '0:00:00.096538'
end: '2023-08-23 09:31:26.649490'
msg: ''
rc: 0
start: '2023-08-23 09:31:26.552952'
stderr: ''
stderr_lines: <omitted>
stdout: k8s01.example.net NotReady control-plane,etcd,master 10m v1.27.1+rke2r1
stdout_lines: <omitted>
Environmental Info:
RKE2 Version: v1.23.7+rke2r2
Node(s) CPU architecture, OS, and Version:
Linux master-01 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 master
4 workers
airgap environment
CNI: multus,calico
Describe the bug:
When bootstrapping an RKE2 airgap cluster with CNI plugins other than the default (in my case, multus and calico), the CNI images are copied into /var/lib/rancher/rke2/artifacts/. But in order to deploy the CNIs, the tarballs (compressed zst or tar.gz) also have to be copied into /var/lib/rancher/rke2/agent/images.
Please note that I mentioned the default settings for artifacts and images.
I already opened an issue for RKE2: rancher/rke2#3147
And the response was that it is the expected behaviour, so the images of the CNIs have to be manually copied to this path.
Bug Report
ansible [core 2.12.6]
config file = /home/test/ANA/Offline-RKE2-ANA/ansible.cfg
configured module search path = ['/home/test/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/test/.local/lib/python3.8/site-packages/ansible
ansible collection location = /home/test/ANA/Offline-RKE2/collections
executable location = /home/test/.local/bin/ansible
python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = True
---
- name: RKE2 K8S BOOTSTRAPPING
  hosts: all
  gather_facts: true
  become: true
  vars_files:
    - vars/global.yaml
  vars:
    rke2_ha_mode: false
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: false
    rke2_additional_sans:
      - ana.pt
      - k8s.ana.pt
      - k8s-vmware.ana.pt
    rke2_apiserver_dest_port: 6443
    rke_server_taint: true
    rke2_token: GIzY5kxm9WRGxBekiifQ
    rke2_version: v1.23.7+rke2r2
    rke2_channel: stable
    rke2_artifact_path: /var/lib/rancher/rke2/artifacts
    rke2_airgap_copy_sourcepath: local_artifacts/local_artifacts_rke2
    rke2_cni:
      - multus
      - calico
    rke2_download_kubeconf: true
    rke2_download_kubeconf_file_name: rke2.yaml
    rke2_download_kubeconf_path: /tmp
    nexus_container_registry: "{{ nexus_ingress_cr_host }}"
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: quay.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: docker.elastic.co
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: cr.fluentbit.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: registry.gitlab.com
        endpoint:
          - "https://{{ nexus_container_registry }}"
    rke2_custom_registry_configs:
      - endpoint: "\"{{ nexus_container_registry }}\""
        config:
          tls:
            insecure_skip_verify: true
    rke2_custom_manifests:
      - roles/lablabs.rke2/files/rke2-ingress-nginx-config.yml
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst
  roles:
    - role: lablabs.rke2
RKE2 nodes should be in the Ready state.
The first server can't find the multus or calico images. The artifact files are copied to `{{ rke2_artifact_path }}` but not to `agent/images`.
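As a workaround I copy the CNI tarballs into the agent images directory myself before starting the service. A minimal sketch, not part of the role; the source path and filenames are assumptions based on my setup above:

- name: Copy CNI image tarballs to the RKE2 agent images dir (manual workaround)
  ansible.builtin.copy:
    src: "local_artifacts/local_artifacts_rke2/{{ item }}"
    dest: "/var/lib/rancher/rke2/agent/images/{{ item }}"
    mode: "0644"
  loop:
    - rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst
    - rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst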
Hey, I've tried out this role and am already a fan of it over Rancher's role as it is published on ansible-galaxy, which doesn't seem to be on the roadmap for that repo.
One thing I'd love is support for running CIS hardening as part of RKE2 Security Hardening guide. It's included in the rancherfederal repo here.
I can add the following as a var:
rke2_server_options:
- "profile: cis-1.6"
But I get the following error in the logs:
Jun 28 04:12:21 [HOSTNAME] sh[90630]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 28 04:12:22 [HOSTNAME] rke2[90634]: time="2022-06-28T04:12:22Z" level=fatal msg="missing required: user: unknown user etcd\nmissing required: group: unknown group etcd\ninvalid kernel parameter value vm.overcommit_memory=0 - expected 1\ninvalid kernel parameter value kernel.panic=0 - expected 10\ninvalid kernel parameter value kernel.panic_on_oops=0 - expected 1\n"
Jun 28 04:12:22 [HOSTNAME] systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Jun 28 04:12:22 [HOSTNAME] systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Jun 28 04:12:22 [HOSTNAME] systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).
which is because it's failing in the CIS checks, as noted from the docs:
Checks that host-level requirements have been met. If they haven't, RKE2 will exit with a fatal error describing the unmet requirements.
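For anyone hitting the same thing: the unmet requirements from the log above (the etcd user/group and the kernel parameters) can be prepared before the role runs. A rough sketch of pre_tasks I used, not part of this role, with the values taken from the fatal message:

pre_tasks:
  - name: Create etcd group required by the CIS profile
    ansible.builtin.group:
      name: etcd
      system: true

  - name: Create etcd user required by the CIS profile
    ansible.builtin.user:
      name: etcd
      group: etcd
      system: true
      shell: /sbin/nologin
      create_home: false

  - name: Set kernel parameters expected by the CIS checks
    ansible.posix.sysctl:
      name: "{{ item.name }}"
      value: "{{ item.value }}"
      state: present
    loop:
      - { name: vm.overcommit_memory, value: "1" }
      - { name: kernel.panic, value: "10" }
      - { name: kernel.panic_on_oops, value: "1" }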
Side note: what is the background of this role? Was it created in parallel with / independently of the rancherfederal repo?
Feature Idea
While creating a new server ansible was stuck in "Restore etcd" block execution for a long time.
Bug Report
ansible [core 2.12.7]
config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
When I try to use the playbook below to deploy RKE2, I am facing a syntax error.
$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
    - role: lablabs.rke2
Inventory file used
[masters]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server
[workers]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent
[k8s_cluster:children]
server
agent
The entire cluster should be deployed with the required manifests and config information.
During the deployment of the server, the "Restore etcd" block of the first_server.yml task got stuck for a long time. I commented out the block and proceeded to complete the deployment.
I've installed RKE2 with Ansible and then installed the prometheus-stack chart; in the targets, etcd, Scheduler, and Controller-manager can't be reached.
Bug Report
ansible 2.10.17
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
executable location = /usr/local/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
### Steps to Reproduce
Install rke2.
Install prometheus-stack chart
Check targets for monitoring in prometheus
### Expected Results
All services are reachable
### Actual Results
can't get targets Etcd, Scheduler, Controller-manager
I have a Vagrantfile that provisions three boxes running AlmaLinux 8 via libvirt, which use the Ansible provisioner to include this role.
I have no agent nodes, thus I'm not tainting the server nodes. I ran into no issues when provisioning a single node cluster, but I run into issues when specifying multiple server nodes in my Ansible inventory.
In High Availability mode:
I run into the following error on the task: Create keepalived config file
An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'.
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'
I was able to get past this issue by changing the below line to {{ hostvars[host].ansible_host }}
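For illustration only (the exact original line is not quoted above, so the "before" part is an assumption on my side based on the error message), the change in templates/keepalived.conf.j2 was along these lines:

-    {{ hostvars[host].ansible_default_ipv4.address }}
+    {{ hostvars[host].ansible_host }}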
Bug Report
ansible [core 2.14.1]
config file = None
configured module search path = ['/home/austin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/austin/.local/lib/python3.11/site-packages/ansible
ansible collection location = /home/austin/.ansible/collections:/usr/share/ansible/collections
executable location = /home/austin/.local/bin/ansible
python version = 3.11.1 (main, Dec 11 2022, 15:18:51) [GCC 10.2.1 20201203] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True
Have libvirt setup and the vagrant-libvirt plugin installed along with Vagrant, Ansible, and this role.
Below are the three files necessary when running vagrant up:
Vagrantfile
NODES = [
  { hostname: "controller1", ip: "192.168.111.2", ram: 4096, cpu: 2 },
  { hostname: "controller2", ip: "192.168.111.3", ram: 4096, cpu: 2 },
  { hostname: "controller3", ip: "192.168.111.4", ram: 4096, cpu: 2 }
]

Vagrant.configure(2) do |config|
  NODES.each do |node|
    config.vm.define node[:hostname] do |config|
      config.vm.hostname = node[:hostname]
      config.vm.box = "almalinux/8"
      config.vm.network :private_network, ip: node[:ip]
      config.vm.provider :libvirt do |domain|
        domain.memory = node[:ram]
        domain.cpus = node[:cpu]
      end
      config.vm.provision :ansible do |ansible|
        ansible.playbook = "playbooks/provision.yml"
        ansible.inventory_path = "inventory/hosts.ini"
      end
    end
  end
end
playbooks/provision.yml
- hosts: all
  become: true
  vars:
    rke2_channel: stable
    rke2_servers_group_name: rke2_servers
    rke2_agents_group_name: rke2_agents
    rke2_ha_mode: true
  roles:
    - lablabs.rke2
inventory/hosts.ini
[rke2_servers]
controller1 ansible_host=192.168.111.2 rke2_type=server
controller2 ansible_host=192.168.111.3 rke2_type=server
controller3 ansible_host=192.168.111.4 rke2_type=server
[rke2_agents]
[k8s_cluster:children]
rke2_servers
rke2_agents
Three server nodes should be provisioned after running vagrant up.
All servers fail to provision with the rke2 Ansible role.
Hi,
this happens on rke2_version values other than the default (v1.21.2+rke2r1).
Tested on:
rke2_version: v1.21.5+rke2r2
rke2_version: v1.21.6+rke2r1
It is failing on waiting for the first server to be ready.
TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************************************************************
FAILED - RETRYING: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: Wait for the first server be ready (39 retries left).
Log from rke2-server:
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Active TLS secret rke2-serving (ver=296) (count 9): map[listener.cattle.io/cn-10.149.100.141:10.149.100.141 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-rke-test-1-dc1-mgmt:rke-test-1-dc1-mgmt listener.cattle.io/fingerprint:SHA1=9110B4A56B425CE1ADF5FD2C7E4536E2E0097175]"
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-cilium\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-kube-proxy\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-kube-proxy.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-multus\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-multus.yaml\""
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=rke-test-1-dc1-mgmt --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Stopped tunnel to 127.0.0.1:9345"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Connecting to proxy" url="wss://10.149.100.141:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Handling backend connection request [rke-test-1-dc1-mgmt]"
I upgraded rke2 from v1.22.9 to v1.23.9 which actually worked fine, but I noticed that some worker nodes were upgraded in between the master nodes which goes against RKE2 recommendations:
Note: Upgrade the server nodes first, one at a time. Once all servers have been upgraded, you may then upgrade agent nodes.
see https://docs.rke2.io/upgrade/basic_upgrade/
Ansible Output:
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-0] ***
changed: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-1] ***
changed: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-1] ***
changed: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-2] ***
changed: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-0] ***
skipping: [platform-rancher-master-k8s-worker-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-0] ***
Bug Report
ansible [core 2.12.7]
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/local/lib/python3.10/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/local/bin/ansible
python version = 3.10.5 (main, Jul 13 2022, 05:45:22) [GCC 10.2.1 20210110]
jinja version = 3.1.2
libyaml = True
trigger a RKE2 upgrade, i.e. from 1.22.9 to 1.23.9
Master nodes should be upgraded first, then the worker nodes
Nodes are upgraded seemingly randomly
When using an HA setup with Keepalived, the server certificates provisioned for the Kubelet do not include the Keepalived VIP. This causes TLS verification issues when performing various operations like viewing logs or port forwarding on the current leader.
Bug Report
ansible [core 2.14.6]
config file = /Users/moray/.ansible.cfg
configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/7.6.0/libexec/lib/python3.11/site-packages/ansible
ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible
python version = 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] (/opt/homebrew/Cellar/ansible/7.6.0/libexec/bin/python3.11)
jinja version = 3.1.2
libyaml = True
- hosts: rke
  become: true
  roles:
    - role: lablabs.rke2
      vars:
        rke2_ha_mode: true
        rke2_ha_mode_keepalived: true
        rke2_version: v1.26.7+rke2r1
        rke2_install_bash_url: https://get.rke2.io
        rke2_api_ip: 10.64.0.9
        rke2_disable:
          - rke2-ingress-nginx
        rke2_cni: canal
        rke2_cluster_group_name: rke
        rke2_servers_group_name: rke_master
        # Ansible group including worker nodes
        rke2_agents_group_name: rke_worker
        rke2_server_options:
          - "disable-cloud-controller: true"
The TLS certificate generated for the Kubelet should include the Keepalived VIP (10.64.0.9 in the example above), so that issuing kubectl logs and kubectl port-forward commands on pods on the current leader works without problem.
The TLS certificate for the Kubelet does not include the Keepalived VIP (10.64.0.9 in the example above). Issuing kubectl logs or kubectl port-forward commands on pods on the current leader results in the following error:
Error from server: Get "https://10.64.0.9:10250/containerLogs/kube-system/kube-proxy-master-0/kube-proxy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, 10.64.0.10, not 10.64.0.9
Additional information: I tried setting node-ip and advertise-address to the non-virtual IP, but to no avail.
We are currently running kubespray in HA with haproxy. Is it possible to do the same here?
I saw that keepalived is mandatory in ha-mode.
In the troubleshooting section here: https://github.com/lablabs/ansible-role-rke2#troubleshooting, it mentions that it might be a network limitation.
The problem is that the RKE2 script is never executed on the agent, because the task has a condition on the variable installed_rke2_version, while that variable depends on the condition '"rke2-server.service" in ansible_facts.services'.
Below are the changes I made to fix the issue.
Before the 'Run AirGap RKE2 script' task (ansible-role-rke2/tasks/rke2.yml, around lines 89-91 at dc6d426), add:
- name: Check rke2 bin exists
  ansible.builtin.stat:
    path: "{{ rke2_bin_path }}"
  register: rke2_exists

- name: Check RKE2 version
  ansible.builtin.shell: |
    set -o pipefail
    {{ rke2_bin_path }} --version | grep -E "rke2 version" | awk '{print $3}'
  args:
    executable: /bin/bash
  changed_when: false
  register: installed_rke2_version
  when: rke2_exists.stat.exists
Bug Report
ansible [core 2.14.2]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (/usr/bin/python3)
jinja version = 3.0.3
libyaml = True
- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_version: v1.26.0+rke2r2
    rke2_api_ip: 192.168.1.10
    rke2_download_kubeconf: true
    rke2_server_node_taints:
      - 'CriticalAddonsOnly=true:NoExecute'
    rke2_cni:
      - cilium
  roles:
    - role: lablabs.rke2
[masters]
master-01 ansible_host=192.168.1.10 rke2_type=server
master-02 ansible_host=192.168.1.11 rke2_type=server
master-03 ansible_host=192.168.1.12 rke2_type=server
[workers]
worker-01 ansible_host=192.168.1.20 rke2_type=agent
worker-02 ansible_host=192.168.1.21 rke2_type=agent
[k8s_cluster:children]
masters
workers
Worker nodes should be provisioned once the rke2.sh script has been executed by the following task (ansible-role-rke2/tasks/rke2.yml, line 100 at dc6d426).
It's just hanging until the timeout.
I'm using a user with passwordless sudo.
TASK [lablabs.rke2 : Replace loopback IP by master server IP] **********************************************************************************************************
fatal: [rke-test-3-dc3-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
fatal: [rke-test-1-dc1-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
While trying to add a new worker node to an existing HA cluster, the role restarts the RKE2 services on all existing master and worker nodes. It also takes a long time to complete the Ansible run, which could be improved.
While adding new nodes I commented out "- name: Wait for remaining nodes to be ready" in remaining_nodes.yml, and also commented out the "Rolling restart" task in main.yml.
This improved the start of services on the newly added worker node.
Could we have parameters to support adding new workers to an existing cluster?
Question: even during a new server deployment, do we need "- name: Wait for remaining nodes to be ready" in remaining_nodes.yml?
Feature Idea
While trying to deploy RKE2 in HA mode with kube-vip, it's failing.
I would like to confirm that the playbook.yml parameters are correct for a kube-vip deployment on bare-metal servers.
Bug Report
ansible [core 2.12.7]
config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
When I try to use the playbook below to deploy RKE2, I am facing a syntax error.
$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
    - role: lablabs.rke2
Inventory file used
[server]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server
[agent]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent
[All:children]
server
agent
RKE2 servers should be deployed with HA.
After deploying the mentioned playbook, I am facing a proxy error and eventually the services go down.
Setting server taint leads to a broken config.yaml. At least for me the task
- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints}} + [ 'CriticalAddonsOnly=true:NoExecute' ] "
  when: rke2_server_taint and rke2_type == 'server'
leads to a broken list in config.yaml, as the combined_node_taints variable is treated as a string instead of a list.
The template line in question:
{% for taint in combined_node_taints %}
- {{ taint }}
The correct syntax should be:
- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints + [ 'CriticalAddonsOnly=true:NoExecute' ] }}"
  when: rke2_server_taint and rke2_type == 'server'
This applies both to tasks/first_server.yml and to tasks/remaining_nodes.yml.
I do not understand why this seems to be a problem only for me?
Bug Report
ansible [core 2.13.1]
config file = None
configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/user/.local/lib/python3.9/site-packages/ansible
ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
executable location = /home/user/.local/bin/ansible
python version = 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]
jinja version = 3.0.3
libyaml = True
Run the playbook with server taint enabled
in config.yaml:
node-taint:
- CriticalAddonsOnly=true:NoExecute
in config.yaml:
node-taint:
- [
- ]
-
- [
- C
- r
...
When trying to restore an etcd snapshot, the "Restore etcd" block in the first_server.yml file is skipped due to the following condition: 'and ( "rke2-server.service" is not in ansible_facts.services )'.
Trying without this second condition, it works, with only 'when: rke2_etcd_snapshot_file'.
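In other words, for my case the block only needs to be gated on the snapshot file variable; a sketch of the condition I ended up with (the tasks inside the block stay as they are in first_server.yml):

- name: Restore etcd
  when: rke2_etcd_snapshot_file
  block:
    # ... unchanged tasks from first_server.yml ...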
Bug Report
ansible [core 2.13.3]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/exploit/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/exploit/.local/lib/python3.8/site-packages/ansible
ansible collection location = /home/exploit/.ansible/collections:/usr/share/ansible/collections
executable location = /home/exploit/.local/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = True
The snapshot should be restored.
The snapshot is not restored.
Update: not a bug. Solved by running the Ansible controller on a Linux VM (instead of the problematic WSL2) and disabling firewalld in every VM, as indicated in the documentation:
Firewalld conflicts with RKE2's default Canal (Calico + Flannel) networking stack. To avoid unexpected behavior, firewalld should be disabled on systems running RKE2.
I tried to install lablabs.rke2 (1.12.0) on 5 VMs running Rocky Linux 8.6 (3 masters and 2 workers) with HA + kube-vip. However, the 'Wait for the first server be ready' task failed after 40 retries. The error message is:
/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory
I checked the first server but didn't find the /var/lib/rancher directory. I only found a weird /etc/rancher/rke2/config.yaml on the first server.
Bug Report
ansible [core 2.13.1]
config file = None
configured module search path = ['/home/<redacted>/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/<redacted>/.local/lib/python3.10/site-packages/ansible
ansible collection location = /home/<redacted>/.ansible/collections:/usr/share/ansible/collections
executable location = /home/<redacted>/.local/bin/ansible
python version = 3.10.5 (main, Jun 11 2022, 16:53:24) [GCC 9.4.0]
jinja version = 3.1.2
libyaml = False
Commands:
$ ansible-galaxy install lablabs.rke2
$ ansible-playbook -i hosts playbook.yaml
$ ansible-playbook -i hosts playbook.yaml
P.S. The first run failed on the "Start RKE2 service on the first server" task because rke2-server.service wasn't ready yet.
playbook.yaml:
---
- name: Deploy RKE2
  hosts: k8s_cluster
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_interface: ens192
    rke2_loadbalancer_ip_range: 192.168.3.191-192.168.3.199
    rke2_server_taint: true
    rke2_api_ip: 192.168.3.190
    rke2_download_kubeconf: true
  roles:
    - role: lablabs.rke2
hosts:
k8s_cluster:
  children:
    masters:
      hosts:
        master01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master01.mydomain.com
          rke2_type: server
        master02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master02.mydomain.com
          rke2_type: server
        master03.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master03.mydomain.com
          rke2_type: server
    workers:
      hosts:
        worker01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker01.mydomain.com
          rke2_type: agent
        worker02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker02.mydomain.com
          rke2_type: agent
The cluster is deployed.
TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (39 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (38 retries left).
...
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (3 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (2 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (1 retries left).
fatal: [master01.mydomain.com]: FAILED! => {"attempts": 40, "changed": false, "cmd": "set -o pipefail\n/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep \"master01.mydomain.com\"\n", "delta": "0:00:00.005873", "end": "2022-07-12 11:50:39.491521", "msg": "non-zero return code", "rc": 1, "start": "2022-07-12 11:50:39.485648", "stderr": "/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory", "stderr_lines": ["/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory"], "stdout": "", "stdout_lines": []}
...
TASK [lablabs.rke2 : Download RKE2 kubeconfig to localhost] ************************************************************
fatal: [master02.mydomain.com -> master01.mydomain.com]: FAILED! => {"changed": false, "msg": "the remote file does not exist, not transferring, ignored"}
...
PLAY RECAP *************************************************************************************************************
master01.mydomain.com : ok=15 changed=1 unreachable=0 failed=1 skipped=13 rescued=0 ignored=0
master02.mydomain.com : ok=10 changed=0 unreachable=0 failed=1 skipped=15 rescued=0 ignored=0
master03.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
worker01.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
worker02.mydomain.com : ok=10 changed=0 unreachable=0 failed=0 skipped=15 rescued=0 ignored=0
The best way to install Cilium is to install RKE2 with no CNI and then use the Cilium CLI to install it, but then the node shows a status of NotReady, and the role checks for a "Ready" status on the pod network. I have tried commenting this check out, and also changing the expected stdout to "NotReady", but it still does not work. Am I missing something?
Feature Idea
The recently contributed rolling restart task is simply restarting the rke2 service on each node in order, but in my experience the service status being ready in systemd doesn't necessarily mean the node is actually ready and good to go, especially when upgrading master nodes.
I think we should add health checks and only restart the next node once the previous one is confirmed to be up and running again to avoid any potential cluster breakage, at least for the master nodes. This is less serious for worker nodes.
Currently we're using an extra playbook to run these tasks serially, because there isn't a straightforward way to integrate this into the Ansible role.
Posting the playbook below for inspiration, maybe somebody has a better idea how to get this into the role.
Edit: The one minute pause is there, because I noticed that sometimes health checks will work directly after restarting the rke2 service, but then potentially fail for the next few minutes until being ready again
---
- name: Restart RKE2 service and check health
  hosts: masters
  become: yes
  serial: 1
  tasks:
    - name: Restart RKE2 server on master nodes
      ansible.builtin.service:
        name: "rke2-server.service"
        state: restarted

    - name: Pause for 1 minute
      pause:
        minutes: 1

    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/server/tls/client-ca.crt --key /var/lib/rancher/rke2/server/tls/client-ca.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5

    - name: Healthcheck RKE2
      ansible.builtin.command: curl -k https://127.0.0.1:9345/v1-rke2/readyz
      register: healthcheck_rke_result
      until: healthcheck_rke_result.rc == 0
      retries: 100
      delay: 5

- name: Restart RKE2 service and check health
  hosts: workers
  become: yes
  serial: "30%"
  tasks:
    - name: Restart RKE2 server on worker nodes
      ansible.builtin.service:
        name: "rke2-agent.service"
        state: restarted

    - name: Pause for 1 minute
      pause:
        minutes: 1

    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5
Imagine your inventory comes from e.g. NetBox.
That means you don't control the order in which the servers appear.
So if you add a new server and run Ansible to add it, it can happen that the two existing servers are listed after the new first one.
It would therefore be better to always look for an active server, and only if no active server is found, use the first server.
Feature Idea
Hi, Thank you for this role.
I have a question about the generation of the config file /etc/rancher/rke2/config.yaml. I'm not sure how, but only the first server node receives the right config with tls-san. The other server nodes contain just:
server: https://<keepalived_IP>:9345
token:
snapshotter: overlayfs
So with the rke2.yaml there might be a TLS issue due to the missing Keepalived IP on the other server nodes.
Do you encounter this issue?
By default RKE2 sets:
In a manual installation, those values can be altered in rke2/config.yaml:
cluster-cidr: 10.1.0.0/16
service-cidr: 10.2.0.0/16
The goal is to add support for the cluster-cidr and service-cidr options in this role.
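Until dedicated variables exist, a possible interim sketch is to pass them through rke2_server_options, assuming the role renders each entry verbatim into config.yaml as it does for the other server options:

rke2_server_options:
  - "cluster-cidr: 10.1.0.0/16"
  - "service-cidr: 10.2.0.0/16"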
Feature Idea
Jul 01 13:23:15 server04 rke2[997]: time="2022-07-01T13:23:15+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
We get this warning when the token is just a plain token for the agents, but the agents should actually get the full token from /var/lib/rancher/rke2/server/node-token on one of the master nodes, I guess.
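A rough sketch of what I mean, fetching the full token from the first server and reusing it for the agents (the variable names here are my own, not the role's):

- name: Read the full node token from the first server
  ansible.builtin.slurp:
    src: /var/lib/rancher/rke2/server/node-token
  register: rke2_node_token_file
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  run_once: true

- name: Use the full token for the agents
  ansible.builtin.set_fact:
    rke2_token: "{{ rke2_node_token_file.content | b64decode | trim }}"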
Bug Report
ansible [core 2.11.12]
config file = /home/benji/code/ansible-infrastructure/ansible.cfg
configured module search path = ['/home/benji/code/ansible-infrastructure/library']
ansible python module location = /home/benji/.pyenv/versions/3.9.7/envs/ansible-infrastructure-3.9.7/lib/python3.9/site-packages/ansible
ansible collection location = /home/benji/.ansible/collections:/usr/share/ansible/collections
executable location = /home/benji/.pyenv/versions/ansible-infrastructure-3.9.7/bin/ansible
python version = 3.9.7 (default, Apr 7 2022, 12:58:08) [GCC 9.4.0]
jinja version = 3.0.1
libyaml = True
Expect not to get the warning.
...
Inside /templates/keepalived.conf.j2 the Ansible fact "ansible_default_ipv4" is used three times.
If you want to run an IPv6 / hybrid cluster whose main address is IPv6, you will run into one of these two problems:
When no IPv4 interface is configured:
The execution fails with a fatal error and Ansible will stop the provisioning.
When both IPv4 and IPv6 interfaces are configured:
It will setup the keepalived VRRP on the wrong interface.
I myself have just changed all occurrences of "ansible_default_ipv4" to "ansible_default_ipv6" as a workaround.
If possible, it would be nice if you could integrate a check which evaluates whether the provided node IP(s) are IPv4 or IPv6 and then, based on that check, use the corresponding Ansible fact. An even easier option would be to use a role variable to enable IPv6, as sketched below.
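A sketch of that second idea in templates/keepalived.conf.j2 (rke2_keepalived_ipv6 is a variable name I made up; the role does not have it today):

{# pick the fact family based on a role variable, then use keepalived_src_ip where the template previously used ansible_default_ipv4.address #}
{% if rke2_keepalived_ipv6 | default(false) %}
{% set keepalived_src_ip = ansible_default_ipv6.address %}
{% else %}
{% set keepalived_src_ip = ansible_default_ipv4.address %}
{% endif %}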
Bug Report
ansible [core 2.14.2]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/home/anon/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.10/site-packages/ansible
ansible collection location = /home/anon/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0] (/usr/bin/python)
jinja version = 3.1.2
libyaml = True
Inventory:
[masters]
rancher-server01 ansible_host=[IPv6-Address] rke2_type=server
rancher-server02 ansible_host=[IPv6-Address] rke2_type=server
rancher-server03 ansible_host=[IPv6-Address] rke2_type=server
[workers]
rancher-node01 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node02 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node03 ansible_host=[IPv6-Address] rke2_type=agent
[k8s_cluster:children]
masters
workers
Provisioning should run through all steps as usual and bind the keepalived VRRP to the IPv6 interface.
Provisioning fails at keepalived VRRP if no IPv4 is set up, or uses the wrong interface in an IPv4/IPv6 hybrid setup.
Currently, on upgrades, this playbook only installs the new version and restarts the rke2 service without draining beforehand.
I think a cordon, then a drain could be a good feature before these tasks.
I could implement this feature.
Do you think it is a good idea? If it is, do you have any suggestions?
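If it helps, the shape I had in mind is roughly the following, run per node before the existing restart task (a sketch only; the kubectl and kubeconfig paths follow the ones already used elsewhere in the role):

- name: Cordon and drain the node before upgrading
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    drain {{ inventory_hostname }} --ignore-daemonsets --delete-emptydir-data
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  changed_when: true

- name: Uncordon the node after the restart
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    uncordon {{ inventory_hostname }}
  delegate_to: "{{ groups[rke2_servers_group_name] | first }}"
  changed_when: true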
Feature Idea
As described in the server configuration documentation, it is possible to define a separate token for agent nodes to use, one that does not expose all the etcd secrets the way the server token does.
Please allow setting an agent token in addition to the server token on server nodes.
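In the meantime this can probably be passed through rke2_server_options on the server nodes, since agent-token is a plain RKE2 server config key (vault_rke2_agent_token below is just an example variable name):

rke2_server_options:
  - "agent-token: {{ vault_rke2_agent_token }}"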
Feature Idea
When trying to install an RKE2 cluster, I receive an error.
I'm installing a single server:
[masters]
team-edge1-k8s
fatal: [team-edge1-k8s]: FAILED! => {"msg": "The conditional check 'rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services )' failed. The error was: template error while templating string: expected token ')', got '.'. String: {% if rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services ) %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/first_server.yml': line 40, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n block:\n - name: Create the RKE2 etcd snapshot dir\n ^ here\n"}
Bug Report
ansible [core 2.12.10]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3/dist-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /usr/bin/ansible
python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
jinja version = 2.10.1
libyaml = True
- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_download_kubeconf: true
    rke2_interface: ens160
    rke2_version: v1.24.7+rke2r1
    rke2_disable: rke2-ingress-nginx
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
          - 'https://harbor.intent.ai'
        rewrite: '"^rancher/(.*)": "harbor.int.ai/rancher/$1"'
  roles:
    - role: ansible-role-rke2
I expect a clean installation but receive a conditional check error.
-edge1-k8s : ok=14 changed=4 unreachable=0 failed=1 skipped=14 rescued=0 ignored=0
If multiple master node servers try to join the cluster concurrently, the rke2-server service fails on one or more master nodes, with the following error:
"Failed to start Rancher Kubernetes Engine v2 (server)."
When the rke2-server service fails on a host, Ansible considers the failure fatal and stops executing the following tasks on the respective host.
The result is that the node will eventually join the cluster, because systemd keeps restarting the service until activation succeeds, but the playbook stops executing the tasks on that particular host.
I resolved this issue by adding a retry on the task "Start RKE2 service on the rest of the nodes" in the file remaining_nodes.yml.
Here's the commit on my forked project: GabriFedi97@70abe0d
There is an issue on rke2 that explains the problem: rancher/rke2#349
Custom registry configs don't work without defining a mirror and manually restarting rke2-agent on all nodes.
Bug Report
ansible [core 2.14.4]
config file = /Users/moray/.ansible.cfg
configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /opt/homebrew/Cellar/ansible/7.4.0/libexec/lib/python3.11/site-packages/ansible
ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
executable location = /opt/homebrew/bin/ansible
python version = 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] (/opt/homebrew/Cellar/ansible/7.4.0/libexec/bin/python3.11)
jinja version = 3.1.2
libyaml = True
rke2_custom_registry_configs:
  - endpoint: registry.example.com
    config:
      auth:
        username: "REDACTED"
        password: "REDACTED"
rke2_custom_registry_mirrors:
  - name: dummy.example.com
    endpoint:
      - https://dummy.example.com
rke2_custom_registry_configs:
  - endpoint: registry.example.com
    config:
      auth:
        username: "REDACTED"
        password: "REDACTED"
Then restart rke2-agent on all nodes.
Stopping at step 2 should be enough to update the registry configuration.
- Stopping at step 2 doesn't update the registry configs
- Stopping at step 3 produces 401 errors from the registry on image pull
Changes made in /etc/rancher/rke2/config.yaml need a restart,
i.e. an added tls-san, a changed network plugin, or a changed cluster-cidr.
Feature Idea
The role currently uses the Ansible copy module to copy custom manifests. With this module we cannot use variables in the files/manifests.
If we need variable interpolation in the copied files, we probably need to use the ansible.builtin.template module.
The main idea is to replace the copy module with template, as sketched below.
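A minimal sketch of the idea (the task shape and destination path are assumptions on my side, not the role's current implementation):

- name: Deploy custom manifests (rendered with Jinja)
  ansible.builtin.template:
    src: "{{ item }}"
    dest: "/var/lib/rancher/rke2/server/manifests/{{ item | basename | regex_replace('\\.j2$', '') }}"
    mode: "0644"
  loop: "{{ rke2_custom_manifests }}"
  when: rke2_custom_manifests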
Feature Idea
As of the latest release, disable_kube_proxy defaults to true, thus when creating an RKE2 cluster without specifying a CNI (default is Canal) you get a broken cluster.
I assume we'd want disable_kube_proxy to default to false.
Bug Report
-
Leaving disable_kube_proxy and rke2_cni to their defaults.
rke2_cni: canal
disable_kube_proxy: true
Working cluster.
Broken cluster.
Hello,
How can I enable dual-stack networking when initializing the k8s cluster (with IPv4/IPv6 addresses supplied via main.yaml)?
Thank you.
Feature Idea
It seems the keepalived configuration is not working quite right. I have a cluster made up of 3 masters and switched off the first one; failover to the second one worked as expected. But when I turn the first master back on, it gets priority again, despite failing the script check for port 6443. journald logs:
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Assigned address 192.168.1.221 for interface ens160
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Registering gratuitous ARP shared channel
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) removing VIPs.
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Entering BACKUP STATE (init)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP sockpool: [ifindex( 2), family(IPv4), proto(112), fd(11,14), unicast, address(192.168.1.221)]
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_rke2server` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_rke2server) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 150 to 145
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_apiserver` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_apiserver) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 145 to 140
Apr 06 08:42:24 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) received lower priority (139) advert from 192.168.1.222 - discarding
Even with both scripts failing priority stays at 140, while the first backup server only has a priority of 139 (not quite sure why that is actually, the scripts should be working there), so it will never switch to the backup server (except when the first one is completely offline)
logs of second master while the first one was shut down:
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Receive advertisement timeout
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering MASTER STATE
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) setting VIPs.
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Master received advert from 192.168.1.221 with higher priority 140, ours 139
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering BACKUP STATE
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) removing VIPs.
I suspect something is off with the weights in the keepalived config
Can you implement etcd snapshots on MinIO or S3?
In rke2 there are:
etcd-expose-metrics: false
etcd-snapshot-name: "prefix_name"
etcd-snapshot-schedule-cron: "0 */1 * * *"
etcd-snapshot-retention: 360
etcd-s3: true
etcd-s3-region: "eu-west-1"
etcd-s3-endpoint: "s3-eu-west-1.amazonaws.com"
etcd-s3-bucket: "my-bucket"
etcd-s3-folder: "rke2-test"
etcd-s3-access-key: "AKIA2Q..."
etcd-s3-secret-key: "jkN0xL...."
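Assuming the role's generic rke2_server_options variable (mentioned in the next report below) accepts arbitrary config.yaml keys as a list of "key: value" strings, the snapshot settings above could probably be passed along these lines; a sketch only, with placeholder credentials:

rke2_server_options:
  - "etcd-snapshot-name: prefix_name"
  - "etcd-snapshot-schedule-cron: '0 */1 * * *'"
  - "etcd-snapshot-retention: 360"
  - "etcd-s3: true"
  - "etcd-s3-region: eu-west-1"
  - "etcd-s3-endpoint: s3-eu-west-1.amazonaws.com"
  - "etcd-s3-bucket: my-bucket"
  - "etcd-s3-folder: rke2-test"
  - "etcd-s3-access-key: AKIA2Q..."
  - "etcd-s3-secret-key: jkN0xL...."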
Feature Idea
Hello,
Thanks for the role, it's working great!
I have a full config file for RKE2 server with multiple parameters (etcd-s3)
I tried to use it directly with
rke2_server_options: "{{ lookup('file', '../config.yml') | from_yaml }}"
but unfortunately it only writes the keys, not the values, into the file generated from the template.
It would be marvellous to be able to use my own config file.
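One workaround that should produce the expected "key: value" lines, assuming rke2_server_options is a list of strings (a sketch, not part of the role):

- name: Build rke2_server_options from an existing config file
  ansible.builtin.set_fact:
    rke2_server_options: "{{ (rke2_server_options | default([])) + [item.key ~ ': ' ~ (item.value | to_json)] }}"
  loop: "{{ lookup('file', '../config.yml') | from_yaml | dict2items }}"

The to_json filter is there so that list and boolean values are not flattened into bare strings.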
Hello,
First of all, thank you for this great work! I love it!
May I suggest a new feature ?
I would be very interested in an option to restore an etcd snapshot at installation time.
I mean:
Do you think it could be possible?
Best regards,
Olivier
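For context, RKE2 itself exposes a snapshot restore via rke2 server --cluster-reset --cluster-reset-restore-path=<snapshot>, so such a feature would mostly need a task wrapping that command on the first server. A rough sketch; the rke2_etcd_snapshot_restore_path variable is hypothetical and not part of the role:

- name: Restore etcd from a snapshot before configuring the cluster
  ansible.builtin.command: >-
    rke2 server --cluster-reset
    --cluster-reset-restore-path={{ rke2_etcd_snapshot_restore_path }}
  when: rke2_etcd_snapshot_restore_path is defined
  run_once: true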
In the default values in defaults/main.yaml, disable_kube_proxy: true is commented out. This breaks every stage concerning the first node, because the variable is referenced while undefined.
Bug Report
ansible [core 2.14.4]
config file = None
configured module search path = ['/home/cloudadm/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /home/cloudadm/.local/lib/python3.9/site-packages/ansible
ansible collection location = /home/cloudadm/.ansible/collections:/usr/share/ansible/collections
executable location = /home/cloudadm/.local/bin/ansible
python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (/usr/bin/python3)
jinja version = 3.1.2
libyaml = True
Run the role without explicitly assigning a value to disable_kube_proxy.
The role should run smoothly with disable_kube_proxy set to true
The role breaks when it calls the variable during a check.
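The smallest fix is probably to give the variable a defined default instead of a commented-out line (a sketch):

# defaults/main.yaml
disable_kube_proxy: false

Alternatively, the tasks that reference it could guard the check with disable_kube_proxy | default(false).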
Hello,
Currently rke2_loadbalancer_ip_range only sets range-global, so we can't give kube-vip a subnet scoped to a specific namespace.
rke2_loadbalancer_ip_range should be a dict like this:
rke2_loadbalancer_ip_range:
  range-global: 192.168.1.50-192.168.1.100
  range-namespace: 192.168.2.50-192.168.2.100
If you agree with the idea I can open a PR
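For reference, kube-vip-cloud-provider reads these ranges from a ConfigMap named kubevip in kube-system, so a dict like the one above would roughly render to the following (the namespace key name is illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevip
  namespace: kube-system
data:
  range-global: 192.168.1.50-192.168.1.100
  range-team-a: 192.168.2.50-192.168.2.100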
Feature Idea
The CI job with molecule test is failing.
The pipeline originally started to fail when Molecule 5.0.0 became available on PyPI.
One issue is the Docker plugin (it now needs to be installed as the molecule-plugins package instead of molecule[docker]).
But even after that change the pipeline fails with different errors; this needs to be investigated and fixed.
Some hints: ansible/molecule#3883
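A sketch of what the dependency install step in the workflow could look like (version pins are illustrative):

# step in the GitHub Actions workflow
- name: Install test dependencies
  run: pip install "molecule>=5" "molecule-plugins[docker]"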
Bug Report
Run github action CI job with molecule test
Example: https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181
The Molecule test should not fail with a Python error.
https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181
When I tried running the playbook again on the cluster to increase the number of nodes, I hit an issue where the playbook looks for /tmp/rke2.sh on the old nodes; but my cluster nodes have rebooted since they were initialized, so this is what I am getting:
{"changed": false, "msg": "file (/tmp/rke2.sh) is absent, cannot continue", "path": "/tmp/rke2.sh", "state": "absent"}
Any idea how to bypass this?
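A workaround, until the role handles a missing installer gracefully, might be to re-fetch the script before re-running the role (a sketch using the same download location; not part of the role):

- name: Ensure the RKE2 install script is present again
  ansible.builtin.get_url:
    url: https://get.rke2.io
    dest: /tmp/rke2.sh
    mode: "0755"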
When changing the rke2_version variable, the new version is installed but doesn't take effect until the rke2-server service is restarted.
We should look at adding an Ansible handler that checks whether the version has changed and, if it has, restarts each host one by one.
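A minimal sketch of the suggested handler (names are illustrative; agent nodes would need the equivalent for rke2-agent, and a real implementation would also want to serialize or drain the restarts):

# handlers/main.yaml
- name: Restart rke2-server
  ansible.builtin.service:
    name: rke2-server
    state: restarted

# and in the install/config tasks, something like:
#   notify: Restart rke2-server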
The RKE2 server process listens on a fixed IP on port 9345. If it is down, new nodes will not be able to provision.
The Keepalived script is only configured to check the apiserver. The proposal is to add a curl against port 9345 to the check_apiserver script, e.g.:
check_apiserver.sh.j2
#!/bin/sh
errorExit() {
    echo "*** $*" 1>&2
    exit 1
}
curl --silent --max-time 2 --insecure https://localhost:9345/ -o /dev/null || errorExit "Error GET https://localhost:9345/"
curl --silent --max-time 2 --insecure https://localhost:{{ rke2_apiserver_dest_port }}/ -o /dev/null || errorExit "Error GET https://localhost:{{ rke2_apiserver_dest_port }}/"
if ip addr | grep -q {{ rke2_api_ip }}; then
    curl --silent --max-time 2 --insecure https://{{ rke2_api_ip }}:{{ rke2_apiserver_dest_port }}/ -o /dev/null || errorExit "Error GET https://{{ rke2_api_ip }}:{{ rke2_apiserver_dest_port }}/"
fi
Currently every node in the RKE2 cluster must share the same primary interface for the kube-vip DaemonSet to function. If a node has a different network card, etc., we need to be able to specify that node's unique interface.
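Sketching what the requested knob might look like as a role variable (purely hypothetical; nothing like this exists in the role today):

# hypothetical variable: map node names to their kube-vip interface
rke2_kubevip_node_interfaces:
  worker-5: ens192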
Feature Idea
At the moment, when checking for an existing version, the task rke2.yaml uses the hardcoded path /usr/local/bin/rke2.
If you look at the install script from get.rke2.io, it changes the target when /usr/local is on a separate partition or read-only:
# --- install tarball to /usr/local by default, except if /usr/local is on a separate partition or is read-only
# --- in which case we go into /opt/rke2.
So the rke2 task should be able to handle the other location as well.
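A sketch of how the version check could handle both install prefixes (paths taken from the get.rke2.io comment quoted above; task names are illustrative):

- name: Find the installed rke2 binary
  ansible.builtin.stat:
    path: "{{ item }}"
  loop:
    - /usr/local/bin/rke2
    - /opt/rke2/bin/rke2
  register: rke2_binary_candidates

- name: Remember where the existing rke2 binary lives
  ansible.builtin.set_fact:
    rke2_bin_path: "{{ rke2_binary_candidates.results | selectattr('stat.exists') | map(attribute='item') | first | default('/usr/local/bin/rke2') }}"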
Hi There,
I'm trying to expand an existing cluster that currently consists of 3 nodes acting as combined master/worker nodes.
I would like to expand the cluster by adding worker-only nodes, and for that I have the following definitions:
Ansible runs fine, but all nodes are reported as control-plane nodes:
# kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready control-plane,etcd,master 90s v1.22.10+rke2r2
Am I doing something wrong, or could this be a bug?
Bug Report
ansible [core 2.13.2]
config file = None
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
executable location = /root/.local/bin/ansible
python version = 3.9.13 (main, Jun 10 2022, 09:50:06) [GCC]
jinja version = 3.1.2
libyaml = True
inventory.yaml
[masters]
master-1 ansible_host=master-1 rke2_type=server
master-2 ansible_host=master-2 rke2_type=server
master-3 ansible_host=master-3 rke2_type=server
[workers]
worker-1 ansible_host=worker-1 rke2_type=agent
[k8s_cluster:children]
masters
workers
main.yaml
# Install RKE2
- name: RKE2 Setup on Cluster-wide
hosts: k8s_cluster
roles:
- role: RKE2Cluster
vars.yaml
---
# RKE2 Settings
os_privileged_group: gok8sadm
rke2_type: server
rke2_airgap_mode: false
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_api_ip: 192.168.86.100
rke2_interface: eth0
rke2_loadbalancer_ip_range: 192.168.86.101-192.168.86.105
rke2_kubevip_cloud_provider_enable: true
rke2_kubevip_svc_enable: true
rke2_additional_sans: [ my-k8s-dev.hutger.xyz ]
rke2_apiserver_dest_port: 6443
rke2_disable:
- rke2-ingress-nginx
rke2_server_taint: false
rke2_token: my-token
rke2_version: v1.22.10+rke2r2
rke2_data_path: /var/lib/rancher/rke2
rke2_channel: stable
rke2_cni: canal
rke2_download_kubeconf: true
rke2_download_kubeconf_file_name: rke2.yaml
rke2_download_kubeconf_path: /tmp
rke2_servers_group_name: masters
rke2_agents_group_name: workers
No roles associated with worker-1:
kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready <none> 90s v1.22.10+rke2r2
worker-1 is being set as a master:
kubectl get nodes -w
NAME STATUS ROLES AGE VERSION
master-1 Ready control-plane,etcd,master 7h54m v1.22.10+rke2r2
master-2 Ready control-plane,etcd,master 6h45m v1.22.10+rke2r2
master-3 Ready control-plane,etcd,master 6h50m v1.22.10+rke2r2
worker-1 Ready control-plane,etcd,master 90s v1.22.10+rke2r2
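One possible explanation, offered only as a sketch to check: play-level vars (the vars.yaml above) take precedence over inventory host vars in Ansible, so the cluster-wide rke2_type: server may be overriding the rke2_type=agent set on worker-1. Dropping it from vars.yaml and keeping it per group would look roughly like this:

# group_vars/masters.yml
rke2_type: server

# group_vars/workers.yml
rke2_type: agent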