
ansible-role-rke2's Introduction

RKE2 Ansible Role


This Ansible role deploys an RKE2 Kubernetes cluster. RKE2 is installed using the tarball method.

The role can install RKE2 in three modes:

  • RKE2 single node

  • RKE2 cluster with one server (master) node and one or more agent (worker) nodes

  • RKE2 cluster with the server (master) nodes in High Availability mode and zero or more agent (worker) nodes. In HA mode you should have an odd number of server (master) nodes (three recommended) that will run etcd, the Kubernetes API (reachable via a Keepalived VIP or kube-vip address), and other control plane services.


  • Additionally, it is possible to install the RKE2 cluster (in all three modes) in an air-gapped environment using local artifacts.

It is possible to upgrade RKE2 by changing the rke2_version variable and re-running the playbook with this role. During the upgrade the RKE2 service on the nodes is restarted one by one. The role checks that the node whose service was restarted is in the Ready state before restarting the service on the next Kubernetes node.
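As an illustration, a minimal upgrade sketch (assuming the lablabs.rke2 role name used in the playbook examples below); only rke2_version is changed before re-running against the same inventory:

- name: Upgrade RKE2
  hosts: k8s_cluster
  become: yes
  vars:
    # Bump to the desired release; the role performs the rolling restart described above
    rke2_version: v1.25.3+rke2r1
  roles:
     - role: lablabs.rke2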

Requirements

  • Ansible 2.10+

Tested on

  • Rocky Linux 8
  • Ubuntu 20.04 LTS
  • Ubuntu 22.04 LTS

Role Variables

This is a copy of defaults/main.yml

---
# The node type - server or agent
rke2_type: server

# Deploy the control plane in HA mode
rke2_ha_mode: false

# Install and configure Keepalived on Server nodes
# Can be disabled if you are using pre-configured Load Balancer
rke2_ha_mode_keepalived: true

# Install and configure kube-vip LB and VIP for cluster
# rke2_ha_mode_keepalived needs to be false
rke2_ha_mode_kubevip: false

# Kubernetes API and RKE2 registration IP address. The default address is the IPv4 address of the Server/Master node.
# In HA mode choose a static IP which will be set as the VIP in Keepalived,
# or, if Keepalived is disabled, use the IP address of your LB.
rke2_api_ip: "{{ hostvars[groups[rke2_servers_group_name].0]['ansible_default_ipv4']['address'] }}"

# optional option for RKE2 Server to listen on a private IP address on port 9345
# rke2_api_private_ip:

# optional option for kubevip IP subnet
# rke2_api_cidr: 24

# optional option for kubevip
# rke2_interface: eth0
# optional option for IPv4/IPv6 addresses to advertise for node
# rke2_bind_address: "{{ hostvars[inventory_hostname]['ansible_' + rke2_interface]['ipv4']['address'] }}"

# kubevip load balancer IP range
rke2_loadbalancer_ip_range: {}
#  range-global: 192.168.1.50-192.168.1.100
#  cidr-finance: 192.168.0.220/29,192.168.0.230/29

# Install kubevip cloud provider if rke2_ha_mode_kubevip is true
rke2_kubevip_cloud_provider_enable: true

# Enable kube-vip to watch Services of type LoadBalancer
rke2_kubevip_svc_enable: true

# Specify which image is used for kube-vip container
rke2_kubevip_image: ghcr.io/kube-vip/kube-vip:v0.6.4

# Specify which image is used for kube-vip cloud provider container
rke2_kubevip_cloud_provider_image: ghcr.io/kube-vip/kube-vip-cloud-provider:v0.0.4

# Enable kube-vip IPVS load balancer for control plane
rke2_kubevip_ipvs_lb_enable: false
# Enable layer 4 load balancing for control plane using IPVS kernel module
# Must use kube-vip version 0.4.0 or later

rke2_kubevip_service_election_enable: true
# By default ARP mode provides a HA implementation of a VIP (your service IP address) which will receive traffic on the kube-vip leader.
# To circumvent this kube-vip has implemented a new function which is "leader election per service",
# instead of one node becoming the leader for all services an election is held across all kube-vip instances and the leader from that election becomes the holder of that service. Ultimately,
# this means that every service can end up on a different node when it is created in theory preventing a bottleneck in the initial deployment.
# minimum kube-vip version 0.5.0

# (Optional) A list of kube-vip flags
# All flags can be found here https://kube-vip.io/docs/installation/flags/
# rke2_kubevip_args: []
# - param: lb_enable
#   value: true
# - param: lb_port
#   value: 6443

# Prometheus metrics port for kube-vip
rke2_kubevip_metrics_port: 2112

# Add additional SANs in k8s API TLS cert
rke2_additional_sans: []

# API Server destination port
rke2_apiserver_dest_port: 6443

# Server nodes taints
rke2_server_node_taints: []
  # - 'CriticalAddonsOnly=true:NoExecute'

# Agent nodes taints
rke2_agent_node_taints: []

# Pre-shared secret token that other server or agent nodes will register with when connecting to the cluster
rke2_token: defaultSecret12345

# RKE2 version
rke2_version: v1.25.3+rke2r1

# URL to RKE2 repository
rke2_channel_url: https://update.rke2.io/v1-release/channels

# URL to RKE2 install bash script
# e.g. the Rancher Chinese mirror http://rancher-mirror.rancher.cn/rke2/install.sh
rke2_install_bash_url: https://get.rke2.io

# Local data directory for RKE2
rke2_data_path: /var/lib/rancher/rke2

# Default URL to fetch artifacts
rke2_artifact_url: https://github.com/rancher/rke2/releases/download/

# Local path to store artifacts
rke2_artifact_path: /rke2/artifact

# Airgap required artifacts
rke2_artifact:
  - sha256sum-{{ rke2_architecture }}.txt
  - rke2.linux-{{ rke2_architecture }}.tar.gz
  - rke2-images.linux-{{ rke2_architecture }}.tar.zst

# Changes the deploy strategy to install based on local artifacts
rke2_airgap_mode: false

# Airgap implementation type - download, copy or exists
# - 'download' will fetch the artifacts on each node,
# - 'copy' will transfer local files in 'rke2_artifact' to the nodes,
# - 'exists' assumes 'rke2_artifact' files are already stored in 'rke2_artifact_path'
rke2_airgap_implementation: download

# Local source path where artifacts are stored
rke2_airgap_copy_sourcepath: local_artifacts

# Tarball images for additional components to be copied from rke2_airgap_copy_sourcepath to the nodes
# (File extensions in the list and on the real files must be retained)
rke2_airgap_copy_additional_tarballs: []

# Destination for airgap additional images tarballs ( see https://docs.rke2.io/install/airgap/#tarball-method )
rke2_tarball_images_path: "{{ rke2_data_path }}/agent/images"

# Architecture to be downloaded, currently there are releases for amd64 and s390x
rke2_architecture: amd64

# Destination directory for RKE2 installation script
rke2_install_script_dir: /var/tmp

# RKE2 channel
rke2_channel: stable

# Do not deploy packaged components and delete any deployed components
# Valid items: rke2-canal, rke2-coredns, rke2-ingress-nginx, rke2-metrics-server
rke2_disable:

# Option to disable kube-proxy
disable_kube_proxy: false

# Option to disable builtin cloud controller - mostly for onprem
rke2_disable_cloud_controller: false

# Cloud provider to use for the cluster (aws, azure, gce, openstack, vsphere, external)
# applicable only if rke2_disable_cloud_controller is true
rke2_cloud_provider_name: "rke2"

# Path to custom manifests deployed during the RKE2 installation
# It is possible to use Jinja2 templating in the manifests
rke2_custom_manifests:

# Path to static pods deployed during the RKE2 installation
rke2_static_pods:

# Configure custom Containerd Registry
rke2_custom_registry_mirrors: []
  # - name:
  #   endpoint: {}
#   rewrite: '"^rancher/(.*)": "mirrorproject/rancher-images/$1"'

# Configure custom Containerd Registry additional configuration
rke2_custom_registry_configs: []
#   - endpoint:
#     config:

# Path to Container registry config file template
rke2_custom_registry_path: templates/registries.yaml.j2

# Path to RKE2 config file template
rke2_config: templates/config.yaml.j2

# Etcd snapshot source directory
rke2_etcd_snapshot_source_dir: etcd_snapshots

# Etcd snapshot file name.
# When the file name is defined, etcd will be restored during the initial deployment Ansible run.
# The restore happens only during that initial run, so even if you leave the file name specified,
# etcd will remain untouched during subsequent runs.
# You can either use this or set options in `rke2_etcd_snapshot_s3_options`
rke2_etcd_snapshot_file:

# Etcd snapshot location
rke2_etcd_snapshot_destination_dir: "{{ rke2_data_path }}/server/db/snapshots"

# Etcd snapshot s3 options
# Set either all these values or `rke2_etcd_snapshot_file` and `rke2_etcd_snapshot_source_dir`

# rke2_etcd_snapshot_s3_options:
  # s3_endpoint: "" # required
  # access_key: "" # required
  # secret_key: "" # required
  # bucket: "" # required
  # snapshot_name: "" # required.
  # skip_ssl_verify: false # optional
  # endpoint_ca: "" # optional. Can skip if using defaults
  # region: "" # optional - defaults to us-east-1
  # folder: "" # optional - defaults to top level of bucket
# Override default containerd snapshotter
rke2_snapshooter: overlayfs

# Deploy RKE2 with default CNI canal
rke2_cni: canal

# Validate system configuration against the selected benchmark
# (Supported values are "cis-1.23", or "cis-1.6" if you are running RKE2 prior to 1.25)
rke2_cis_profile: ""

# Download Kubernetes config file to the Ansible controller
rke2_download_kubeconf: false

# Name under which the Kubernetes config file will be downloaded to the Ansible controller
rke2_download_kubeconf_file_name: rke2.yaml

# Destination directory on the Ansible controller where the Kubernetes config file will be downloaded
rke2_download_kubeconf_path: /tmp

# Default Ansible Inventory Group name for RKE2 cluster
rke2_cluster_group_name: k8s_cluster

# Default Ansible Inventory Group name for RKE2 Servers
rke2_servers_group_name: masters

# Default Ansible Inventory Group name for RKE2 Agents
rke2_agents_group_name: workers

# (Optional) A list of Kubernetes API server flags
# All flags can be found here https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver
# rke2_kube_apiserver_args: []

# (Optional) List of Node labels
# k8s_node_label: []

# (Optional) Additional RKE2 server configuration options
# You could find the flags at https://docs.rke2.io/reference/server_config
# rke2_server_options:
#   - "option: value"
#   - "node-ip: {{ rke2_bind_address }}"  # ex: (agent/networking) IPv4/IPv6 addresses to advertise for node

# (Optional) Additional RKE2 agent configuration options
# You could find the flags at https://docs.rke2.io/reference/linux_agent_config
# rke2_agent_options:
#   - "option: value"
#   - "node-ip: {{ rke2_bind_address }}"  # ex: (agent/networking) IPv4/IPv6 addresses to advertise for node

# (Optional) Configure Proxy
# All flags can be found here https://docs.rke2.io/advanced#configuring-an-http-proxy
# rke2_environment_options: []
#   - "option=value"
#   - "HTTP_PROXY=http://your-proxy.example.com:8888"

# (Optional) Customize default kube-controller-manager arguments
# This functionality allows appending the argument if it is not present by default or replacing it if it already exists.
# rke2_kube_controller_manager_arg:
#   - "bind-address=0.0.0.0"

# (Optional) Customize default kube-scheduler arguments
# This functionality allows appending the argument if it is not present by default or replacing it if it already exists.
# rke2_kube_scheduler_arg:
#   - "bind-address=0.0.0.0"

# Cordon, drain the node which is being upgraded. Uncordon the node once the RKE2 upgraded
rke2_drain_node_during_upgrade: false

# Wait for all pods to be ready after rke2-service restart during rolling restart.
rke2_wait_for_all_pods_to_be_ready: false

# Enable debug mode (rke2-service)
rke2_debug: false

# (Optional) Customize default kubelet arguments
# rke2_kubelet_arg:
#   - "--system-reserved=cpu=100m,memory=100Mi"

# (Optional) Customize default kube-proxy arguments
# rke2_kube_proxy_arg:
#   - "proxy-mode=ipvs"

# The value for the node-name configuration item
rke2_node_name: "{{ inventory_hostname }}"

Inventory file example

This role relies on distributing the nodes into the masters and workers inventory groups. The RKE2 Kubernetes master/server nodes must belong to the masters group and the worker/agent nodes must be members of the workers group. Both groups have to be children of the k8s_cluster group.

[masters]
master-01 ansible_host=192.168.123.1 rke2_type=server
master-02 ansible_host=192.168.123.2 rke2_type=server
master-03 ansible_host=192.168.123.3 rke2_type=server

[workers]
worker-01 ansible_host=192.168.123.11 rke2_type=agent
worker-02 ansible_host=192.168.123.12 rke2_type=agent
worker-03 ansible_host=192.168.123.13 rke2_type=agent

[k8s_cluster:children]
masters
workers

Playbook example

This playbook will deploy RKE2 to a single node acting as both server and agent.

- name: Deploy RKE2
  hosts: node
  become: yes
  roles:
     - role: lablabs.rke2

This playbook will deploy RKE2 to a cluster with one server(master) and several agent(worker) nodes.

- name: Deploy RKE2
  hosts: all
  become: yes
  roles:
     - role: lablabs.rke2

This playbook will deploy RKE2 to a cluster with one server(master) and several agent(worker) nodes in air-gapped mode. It will use Multus and Calico as CNI.

- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_airgap_mode: true
    rke2_airgap_implementation: download
    rke2_cni:
      - multus
      - calico
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-multus.linux-{{ rke2_architecture }}
      - rke2-images-calico.linux-{{ rke2_architecture }}
  roles:
     - role: lablabs.rke2

This playbook will deploy RKE2 to a cluster with an HA server (master) control plane and several agent (worker) nodes. The server (master) nodes will be tainted so that workloads are scheduled only on the worker (agent) nodes. The role will also install Keepalived on the control-plane nodes and set up a VIP address where the Kubernetes API will be reachable. It will also download the Kubernetes config file to the local machine.

- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_api_ip: 192.168.123.100
    rke2_download_kubeconf: true
    rke2_server_node_taints:
      - 'CriticalAddonsOnly=true:NoExecute'
  roles:
     - role: lablabs.rke2

Having separate token for agent nodes

As per the server configuration documentation, it is possible to define an agent token, which agent nodes will use to connect to the cluster, giving them less access to the cluster than server nodes have. The following modifications to the above configuration would be necessary:

  • remove rke2_token from global vars
  • add to group_vars/masters.yml:

    rke2_token: defaultSecret12345
    rke2_agent_token: agentSecret54321

  • add to group_vars/workers.yml:

    rke2_token: agentSecret54321

While changing the server token is problematic, the agent token can be rotated at will, as long as servers and agents have the same value and the services (rke2-server and rke2-agent, as appropriate) have been restarted so the processes pick up the new value.
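A minimal sketch of such a restart (assuming the rke2-server/rke2-agent systemd unit names and the masters/workers groups from the inventory example above):

- name: Restart rke2-server after a token change
  hosts: masters
  become: yes
  serial: 1
  tasks:
    - name: Restart the rke2-server service
      ansible.builtin.service:
        name: rke2-server.service
        state: restarted

- name: Restart rke2-agent after a token change
  hosts: workers
  become: yes
  tasks:
    - name: Restart the rke2-agent service
      ansible.builtin.service:
        name: rke2-agent.service
        state: restarted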

Troubleshooting

Playbook stuck while starting the RKE2 service on agents

If the playbook starts to hang at the "Start RKE2 service on the rest of the nodes" task and then fails at the "Wait for remaining nodes to be ready" task, you probably have some limitations on your nodes' network.

Please check the required Inbound Rules for RKE2 Server Nodes at the following link: https://docs.rke2.io/install/requirements/#networking.
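A quick connectivity check from the agents towards the first server can help confirm this; a sketch assuming the masters/workers group names and ansible_host variables from the inventory example above, and the default ports 9345 (registration) and 6443 (API):

- name: Check network reachability from agents to the first server
  hosts: workers
  gather_facts: false
  tasks:
    - name: Wait for the RKE2 registration and API ports
      ansible.builtin.wait_for:
        host: "{{ hostvars[groups['masters'][0]]['ansible_host'] }}"
        port: "{{ item }}"
        timeout: 10
      loop:
        - 9345
        - 6443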

License

MIT

Author Information

Created in 2021 by Labyrinth Labs


ansible-role-rke2's Issues

Error templating

I am having this error while running the playbook.

{"msg": "The conditional check 'inventory_hostname is in groups.masters' failed. The error was: template error while templating string: expected token 'end of statement block', got '.'. String: {% if inventory_hostname is in groups.masters %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/main.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Install Keepalived when HA mode is enabled\n ^ here\n"}

I am using the latest Ansible version, and I have tried an older version as well. Any idea what's going on here?

feature: Support for cni: none for Cilium CLI installs

Summary

The best way to install Cilium is to install RKE2 with no CNI and then use the Cilium CLI to install it, but then the node will show a status of NotReady, while the role checks for a "Ready" status on the pod network. I have tried commenting this check out, and also changing the expected stdout to "NotReady", but it still does not work. Am I missing something?

Issue Type

Feature Idea

keepalived does not switch the IP when the rke2-server process is down

The RKE2 server process listens on a fixed IP on port 9345.
If it is down, new nodes will not be able to provision.

The Keepalived script is only configured to check the apiserver. The proposal is to add a curl against port 9345 to the check_apiserver script.

eg.

check_apiserver.sh.j2

#!/bin/sh
errorExit() {
    echo "*** $*" 1>&2
    exit 1
}
curl --silent --max-time 2 --insecure https://localhost:9345/ -o /dev/null || errorExit "Error GET https://localhost:9345/"
curl --silent --max-time 2 --insecure https://localhost:{{rke2_apiserver_dest_port}}/ -o /dev/null || errorExit "Error GET https://localhost:{{rke2_apiserver_dest_port}}/"
if ip addr | grep -q {{rke2_api_ip}}; then
    curl --silent --max-time 2 --insecure https://{{rke2_api_ip}}:{{rke2_apiserver_dest_port}}/ -o /dev/null || errorExit "Error GET https://{{rke2_api_ip}}:{{rke2_apiserver_dest_port}}/"
fi

bug: issues defining custom registry configs after cluster creation

Summary

Custom registry configs don't work without defining a mirror and manually restarting rke2-agent on all nodes.

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.4]
  config file = /Users/moray/.ansible.cfg
  configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/homebrew/Cellar/ansible/7.4.0/libexec/lib/python3.11/site-packages/ansible
  ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/homebrew/bin/ansible
  python version = 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] (/opt/homebrew/Cellar/ansible/7.4.0/libexec/bin/python3.11)
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

  1. Set up an RKE2 cluster without any registry mirrors or configs.
  2. Add a registry config to the existing cluster and rerun the playbook:
    rke2_custom_registry_configs:
      - endpoint: registry.example.com
        config:
          auth:
            username: "REDACTED"
            password: "REDACTED"
  3. Add a registry mirror to the existing cluster and rerun the playbook:
    rke2_custom_registry_mirrors:
      - name: dummy.example.com
        endpoint:
          - https://dummy.example.com
    rke2_custom_registry_configs:
      - endpoint: registry.example.com
        config:
          auth:
            username: "REDACTED"
            password: "REDACTED"
  4. Restart rke2-agent on all nodes.

Expected Results

Stopping at step 2 should be enough to update the registry configuration.

Actual Results

- Stopping at step 2 doesn't update the registry configs
- Stopping at step 3 produces 401 errors from the registry on image pull

feature: drain nodes on upgrade

Summary

Currently, on upgrades, this role only installs the new version and restarts the rke2 service without draining beforehand.
I think a cordon, followed by a drain, would be a good feature before these tasks.

I could implement this feature.
Do you think it is a good idea? If it is, do you have any suggestions?
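For reference, the role's defaults shown earlier in this document already expose a flag for this; a minimal sketch of enabling it:

- name: Deploy RKE2 and drain nodes during upgrades
  hosts: all
  become: yes
  vars:
    # Cordon and drain each node before its RKE2 service is restarted during upgrade
    rke2_drain_node_during_upgrade: true
  roles:
     - role: lablabs.rke2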

Issue Type

Feature Idea

bug: Warning on token

Summary

Jul 01 13:23:15 server04 rke2[997]: time="2022-07-01T13:23:15+02:00" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."

We get this warning when the token is just the plain token for the agents,
but it should actually be the full token from
/var/lib/rancher/rke2/server/node-token
on one of the master nodes, I guess.
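A hypothetical workaround sketch (not part of the role): read the full token from the first server's node-token file so it can be reused as rke2_token for the agents. The play below only reads and prints it.

- name: Read the full node token from the first server
  hosts: masters[0]
  become: yes
  tasks:
    - name: Slurp the node-token file
      ansible.builtin.slurp:
        src: /var/lib/rancher/rke2/server/node-token
      register: node_token_file

    - name: Show the full token (to be used as rke2_token for the agents)
      ansible.builtin.debug:
        msg: "{{ node_token_file.content | b64decode | trim }}"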

Issue Type

Bug Report

Ansible Version

ansible [core 2.11.12]
  config file = /home/benji/code/ansible-infrastructure/ansible.cfg
  configured module search path = ['/home/benji/code/ansible-infrastructure/library']
  ansible python module location = /home/benji/.pyenv/versions/3.9.7/envs/ansible-infrastructure-3.9.7/lib/python3.9/site-packages/ansible
  ansible collection location = /home/benji/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/benji/.pyenv/versions/ansible-infrastructure-3.9.7/bin/ansible
  python version = 3.9.7 (default, Apr  7 2022, 12:58:08) [GCC 9.4.0]
  jinja version = 3.0.1
  libyaml = True

Steps to Reproduce

Expected Results

Expect not to get the warning.

Actual Results

...

support /usr being mounted read-only

At the moment, when checking for an existing version, the rke2.yaml task file uses the hardcoded path /usr/local/bin/rke2.
If you look at the install script from get.rke2.io, it changes the target if /usr is mounted read-only:
# --- install tarball to /usr/local by default, except if /usr/local is on a separate partition or is read-only
# --- in which case we go into /opt/rke2.

So the rke2 task should be able to handle the other location as well.
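A hedged sketch of such handling, probing both install prefixes that the upstream script mentions (/usr/local and /opt/rke2); the detected_rke2_bin fact name is purely illustrative and not a role variable:

- name: Probe the possible RKE2 binary locations
  ansible.builtin.stat:
    path: "{{ item }}"
  register: rke2_bin_probe
  loop:
    - /usr/local/bin/rke2
    - /opt/rke2/bin/rke2

- name: Remember which binary exists (illustrative fact name)
  ansible.builtin.set_fact:
    detected_rke2_bin: "{{ item.item }}"
  loop: "{{ rke2_bin_probe.results }}"
  when: item.stat.exists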

bug: Unable to provision multiple nodes using Vagrant

Summary

I have a Vagrantfile that provisions three boxes running AlmaLinux 8 via libvirt, which use the Ansible provisioner to include this role.

I have no agent nodes, thus I'm not tainting the server nodes. I ran into no issues when provisioning a single node cluster, but I run into issues when specifying multiple server nodes in my Ansible inventory.

In High Availability mode:
I run into the following error on the task: Create keepalived config file

An exception occurred during task execution. To see the full traceback, use -vvv.
The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'.
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_default_ipv4'

I was able to get past this issue by changing the below line to {{ hostvars[host].ansible_host }}

{{ hostvars[host]['ansible_default_ipv4']['address'] }}

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.1]
  config file = None
  configured module search path = ['/home/austin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/austin/.local/lib/python3.11/site-packages/ansible
  ansible collection location = /home/austin/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/austin/.local/bin/ansible
  python version = 3.11.1 (main, Dec 11 2022, 15:18:51) [GCC 10.2.1 20201203] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

Have libvirt setup and the vagrant-libvirt plugin installed along with Vagrant, Ansible, and this role.

Below are the three files necessary when running vagrant up:

Vagrantfile

NODES = [
    { hostname: "controller1", ip: "192.168.111.2", ram: 4096, cpu: 2 },
    { hostname: "controller2", ip: "192.168.111.3", ram: 4096, cpu: 2 },
    { hostname: "controller3", ip: "192.168.111.4", ram: 4096, cpu: 2 }
]

Vagrant.configure(2) do |config|
  NODES.each do |node|
    config.vm.define node[:hostname] do |config|
      config.vm.hostname = node[:hostname]
      config.vm.box = "almalinux/8"
      config.vm.network :private_network, ip: node[:ip]

      config.vm.provider :libvirt do |domain|
        domain.memory = node[:ram]
        domain.cpus = node[:cpu]
      end

      config.vm.provision :ansible do |ansible|
        ansible.playbook = "playbooks/provision.yml"
        ansible.inventory_path = "inventory/hosts.ini"
      end
    end
  end
end

playbooks/provision.yml

- hosts: all
  become: true
  vars:
    rke2_channel: stable
    rke2_servers_group_name: rke2_servers
    rke2_agents_group_name: rke2_agents
    rke2_ha_mode: true
  roles:
  - lablabs.rke2

inventory/hosts.ini

[rke2_servers]
controller1 ansible_host=192.168.111.2 rke2_type=server
controller2 ansible_host=192.168.111.3 rke2_type=server
controller3 ansible_host=192.168.111.4 rke2_type=server

[rke2_agents]

[k8s_cluster:children]
rke2_servers
rke2_agents

Expected Results

For three server nodes to be provisioned after running vagrant up

Actual Results

All servers fail to provision rke2 Ansible role.

running the playbook after rebooting the nodes

When I tried running the playbook again on the cluster to increase the number of nodes, I faced an issue where the playbook looks for /tmp/rke2.sh on the old nodes; but my cluster nodes have rebooted since they were initialized, so this is what I am getting:

{"changed": false, "msg": "file (/tmp/rke2.sh) is absent, cannot continue", "path": "/tmp/rke2.sh", "state": "absent"}

Any idea how to bypass this?
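One thing to check, grounded only in the role's documented variables, is where the install script is expected to live: rke2_install_script_dir defaults to /var/tmp, which normally survives reboots. The sketch below just pins it explicitly; whether this resolves the behaviour above depends on the role version in use.

- name: Scale the RKE2 cluster
  hosts: all
  become: yes
  vars:
    # Keep the install script in a location that persists across reboots
    rke2_install_script_dir: /var/tmp
  roles:
     - role: lablabs.rke2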

feature: Node labels

Summary

I just made the part for taints.
I could do something similar for node labels, to set specific labels per node.

Issue Type

Feature Idea

bug: Monitoring with Prometheus stack chart (can't get targets Etcd, Scheduler, Controller-manager )

Summary

I've installed RKE2 with Ansible and installed the prometheus-stack chart there; in the targets, Etcd, Scheduler and Controller-manager can't be reached.


Issue Type

Bug Report

Ansible Version

ansible 2.10.17
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.8/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]


Steps to Reproduce

Install rke2.
Install prometheus-stack chart.
Check targets for monitoring in Prometheus.

Expected Results

All services are reachable.

Actual Results

Can't get targets Etcd, Scheduler, Controller-manager.

keepalived: not failing over when apiserver is not ready on first node

It seems the keepalived configuration is not working quite right. I have a cluster made up of 3 masters and switched off the first one; failover to the second one worked as expected. But when I turn the first master back on, it gets priority again despite the check script for port 6443 failing. journald logs:

Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Assigned address 192.168.1.221 for interface ens160
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Registering gratuitous ARP shared channel
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) removing VIPs.
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Entering BACKUP STATE (init)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP sockpool: [ifindex(  2), family(IPv4), proto(112), fd(11,14), unicast, address(192.168.1.221)]
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_rke2server` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_rke2server) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 150 to 145
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: Script `chk_apiserver` now returning 1
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: VRRP_Script(chk_apiserver) failed (exited with status 1)
Apr 06 08:42:23 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) Changing effective priority from 145 to 140
Apr 06 08:42:24 k8s-master-0 Keepalived_vrrp[1218]: (VI_1) received lower priority (139) advert from 192.168.1.222 - discarding

Even with both scripts failing, the priority stays at 140, while the first backup server only has a priority of 139 (not quite sure why that is, actually; the scripts should be working there), so it will never switch to the backup server (except when the first one is completely offline).

logs of second master while the first one was shut down:

Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Receive advertisement timeout
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering MASTER STATE
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) setting VIPs.
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:40 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Sending/queueing gratuitous ARPs on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:41:45 k8s-master-1 Keepalived_vrrp[86970]: Sending gratuitous ARP on ens160 for 192.168.1.226
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Master received advert from 192.168.1.221 with higher priority 140, ours 139
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) Entering BACKUP STATE
Apr 06 08:42:23 k8s-master-1 Keepalived_vrrp[86970]: (VI_1) removing VIPs.

I suspect something is off with the weights in the keepalived config

Question: Kubevip deployment clarification

Summary

While trying to deploy RKE2 in HA mode with kube-vip, it is failing.
I would like to confirm whether the playbook.yml parameters below are correct for a kube-vip deployment on bare-metal servers.

Issue Type

Bug Report

Ansible Version

ansible [core 2.12.7]
  config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
  configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 2.10.1
  libyaml = True

Steps to Reproduce

When I try to use the below to deploy RKE2, I am facing an error mentioning syntax error.

$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
     - role: lablabs.rke2

Inventory file used

[server]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server

[agent]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent

[All:children]
server
agent

Expected Results

Deploy RKE2 servers with HA

Actual Results

After deploying the mentioned playbook, I am facing a proxy error and eventually the services go down.

bug: kubectl not installed on the first server

Update: Not a bug. Solved by running the Ansible controller on a Linux VM (instead of the problematic WSL2) and disabling firewalld in every VM, as indicated in the documentation:

Firewalld conflicts with RKE2's default Canal (Calico + Flannel) networking stack. To avoid unexpected behavior, firewalld should be disabled on systems running RKE2.
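A minimal sketch of disabling firewalld across the inventory (an illustration, assuming firewalld is installed and the k8s_cluster group from the README's inventory example):

- name: Disable firewalld before installing RKE2
  hosts: k8s_cluster
  become: yes
  tasks:
    - name: Stop and disable firewalld
      ansible.builtin.systemd:
        name: firewalld
        state: stopped
        enabled: false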


Summary

I tried to install lablabs.rke2 (1.12.0) on 5 VMs running Rocky Linux 8.6 (3 masters and 2 workers) with HA+kubevip. However, the Wait for the first server be ready task failed after 40 retries. The error message is

/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory

I checked the first server but didn't find the /var/lib/rancher directory. I only found a weird /etc/rancher/rke2/config.yaml on the first server.

Issue Type

Bug Report

Ansible Version

ansible [core 2.13.1]
  config file = None
  configured module search path = ['/home/<redacted>/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/<redacted>/.local/lib/python3.10/site-packages/ansible
  ansible collection location = /home/<redacted>/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/<redacted>/.local/bin/ansible
  python version = 3.10.5 (main, Jun 11 2022, 16:53:24) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = False

Steps to Reproduce

Commands:

$ ansible-galaxy install lablabs.rke2
$ ansible-playbook -i hosts playbook.yaml
$ ansible-playbook -i hosts playbook.yaml

P.S. The first run failed on the "Start RKE2 service on the first server" task because rke2-server.service isn't ready yet.

playbook.yaml:

---
    - name: Deploy RKE2
      hosts: k8s_cluster
      vars:
        rke2_ha_mode: true
        rke2_ha_mode_keepalived: false
        rke2_ha_mode_kubevip: true
        rke2_interface: ens192
        rke2_loadbalancer_ip_range: 192.168.3.191-192.168.3.199
        rke2_server_taint: true
        rke2_api_ip : 192.168.3.190
        rke2_download_kubeconf: true
      roles:
         - role: lablabs.rke2

hosts:

k8s_cluster:
  children:
    masters:
      hosts:
        master01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master01.mydomain.com
          rke2_type: server
        master02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master02.mydomain.com
          rke2_type: server
        master03.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/master03.mydomain.com
          rke2_type: server
    workers:
      hosts:
        worker01.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker01.mydomain.com
          rke2_type: agent
        worker02.mydomain.com:
          ansible_user: root
          ansible_ssh_private_key_file: ~/.ssh/worker02.mydomain.com
          rke2_type: agent

Expected Results

The cluster is deployed.

Actual Results

TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (39 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (38 retries left).
...
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (3 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (2 retries left).
FAILED - RETRYING: [master01.mydomain.com]: Wait for the first server be ready (1 retries left).
fatal: [master01.mydomain.com]: FAILED! => {"attempts": 40, "changed": false, "cmd": "set -o pipefail\n/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep \"master01.mydomain.com\"\n", "delta": "0:00:00.005873", "end": "2022-07-12 11:50:39.491521", "msg": "non-zero return code", "rc": 1, "start": "2022-07-12 11:50:39.485648", "stderr": "/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory", "stderr_lines": ["/bin/bash: line 1: /var/lib/rancher/rke2/bin/kubectl: No such file or directory"], "stdout": "", "stdout_lines": []}

...

TASK [lablabs.rke2 : Download RKE2 kubeconfig to localhost] ************************************************************
fatal: [master02.mydomain.com -> master01.mydomain.com]: FAILED! => {"changed": false, "msg": "the remote file does not exist, not transferring, ignored"}

...

PLAY RECAP *************************************************************************************************************
master01.mydomain.com  : ok=15   changed=1    unreachable=0    failed=1    skipped=13   rescued=0    ignored=0
master02.mydomain.com  : ok=10   changed=0    unreachable=0    failed=1    skipped=15   rescued=0    ignored=0
master03.mydomain.com  : ok=10   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0
worker01.mydomain.com  : ok=10   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0
worker02.mydomain.com  : ok=10   changed=0    unreachable=0    failed=0    skipped=15   rescued=0    ignored=0

bug: while deploying a new cluster it is stuck in the "Restore etcd" block execution

Summary

While creating a new cluster, Ansible was stuck in the "Restore etcd" block execution for a long time.

Issue Type

Bug Report

Ansible Version

ansible [core 2.12.7]
  config file = /home/mgmt/ansible/rke2-setup/ansible.cfg
  configured module search path = ['/home/mgmt/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /home/mgmt/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 2.10.1
  libyaml = True

Steps to Reproduce

When I try to use the below to deploy RKE2, I am facing an error mentioning syntax error.

$ cat playbook.yaml
---
- name: Deploy RKE2
  hosts: All
  become: yes
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: true
    rke2_api_ip: 192.168.5.220
    rke2_api_cidr: 24
    rke2_interface: eno1
    rke2_kubevip_svc_enable: true
    rke2_loadbalancer_ip_range: 192.168.5.221-192.168.5.254
    rke2_additional_sans: [infra.example.com]
    rke2_version: v1.21.11+rke2r1
    rke2_artifact_path: /rke2/artifact
    rke2_airgap_copy_sourcepath: /rke2/artifact
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_disable:
      - rke2-canal
      - rke2-calico
      - rke2-kube-proxy
      - rke2-multus
    rke2_cni:
      - cilium
    rke2_airgap_copy_additional_tarballs:
      - rke2-images-cilium.linux-{{ rke2_architecture }}.tar.zst
    rke2_drain_node_during_upgrade: false
  roles:
     - role: lablabs.rke2

Inventory file used

[masters]
server1 ansible_host=192.168.3.11 rke2_type=server
server2 ansible_host=192.168.3.12 rke2_type=server
server2 ansible_host=192.168.3.13 rke2_type=server

[workers]
agent-2 ansible_host=192.168.3.23 rke2_type=agent
agent-2 ansible_host=192.168.3.43 rke2_type=agent
agent-3 ansible_host=192.168.3.34 rke2_type=agent

[k8s_cluster:children]
server
agent

Expected Results

The entire cluster should be deployed with the required manifests and config information.

Actual Results

During the deployment of the server, the "Restore etcd" block of the first_server.yml tasks was stuck for a long time. I commented out the block and proceeded to complete the deployment.

bug: ansible_default_ipv4 fact & IPv6 only machine

Summary

Inside /templates/keepalived.conf.j2 the Ansible fact "ansible_default_ipv4" is used three times.
If you want to run an IPv6 / hybrid cluster whose nodes have an IPv6 main address, you will run into one of these two problems:

When no IPv4 interface is configured:
The execution fails with a fatal error and Ansible will stop the provisioning.

When both IPv4 and IPv6 interfaces are configured:
It will setup the keepalived VRRP on the wrong interface.

As a workaround, I have just changed all occurrences of "ansible_default_ipv4" to "ansible_default_ipv6".

If possible, it would be nice to integrate a check which evaluates whether the provided node IP(s) are IPv4 or IPv6 and then calls the corresponding Ansible fact based on that check. An even easier workaround would be a role variable to enable IPv6.

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.2]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/anon/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.10/site-packages/ansible
  ansible collection location = /home/anon/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.10.9 (main, Dec 19 2022, 17:35:49) [GCC 12.2.0] (/usr/bin/python)
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

Inventory:
[masters]
rancher-server01 ansible_host=[IPv6-Address] rke2_type=server
rancher-server02 ansible_host=[IPv6-Address] rke2_type=server
rancher-server03 ansible_host=[IPv6-Address] rke2_type=server

[workers]
rancher-node01 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node02 ansible_host=[IPv6-Address] rke2_type=agent
rancher-node03 ansible_host=[IPv6-Address] rke2_type=agent

[k8s_cluster:children]
masters
workers

Playbook:

- name: Deploy RKE2
  hosts: all
  become: yes
  become_user: root
  become_method: su
  vars:
    rke2_ha_mode: true
    rke2_api_ip: [IPv6-Address]
    rke2_server_options:
      - "cluster-cidr: 2001:cafe:42:0::/56"
      - "service-cidr: 2001:cafe:42:1::/112"
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
          - "https://registry.ipv6.docker.com"
  roles:
     - role: lablabs.rke2

Expected Results

Provisioning will run through all steps as usual and will bind keepalived VRRP on IPv6 Interface.

Actual Results

Provisioning fails at keepalived VRRP if no IPv4 is setup or will use wrong interface in an IPv4/IPv6 hybrid setup.

Timeout on rke2_version: v1.21.6+rke2r1

Hi,

On rke2_version values other than the default (v1.21.2+rke2r1):

Tested on:
rke2_version: v1.21.5+rke2r2
rke2_version: v1.21.6+rke2r1

It is failing on the "Wait for the first server be ready" task.

TASK [lablabs.rke2 : Wait for the first server be ready] ***************************************************************************************************************
FAILED - RETRYING: Wait for the first server be ready (40 retries left).
FAILED - RETRYING: Wait for the first server be ready (39 retries left).

Log from rke2-server:

Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Active TLS secret rke2-serving (ver=296) (count 9): map[listener.cattle.io/cn-10.149.100.141:10.149.100.141 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-rke-test-1-dc1-mgmt:rke-test-1-dc1-mgmt listener.cattle.io/fingerprint:SHA1=9110B4A56B425CE1ADF5FD2C7E4536E2E0097175]"
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-canal\", UID:\"45e39394-4aa4-4c41-9d6c-6400d2fed972\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"297\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-canal.yaml\""
Nov 04 14:29:38 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:38+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-cilium\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-cilium.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-coredns\", UID:\"89e3aa88-84a6-49be-9a0f-668817ed3d61\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"327\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-coredns.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-ingress-nginx\", UID:\"3ae85ff0-c634-4186-b9cb-79d0f3f8aecc\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"343\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-ingress-nginx.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-kube-proxy\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-kube-proxy.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'ApplyingManifest' Applying manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-metrics-server\", UID:\"f61febdf-962f-4d43-a7ac-a00806bd8f42\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"364\", FieldPath:\"\"}): type: 'Normal' reason: 'AppliedManifest' Applied manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-metrics-server.yaml\""
Nov 04 14:29:39 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:39+01:00" level=info msg="Event(v1.ObjectReference{Kind:\"Addon\", Namespace:\"kube-system\", Name:\"rke2-multus\", UID:\"\", APIVersion:\"k3s.cattle.io/v1\", ResourceVersion:\"\", FieldPath:\"\"}): type: 'Normal' reason: 'DeletingManifest' Deleting manifest at \"/var/lib/rancher/rke2/server/manifests/rke2-multus.yaml\""
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=rke-test-1-dc1-mgmt --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Stopped tunnel to 127.0.0.1:9345"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Connecting to proxy" url="wss://10.149.100.141:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:9345/v1-rke2/connect"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Nov 04 14:29:40 rke-test-1-dc1-mgmt rke2[4881]: time="2021-11-04T14:29:40+01:00" level=info msg="Handling backend connection request [rke-test-1-dc1-mgmt]"

Rolling restart task not health checking

The recently contributed rolling restart task is simply restarting the rke2 service on each node in order, but in my experience the service status being ready in systemd doesn't necessarily mean the node is actually ready and good to go, especially when upgrading master nodes.

I think we should add health checks and only restart the next node once the previous one is confirmed to be up and running again to avoid any potential cluster breakage, at least for the master nodes. This is less serious for worker nodes.

Currently we're using an extra playbook to run these tasks serially, because there isn't a straightforward way to integrate this into the Ansible role.

Posting the playbook below for inspiration, maybe somebody has a better idea how to get this into the role.
Edit: The one-minute pause is there because I noticed that sometimes the health checks work directly after restarting the rke2 service, but then potentially fail for the next few minutes until the service is ready again.

---
- name: Restart RKE2 service and check health
  hosts: masters
  become: yes
  serial: 1
  tasks:
    - name: Restart RKE2 server on master nodes
      ansible.builtin.service:
        name: "rke2-server.service"
        state: restarted
    - name: Pause for 1 minute
      pause:
        minutes: 1
    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/server/tls/client-ca.crt --key /var/lib/rancher/rke2/server/tls/client-ca.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5
    - name: Healthcheck RKE2
      ansible.builtin.command: curl -k https://127.0.0.1:9345/v1-rke2/readyz
      register: healthcheck_rke_result
      until: healthcheck_rke_result.rc == 0
      retries: 100
      delay: 5

- name: Restart RKE2 service and check health
  hosts: workers
  become: yes
  serial: "30%"
  tasks:
    - name: Restart RKE2 server on worker nodes
      ansible.builtin.service:
        name: "rke2-agent.service"
        state: restarted
    - name: Pause for 1 minute
      pause:
        minutes: 1
    - name: Healthcheck Kube Apiserver
      ansible.builtin.command: curl -k https://localhost:6443/readyz --cert /var/lib/rancher/rke2/agent/client-kubelet.crt --key /var/lib/rancher/rke2/agent/client-kubelet.key
      register: healthcheck_result
      until: healthcheck_result.stdout == 'ok'
      retries: 100
      delay: 5

bug: [BREAKING] disable_kube_proxy is commented in default/main.yaml

Summary

In the default values in defaults/main.yml, disable_kube_proxy is commented out. This breaks all stages concerning the first node, as the variable is referenced there.

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.4]
  config file = None
  configured module search path = ['/home/cloudadm/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/cloudadm/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/cloudadm/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/cloudadm/.local/bin/ansible
  python version = 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

Run the role without explicitly assigning a value to disable_kube_proxy.

Expected Results

The role should run smoothly with disable_kube_proxy set to true

Actual Results

The role breaks when it calls the variable during a check.

Use a custom rke2 config.yaml file instead of template

Hello,

Thanks for the role, working great !

I have a full config file for the RKE2 server with multiple parameters (etcd-s3).
I tried to use it directly with

rke2_server_options: "{{ lookup('file', '../config.yml') | from_yaml }}"

but unfortunately it only writes the keys, and not the values, into the file generated from the template.

It would be marvellous to be able to use one's own config file.
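Since the defaults describe rke2_server_options as a list of "key: value" strings, one hedged workaround is to convert the custom config file into that shape before the role runs; a sketch assuming the ../config.yml mentioned above contains a shallow YAML mapping:

- name: Deploy RKE2 with options taken from a custom config file
  hosts: masters
  become: yes
  pre_tasks:
    - name: Convert config.yml entries into "key: value" strings
      ansible.builtin.set_fact:
        rke2_server_options: "{{ (rke2_server_options | default([])) + [item.key ~ ': ' ~ (item.value | to_json)] }}"
      loop: "{{ lookup('file', '../config.yml') | from_yaml | dict2items }}"
  roles:
     - role: lablabs.rke2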

Playbook stuck while starting the RKE2 service on agents

Summary

In the troubleshooting section here: https://github.com/lablabs/ansible-role-rke2#troubleshooting, it mentions that it might be a network limitation.

The problem is that the RKE2 script is never executed on the agent, because the task is conditioned on the installed_rke2_version variable, which in turn depends on the condition '"rke2-server.service" in ansible_facts.services'.

Below are the changes I made to fix the issue:

Before the "Run AirGap RKE2 script" task, I added the following tasks, which check that the rke2 binary path exists instead of relying on the line when: '"rke2-server.service" in ansible_facts.services'.

- name: Check rke2 bin exists
  ansible.builtin.stat:
    path: "{{ rke2_bin_path }}"
  register: rke2_exists

- name: Check RKE2 version
  ansible.builtin.shell: |
    set -o pipefail
    {{ rke2_bin_path }} --version | grep -E "rke2 version" | awk '{print $3}'
  args:
    executable: /bin/bash
  changed_when: false
  register: installed_rke2_version
  when: rke2_exists.stat.exists

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.2]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] (/usr/bin/python3)
  jinja version = 3.0.3
  libyaml = True

Steps to Reproduce

- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_version: v1.26.0+rke2r2    
    rke2_api_ip: 192.168.1.10
    rke2_download_kubeconf: true    
    rke2_server_node_taints:
      - 'CriticalAddonsOnly=true:NoExecute'
    rke2_cni:
      - cilium
  roles:
     - role: lablabs.rke2
[masters]
master-01 ansible_host=192.168.1.10 rke2_type=server
master-02 ansible_host=192.168.1.11 rke2_type=server
master-03 ansible_host=192.168.1.12 rke2_type=server

[workers]
worker-01 ansible_host=192.168.1.20 rke2_type=agent
worker-02 ansible_host=192.168.1.21 rke2_type=agent

[k8s_cluster:children]
masters
workers

Expected Results

Worker nodes should be provisioned once the rke2.sh script has been executed in the following task:

- name: Run RKE2 script

Actual Results

It just hangs until the timeout.

Add restore snapshot functionality

Hello,

First of all, thank you for this great work! I love it!
May I suggest a new feature?
I would be very interested in an option to restore an etcd snapshot at installation time.
I mean (a rough sketch of these steps follows at the end of this message):

  • deploy rke2 on the first node,
  • stop rke2-server on the first node,
  • run "rke2 server --cluster-reset --cluster-reset-restore-path=<snapshot path>", with the snapshot path passed as an Ansible variable,
  • start rke2-server on the first node,
  • deploy rke2 on the remaining nodes

Do you think it could be possible?

Best regards,

Olivier
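
For reference, the flow described above might look roughly like the following tasks on the first server. This is only a sketch: rke2_etcd_snapshot_path is a hypothetical variable, and the binary location may differ depending on the install method.

- name: Stop rke2-server before the restore
  ansible.builtin.service:
    name: rke2-server.service
    state: stopped

- name: Reset the cluster from the given etcd snapshot
  ansible.builtin.command: >-
    /usr/local/bin/rke2 server
    --cluster-reset
    --cluster-reset-restore-path={{ rke2_etcd_snapshot_path }}
  changed_when: true

- name: Start rke2-server again
  ansible.builtin.service:
    name: rke2-server.service
    state: started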

feature: Cluster/Service CIDR

Summary

By default RKE2 sets:

  • 10.42.0.0/16 - for cluster network
  • 10.43.0.0/16 - for service network

In a manual installation, those values can be altered in /etc/rancher/rke2/config.yaml:

cluster-cidr: 10.1.0.0/16
service-cidr: 10.2.0.0/16

The goal is to add support for the cluster-cidr and service-cidr options in this role.
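
Until dedicated variables exist, the same effect can probably be achieved through rke2_server_options, assuming the role writes each list item verbatim into config.yaml:

rke2_server_options:
  - "cluster-cidr: 10.1.0.0/16"
  - "service-cidr: 10.2.0.0/16"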

Issue Type

Feature Idea

bug: AIRGAP COPY - multus and calico images not being copied to agent/images path

Summary

Environmental Info:
RKE2 Version: v1.23.7+rke2r2

Node(s) CPU architecture, OS, and Version:
Linux master-01 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 master
4 workers
airgap environment
CNI: multus,calico

Describe the bug:

When bootstrapping an RKE2 airgap cluster with CNI plugins other than the default (in my case, multus and calico), the CNI images are copied into /var/lib/rancher/rke2/artifacts/. But in order to deploy the CNIs, the tarballs (compressed zst or tar.gz) also have to be copied to /var/lib/rancher/rke2/agent/images.

Please note that I am referring to the default paths for artifacts and images.

I already opened an issue for RKE2: rancher/rke2#3147

And the response was that it is the expected behaviour, so the images of the CNIs have to be manually copied to this path.

Issue Type

Bug Report

Ansible Version

ansible [core 2.12.6]
  config file = /home/test/ANA/Offline-RKE2-ANA/ansible.cfg
  configured module search path = ['/home/test/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/test/.local/lib/python3.8/site-packages/ansible
  ansible collection location = /home/test/ANA/Offline-RKE2/collections
  executable location = /home/test/.local/bin/ansible
  python version = 3.8.10 (default, Mar 15 2022, 12:22:08) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

---
- name: RKE2 K8S BOOTSTRAPPING
  hosts: all
  gather_facts: true
  become: true

  vars_files:
    - vars/global.yaml

  vars:
    rke2_ha_mode: false
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_ha_mode_keepalived: false
    rke2_ha_mode_kubevip: false
    rke2_additional_sans:
      - ana.pt
      - k8s.ana.pt
      - k8s-vmware.ana.pt
    rke2_apiserver_dest_port: 6443
    rke_server_taint: true
    rke2_token: GIzY5kxm9WRGxBekiifQ
    rke2_version: v1.23.7+rke2r2
    rke2_channel: stable
    rke2_artifact_path: /var/lib/rancher/rke2/artifacts
    rke2_airgap_copy_sourcepath: local_artifacts/local_artifacts_rke2
    rke2_cni:
      - multus
      - calico
    rke2_download_kubeconf: true
    rke2_download_kubeconf_file_name: rke2.yaml
    rke2_download_kubeconf_path: /tmp
    nexus_container_registry: "{{ nexus_ingress_cr_host }}"
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint: 
          - "https://{{ nexus_container_registry }}"
      - name: quay.io
        endpoint: 
          - "https://{{ nexus_container_registry }}"
      - name: docker.elastic.co
        endpoint: 
          - "https://{{ nexus_container_registry }}"
      - name: cr.fluentbit.io
        endpoint:
          - "https://{{ nexus_container_registry }}"
      - name: registry.gitlab.com
        endpoint: 
          - "https://{{ nexus_container_registry }}"
    rke2_custom_registry_configs:
      - endpoint: "\"{{ nexus_container_registry }}\""
        config:
          tls: 
            insecure_skip_verify: true
    rke2_custom_manifests:
      - roles/lablabs.rke2/files/rke2-ingress-nginx-config.yml

    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst
      - rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst
      
  roles:
    - role: lablabs.rke2

Expected Results

RKE2 nodes in Ready state.

Actual Results

The first server can't find the multus or calico images. The artifact files are copied to `{{ rke2_artifact_path }}` but not to `agent/images`.
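
A possible workaround sketch (not part of the role; the paths follow the defaults mentioned above) is to copy the CNI archives into the directory that RKE2 scans on startup:

- name: Ensure the agent images directory exists
  ansible.builtin.file:
    path: /var/lib/rancher/rke2/agent/images
    state: directory
    mode: "0755"

- name: Copy CNI image archives into the agent images directory
  ansible.builtin.copy:
    src: "{{ rke2_artifact_path }}/{{ item }}"
    dest: "/var/lib/rancher/rke2/agent/images/{{ item }}"
    mode: "0644"
    remote_src: true
  loop:
    - "rke2-images-multus.linux-{{ rke2_architecture }}.tar.zst"
    - "rke2-images-calico.linux-{{ rke2_architecture }}.tar.zst"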

feature: s3 snapshot

Summary

Could you implement etcd snapshots to MinIO or S3?

In RKE2 the following config.yaml options exist:

etcd-expose-metrics: false

etcd-snapshot-name: "prefix_name"
etcd-snapshot-schedule-cron: "0 */1 * * *"
etcd-snapshot-retention: 360
etcd-s3: true
etcd-s3-region: "eu-west-1"
etcd-s3-endpoint: "s3-eu-west-1.amazonaws.com"
etcd-s3-bucket: "my-bucket"
etcd-s3-folder: "rke2-test"
etcd-s3-access-key: "AKIA2Q..."
etcd-s3-secret-key: "jkN0xL...."

Issue Type

Feature Idea

Questions /etc/rancher/rke2/config.yaml

Hi, Thank you for this role.

I have a question about the generation of the config file /etc/rancher/rke2/config.yaml. I'm not sure how, but only the first server node receives the right config with tls-san. The other server nodes contain just:

server: https://<keepalived_IP>:9345
token: 
snapshotter: overlayfs

So with the rke2.yaml kubeconfig there might be TLS issues due to the missing keepalived IP on the other server nodes.

Do you encounter this issue?

bug: rke2 upgrade, agent nodes should be upgraded after all the master nodes

Summary

I upgraded RKE2 from v1.22.9 to v1.23.9, which actually worked fine, but I noticed that some worker nodes were upgraded in between the master nodes, which goes against the RKE2 recommendation:

Note: Upgrade the server nodes first, one at a time. Once all servers have been upgraded, you may then upgrade agent nodes.

see https://docs.rke2.io/upgrade/basic_upgrade/

Ansible Output:

TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-0] ***
changed: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-master-0 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-0] ***
skipping: [platform-rancher-master-k8s-master-0]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-1] ***
changed: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-1] ***
skipping: [platform-rancher-master-k8s-master-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-1] ***
changed: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
FAILED - RETRYING: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2]: Wait for all nodes to be ready again (100 retries left).
ok: [platform-rancher-master-k8s-worker-1 -> platform-rancher-master-k8s-master-2(10.10.50.103)]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-worker-1] ***
skipping: [platform-rancher-master-k8s-worker-1]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-master-2] ***
changed: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Wait for all nodes to be ready again] *********************
ok: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Uncordon the node platform-rancher-master-k8s-master-2] ***
skipping: [platform-rancher-master-k8s-master-2]
TASK [lablabs.rke2 : Cordon and Drain the node platform-rancher-master-k8s-worker-0] ***
skipping: [platform-rancher-master-k8s-worker-0]
TASK [lablabs.rke2 : Restart RKE2 service on platform-rancher-master-k8s-worker-0] ***

Issue Type

Bug Report

Ansible Version

ansible [core 2.12.7]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.5 (main, Jul 13 2022, 05:45:22) [GCC 10.2.1 20210110]
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

Trigger an RKE2 upgrade, e.g. from 1.22.9 to 1.23.9.

Expected Results

Master nodes should be upgraded first, then the worker nodes

Actual Results

Nodes are upgraded in a seemingly random order.

bug: kubelet server certificates does not include keepalived VIP

Summary

When using the HA setup with Keepalived, the server certificate provisioned for the kubelet does not include the Keepalived VIP. This causes TLS verification issues when performing various operations, like viewing logs or port forwarding, on the current leader.

Issue Type

Bug Report

Ansible Version

ansible [core 2.14.6]
  config file = /Users/moray/.ansible.cfg
  configured module search path = ['/Users/moray/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /opt/homebrew/Cellar/ansible/7.6.0/libexec/lib/python3.11/site-packages/ansible
  ansible collection location = /Users/moray/.ansible/collections:/usr/share/ansible/collections
  executable location = /opt/homebrew/bin/ansible
  python version = 3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)] (/opt/homebrew/Cellar/ansible/7.6.0/libexec/bin/python3.11)
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

  1. Install RKE2 (sample playbook below)
- hosts: rke
  become: true
  roles:
    - role: lablabs.rke2
  vars:
    rke2_ha_mode: true
    rke2_ha_mode_keepalived: true
    rke2_version: v1.26.7+rke2r1
    rke2_install_bash_url: https://get.rke2.io
    rke2_api_ip: 10.64.0.9
    rke2_disable:
      - rke2-ingress-nginx
    rke2_cni: canal
    rke2_cluster_group_name: rke
    rke2_servers_group_name: rke_master
    # Ansible group including worker nodes
    rke2_agents_group_name: rke_worker
    rke2_server_options:
      - "disable-cloud-controller: true"
  2. Try viewing logs of any pod on the current Keepalived leader

Expected Results

The TLS certificate generated for the kubelet includes the Keepalived VIP (10.64.0.9 in the example above), so issuing kubectl logs and kubectl port-forward commands on pods on the current leader works without problems.

Actual Results

The TLS certificate for Kubelet does not include the Keepalived VIP (10.64.0.9 in the example above). Issuing kubectl logs or kubectl port-forward commands on pods on the current leader results in the following error:

Error from server: Get "https://10.64.0.9:10250/containerLogs/kube-system/kube-proxy-master-0/kube-proxy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, 10.64.0.10, not 10.64.0.9

Additional information:

  • The API server serving certificate does include the VIP.
  • The leader's internal IP address always shows up as the VIP.
  • I have tried setting the RKE2 options node-ip and advertise-address to the non-virtual IP, but to no avail.

Add rolling restart when upgrading RKE2

When changing the rke2_version variable, the new version is installed but doesn't take effect until the rke2-server service is restarted.

We should look at adding an Ansible handler that checks whether the version has changed and, if it has, restarts each host one by one.
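
A rough sketch of the idea, not the role's actual implementation; the group names and the assumption that inventory hostnames match the Kubernetes node names are illustrative:

- name: Restart RKE2 only where the running version differs (sketch)
  hosts: k8s_cluster
  become: true
  serial: 1                                  # one node at a time
  tasks:
    - name: Read the version the node is currently running
      ansible.builtin.command: >-
        /var/lib/rancher/rke2/bin/kubectl
        --kubeconfig /etc/rancher/rke2/rke2.yaml
        get node {{ inventory_hostname }}
        -o jsonpath={.status.nodeInfo.kubeletVersion}
      delegate_to: "{{ groups['masters'] | first }}"
      register: running_version
      changed_when: false

    - name: Restart RKE2 so the newly installed version takes effect
      ansible.builtin.service:
        name: "rke2-{{ rke2_type }}.service"
        state: restarted
      when: rke2_version not in running_version.stdout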

bug: Agents provisioned as Servers

Summary

Hi There,

I'm trying to expand an existing cluster that is currently composed of 3 nodes, each acting as both master and worker.

I would like to expand the cluster by adding Worker-only nodes and for that I've got the following definitions:

Ansible runs fine, but in the end all nodes are reported as control-plane nodes:

# kubectl get nodes -w
NAME       STATUS     ROLES                       AGE     VERSION
master-1   Ready      control-plane,etcd,master   7h54m   v1.22.10+rke2r2
master-2   Ready      control-plane,etcd,master   6h45m   v1.22.10+rke2r2
master-3   Ready      control-plane,etcd,master   6h50m   v1.22.10+rke2r2
worker-1   Ready      control-plane,etcd,master   90s     v1.22.10+rke2r2 

Am I doing something wrong, or could this be a bug?

Issue Type

Bug Report

Ansible Version

ansible [core 2.13.2]
  config file = None
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /root/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /root/.local/bin/ansible
  python version = 3.9.13 (main, Jun 10 2022, 09:50:06) [GCC]
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

inventory.yaml

[masters]
master-1 ansible_host=master-1 rke2_type=server
master-2 ansible_host=master-2 rke2_type=server
master-3 ansible_host=master-3 rke2_type=server

[workers]
worker-1 ansible_host=worker-1 rke2_type=agent

[k8s_cluster:children]
masters
workers

main.yaml

# Install RKE2
- name: RKE2 Setup on Cluster-wide
  hosts: k8s_cluster
  roles:
    - role: RKE2Cluster

vars.yaml

---
# RKE2 Settings
os_privileged_group: gok8sadm
rke2_type: server
rke2_airgap_mode: false
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_api_ip: 192.168.86.100
rke2_interface: eth0
rke2_loadbalancer_ip_range: 192.168.86.101-192.168.86.105
rke2_kubevip_cloud_provider_enable: true
rke2_kubevip_svc_enable: true
rke2_additional_sans: [ my-k8s-dev.hutger.xyz ]
rke2_apiserver_dest_port: 6443
rke2_disable:
  - rke2-ingress-nginx
rke2_server_taint: false
rke2_token: my-token
rke2_version: v1.22.10+rke2r2
rke2_data_path: /var/lib/rancher/rke2
rke2_channel: stable
rke2_cni: canal
rke2_download_kubeconf: true
rke2_download_kubeconf_file_name: rke2.yaml
rke2_download_kubeconf_path: /tmp
rke2_servers_group_name: masters
rke2_agents_group_name: workers

Expected Results

No roles associated with worker-1:

kubectl get nodes -w
NAME       STATUS     ROLES                       AGE     VERSION
master-1   Ready      control-plane,etcd,master   7h54m   v1.22.10+rke2r2
master-2   Ready      control-plane,etcd,master   6h45m   v1.22.10+rke2r2
master-3   Ready      control-plane,etcd,master   6h50m   v1.22.10+rke2r2
worker-1   Ready      <none>                      90s     v1.22.10+rke2r2

Actual Results

Worker-1 being set as Master.

kubectl get nodes -w
NAME       STATUS     ROLES                       AGE     VERSION
master-1   Ready      control-plane,etcd,master   7h54m   v1.22.10+rke2r2
master-2   Ready      control-plane,etcd,master   6h45m   v1.22.10+rke2r2
master-3   Ready      control-plane,etcd,master   6h50m   v1.22.10+rke2r2
worker-1   Ready      control-plane,etcd,master   90s     v1.22.10+rke2r2 

feature: support for adding a new worker node easily without impacting the existing cluster

Summary

While trying to add a new worker node to an existing HA cluster, the role restarts the RKE2 services on all existing master and worker nodes. It also takes a long time to complete the Ansible run, which could be improved.

While adding new nodes I commented out the "- name: Wait for remaining nodes to be ready" task in remaining_nodes.yml, and also the "Rolling restart" task in main.yml.
This made the services on the newly added worker node start much sooner.

Could we have parameters to support adding new workers to an existing cluster?

Question: even during a new server deployment, do we need the "- name: Wait for remaining nodes to be ready" task in remaining_nodes.yml?

Issue Type

Feature Idea

Install stuck in "wait for the first server be ready" with kubevip, cilium and kube proxy disabled

Summary

During the initial installation of a cluster using RKE2 version 1.27.1+rke2r1 with kube-vip and Cilium, and with kube-proxy disabled, the first node is stuck in the NotReady state, preventing the successful completion of the cluster installation process.

The workaround I found:

  • Connect to the first server with SSH
  • Manually set the rke2_api_ip address on the network interface: ip addr add 192.0.2.20 dev ens224
  • Restart the rke2 service: systemctl restart rke2-server.service

So far I am not sure why this is happening, possibly due to the disabling of kube-proxy.
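
A sketch of those manual steps as Ansible tasks (the interface and VIP values come from the variables above; the rc 2 handling is an assumption about iproute2's "address already assigned" exit code):

- name: Temporarily assign the API VIP to the first server
  ansible.builtin.command: ip address add {{ rke2_api_ip }}/32 dev {{ rke2_interface }}
  register: vip_add
  changed_when: vip_add.rc == 0
  failed_when: vip_add.rc not in [0, 2]   # rc 2: the address is already assigned

- name: Restart rke2-server so it can reach the API on the VIP
  ansible.builtin.service:
    name: rke2-server.service
    state: restarted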

Issue Type

Bug Report

Ansible Version

Ansible 2.14.8

Steps to Reproduce

Deploy RKE2 with the following variables:

rke2_version: v1.27.1+rke2r1
rke2_cluster_group_name: kubernetes_cluster
rke2_servers_group_name: kubernetes_masters
rke2_agents_group_name: kubernetes_workers
rke2_ha_mode: true
rke2_ha_mode_keepalived: false
rke2_ha_mode_kubevip: true
rke2_additional_sans:
  - kubernetes-api.example.net
rke2_api_ip: "192.0.2.20"
rke2_kubevip_svc_enable: false
rke2_interface: "ens224"
rke2_kubevip_cloud_provider_enable: false
rke2_cni: cilium
rke2_disable:
  - rke2-canal
  - rke2-ingress-nginx
rke2_custom_manifests:
  - rke2-cilium-proxy.yaml
disable_kube_proxy: true
rke2_drain_node_during_upgrade: true
rke2_wait_for_all_pods_to_be_ready: true

Here is the content of rke2-cilium-proxy.yaml:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-cilium
  namespace: kube-system
spec:
  valuesContent: |-
    kubeProxyReplacement: strict
    k8sServiceHost: {{ rke2_api_ip }}
    k8sServicePort: {{ rke2_apiserver_dest_port }}
    ipv4NativeRoutingCIDR: 10.43.0.0/15
    hubble:
      enabled: true
      metrics:
        enabled:
        - dns:query;ignoreAAAA
        - drop
        - tcp
        - flow
        - icmp
        - http
      relay:
        enabled: true
        replicas: 3
      ui:
        enabled: true
        replicas: 3
        ingress:
          enabled: false

Expected Results

The first server should at some point be in the Ready state, so that the installation of the cluster succeeds.

Actual Results

[…]
FAILED - RETRYING: [k8s01.example.net]: Wait for the first server be ready (1 retries left).
fatal: [k8s01.example.net]: FAILED! => changed=false 
attempts: 40
cmd: |-
  set -o pipefail
  /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes | grep "k8s01.example.net"
delta: '0:00:00.096538'
end: '2023-08-23 09:31:26.649490'
msg: ''
rc: 0
start: '2023-08-23 09:31:26.552952'
stderr: ''
stderr_lines: <omitted>
stdout: k8s01.example.net   NotReady   control-plane,etcd,master   10m   v1.27.1+rke2r1
stdout_lines: <omitted>

Master nodes fail to join cluster if multiple nodes are joined concurrently

If multiple master node servers try to join the cluster concurrently, the rke2-server service fails on one or more master nodes, with the following error:
"Failed to start Rancher Kubernetes Engine v2 (server)."
When the rke2-server service fails on a host, Ansible considers the failure fatal and stops executing the following tasks on that host.

The result is that the node will eventually join the cluster, because systemd keeps restarting the service until activation succeeds, but the playbook stops executing the tasks on that particular host.

I resolved this issue by adding a retry to the task "Start RKE2 service on the rest of the nodes" in the file remaining_nodes.yml.
Here's the commit on my forked project: GabriFedi97@70abe0d
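
The change is presumably something along these lines (a sketch only; the retry values are illustrative and this is not the role's actual code):

- name: Start RKE2 service on the rest of the nodes
  ansible.builtin.service:
    name: "rke2-{{ rke2_type }}.service"
    state: started
    enabled: true
  register: rke2_service_start
  until: rke2_service_start is succeeded
  retries: 5
  delay: 30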

There is an issue on rke2 that explains the problem: rancher/rke2#349

bug: Molecule test is failing

Summary

CI job with molecule test is failing.

Originally the pipeline started to fail when Molecule 5.0.0 became available on PyPI.
One issue is the Docker plugin (it needs to be installed as the molecule-plugins package instead of molecule[docker]).
But even after this change the pipeline keeps failing with different errors. This needs to be checked and fixed.

Some hints: ansible/molecule#3883

Issue Type

Bug Report

Ansible Version

-

Steps to Reproduce

Run github action CI job with molecule test
Example: https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181

Expected Results

The Molecule test should not fail with a Python error.

Actual Results

https://github.com/lablabs/ansible-role-rke2/actions/runs/4847688181

bug: restore etcd from snapshot not working

Summary

When trying to restore an etcd snapshot, the block "Restore etcd" in the first_server.yml file is skipped due to the following condition: 'and ( "rke2-server.service" is not in ansible_facts.services )'.

Trying without this second condition it works, using only 'when: rke2_etcd_snapshot_file'.

Issue Type

Bug Report

Ansible Version

ansible [core 2.13.3]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/exploit/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/exploit/.local/lib/python3.8/site-packages/ansible
  ansible collection location = /home/exploit/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/exploit/.local/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 3.1.2
  libyaml = True

Steps to Reproduce

Expected Results

The snapshot should be restored.

Actual Results

The snapshot is not restored.

feature: kube vip multiple range/CIDR

Summary

Hello,

Currently rke2_loadbalancer_ip_range only sets range-global, and we can't add a specific subnet to kube-vip.

rke2_loadbalancer_ip_range should be a dict like this:

rke2_loadbalancer_ip_range:
  range-global: 192.168.1.50-192.168.1.100
  range-namespace: 192.168.2.50-192.168.2.100

If you agree with the idea, I can open a PR.

Issue Type

Feature Idea

feature: Enable dual stack network

Summary

Hello,
How can I enable a dual-stack network when initialising the k8s cluster (with IPv4/IPv6 addresses supplied via main.yaml)?

Thank you.

Issue Type

Feature Idea

bug: disable_kube_proxy defaults to true

Summary

As of the latest release, disable_kube_proxy defaults to true, so when creating an RKE2 cluster without specifying a CNI (the default is Canal) you get a broken cluster.

I assume we'd want disable_kube_proxy to default to false.

Issue Type

Bug Report

Ansible Version

-

Steps to Reproduce

Leave disable_kube_proxy and rke2_cni at their defaults:
rke2_cni: canal
disable_kube_proxy: true

Expected Results

Working cluster.

Actual Results

Broken cluster.

bug: Always receive condition check error on `Create the RKE2 etcd snapshot dir`

Summary

When trying to install an RKE2 cluster, I receive an error.

I'm installing a single server.

[masters]

team-edge1-k8s
fatal: [team-edge1-k8s]: FAILED! => {"msg": "The conditional check 'rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services )' failed. The error was: template error while templating string: expected token ')', got '.'. String: {% if rke2_etcd_snapshot_file and ( \"rke2-server.service\" is not in ansible_facts.services ) %} True {% else %} False {% endif %}\n\nThe error appears to be in '/root/.ansible/roles/lablabs.rke2/tasks/first_server.yml': line 40, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n  block:\n    - name: Create the RKE2 etcd snapshot dir\n      ^ here\n"}

Issue Type

Bug Report

Ansible Version

ansible [core 2.12.10]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
  jinja version = 2.10.1
  libyaml = True

Steps to Reproduce

- name: Deploy RKE2
  hosts: all
  become: yes
  vars:
    rke2_download_kubeconf: true
    rke2_interface: ens160
    rke2_version: v1.24.7+rke2r1
    rke2_disable: rke2-ingress-nginx
    rke2_airgap_mode: true
    rke2_airgap_implementation: copy
    rke2_artifact:
      - sha256sum-{{ rke2_architecture }}.txt
      - rke2.linux-{{ rke2_architecture }}.tar.gz
      - rke2-images.linux-{{ rke2_architecture }}.tar.zst
    rke2_custom_registry_mirrors:
      - name: docker.io
        endpoint:
         - 'https://harbor.intent.ai'
        rewrite: '"^rancher/(.*)": "harbor.int.ai/rancher/$1"'
  roles:
     - role: ansible-role-rke2

Expected Results

I expect a clean installation but receive a conditional check error.

Actual Results

team-edge1-k8s         : ok=14   changed=4    unreachable=0    failed=1    skipped=14   rescued=0    ignored=0

feature: Always look for active.

Summary

Imagine your inventory comes from e.g. NetBox.
That means you don't control the order in which the servers appear.
So if you add a new server and run Ansible to add it, it can happen that the two existing servers are listed after the new one.
It would be better to always look for an active server first, and only use the first server in the inventory if no active one is found.

Issue Type

Feature Idea

feature: support CIS security hardening guides

Summary

Hey, I've tried out this role and am already a fan of it over Rancher's role, since this one is published on Ansible Galaxy, which doesn't seem to be on the roadmap for that repo.

One thing I'd love is support for CIS hardening as described in the RKE2 Security Hardening guide. It's included in the rancherfederal repo.

I can add the following as a var:

rke2_server_options:
  - "profile: cis-1.6"

But I get the following error in the logs:

Jun 28 04:12:21 [HOSTNAME] sh[90630]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 28 04:12:22 [HOSTNAME]  rke2[90634]: time="2022-06-28T04:12:22Z" level=fatal msg="missing required: user: unknown user etcd\nmissing required: group: unknown group etcd\ninvalid kernel parameter value vm.overcommit_memory=0 - expected 1\ninvalid kernel parameter value kernel.panic=0 - expected 10\ninvalid kernel parameter value kernel.panic_on_oops=0 - expected 1\n"
Jun 28 04:12:22 [HOSTNAME]  systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Jun 28 04:12:22 [HOSTNAME]  systemd[1]: rke2-server.service: Failed with result 'exit-code'.
Jun 28 04:12:22 [HOSTNAME]  systemd[1]: Failed to start Rancher Kubernetes Engine v2 (server).

which is because it's failing the CIS host-level checks, as noted in the docs:

Checks that host-level requirements have been met. If they haven't, RKE2 will exit with a fatal error describing the unmet requirements.
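
The host-level prerequisites the error complains about could be prepared before running the role, for example with tasks like these (not part of the role; the values follow the error output above and the RKE2 hardening guide, and ansible.posix.sysctl requires the ansible.posix collection):

- name: Create the etcd group required by the CIS profile
  ansible.builtin.group:
    name: etcd
    system: true

- name: Create the etcd user required by the CIS profile
  ansible.builtin.user:
    name: etcd
    group: etcd
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Apply the kernel parameters required by the CIS profile
  ansible.posix.sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: true
  loop:
    - { name: vm.overcommit_memory, value: "1" }
    - { name: kernel.panic, value: "10" }
    - { name: kernel.panic_on_oops, value: "1" }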

Side note: what is the background of this role? Was it created in parallel/independently of the rancherfederal repo?

Issue Type

Feature Idea

sudo: a password is required

I'm using a user with passwordless sudo:

TASK [lablabs.rke2 : Replace loopback IP by master server IP] **********************************************************************************************************
fatal: [rke-test-3-dc3-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}
fatal: [rke-test-1-dc1-mgmt -> localhost]: FAILED! => {"changed": false, "module_stderr": "sudo: a password is required\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}

bug: rke2 config.yaml for server taint

Summary

Setting the server taint leads to a broken config.yaml. At least for me, the task

- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints}} + [ 'CriticalAddonsOnly=true:NoExecute' ] "
  when: rke2_server_taint and rke2_type == 'server'

leads to a broken list in config.yaml, as the combined_node_taints variable is treated as a string instead of a list.
The template line in question:

{% for taint in combined_node_taints %}
  - {{ taint }}
{% endfor %}

The correct syntax should be:

- name: Set server taints
  ansible.builtin.set_fact:
    combined_node_taints: "{{ node_taints + [ 'CriticalAddonsOnly=true:NoExecute' ] }}"
  when: rke2_server_taint and rke2_type == 'server'

This applies both to tasks/first_server.yml and to tasks/remaining_nodes.yml.

I do not understand why this seems to be a problem only for me.

Issue Type

Bug Report

Ansible Version

ansible [core 2.13.1]
  config file = None
  configured module search path = ['/home/user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/user/.local/lib/python3.9/site-packages/ansible
  ansible collection location = /home/user/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/user/.local/bin/ansible
  python version = 3.9.5 (default, Jun  4 2021, 12:28:51) [GCC 7.5.0]
  jinja version = 3.0.3
  libyaml = True

Steps to Reproduce

Run the playbook with server taint enabled

Expected Results

in config.yaml:

node-taint:
  - CriticalAddonsOnly=true:NoExecute

Actual Results

in config.yaml:

node-taint:
  - [
  - ]
  - 
  - [
  - C
  - r
...

Node labels

Hi,

Thank you for a great role; it made my life easier when migrating from RKE1 to RKE2.
I first looked into rancherd, but after moving back to RKE2 I was happy to find this role, so thank you for that ☺️

I have a question regarding node labels.
I found that it was possible to add node labels using the var k8s_node_label.

My question is regarding the documentation for the labels (or rather the k8s_node_label var).
I haven't seen any documentation for it. I found it when looking through the source files (to see if it was possible to add node labels).

The missing documentation makes me a bit worried that it's not ready for use.
Is there a reason for me not to use it? Or is it that it just hasn't been documented (or did I miss it somewhere)?
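
For what it's worth, based purely on the variable name it presumably takes a list of key=value strings, something like the sketch below; the exact format is a guess, so check the role's config template before relying on it:

k8s_node_label:
  - "environment=production"
  - "storage=ssd"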
