
Kube-Hetzner

A highly optimized, easy-to-use, auto-upgradable, HA-default & Load-Balanced, Kubernetes cluster powered by k3s-on-MicroOS and deployed for peanuts on Hetzner Cloud 🤑


🔥 Introducing KH Assistant, our Custom-GPT kube.tf generator to get you going fast. Just tell it what you need! 🚀


About The Project

Hetzner Cloud is a good cloud provider that offers very affordable prices for cloud instances, with data center locations in both Europe and the US.

This project aims to create a highly optimized Kubernetes installation that is easy to maintain, secure, and automatically upgrades both the nodes and Kubernetes. We aimed for functionality as close as possible to GKE's Auto-Pilot. Please note that we are not affiliated with Hetzner, but we do strive to be an optimal solution for deploying and maintaining Kubernetes clusters on Hetzner Cloud.

To achieve this, we build on the shoulders of giants by choosing openSUSE MicroOS as the base operating system and k3s as the k8s engine.


Why OpenSUSE MicroOS (and not Ubuntu)?

  • Optimized container OS that is fully locked down, most of the filesystem is read-only!
  • Hardened by default with an automatic ban for abusive IPs on SSH for instance.
  • Evergreen release, your node will stay current forever, as it piggybacks on openSUSE Tumbleweed's rolling release!
  • Automatic updates by default and automatic rollbacks if something breaks, thanks to its use of BTRFS snapshots.
  • Supports Kured to properly drain and reboot nodes in an HA fashion.

Why k3s?

  • Certified Kubernetes distribution, automatically synced with the upstream k8s source.
  • Fast deployment, as it is a single binary and can be deployed with a single command.
  • Comes with batteries included, with its in-cluster helm-controller.
  • Easy automatic updates, via the system-upgrade-controller.

Features

  • Maintenance-free with auto-upgrades to the latest version of MicroOS and k3s.
  • Multi-architecture support, choose any Hetzner cloud instances, including the cheaper CAX ARM instances.
  • Proper use of the Hetzner private network to minimize latency.
  • Choose between Flannel, Calico, or Cilium as CNI.
  • Optional Wireguard encryption of the Kube network for added security.
  • Traefik or Nginx as ingress controller attached to a Hetzner load balancer with Proxy Protocol turned on.
  • Automatic HA with the default setting of three control-plane nodes and two agent nodes.
  • Autoscaling nodes via the kubernetes autoscaler.
  • Super-HA with Nodepools for both control-plane and agent nodes that can be in different locations.
  • Possibility to have a single node cluster with a proper ingress controller.
  • Can use Klipper as an on-metal LB or the Hetzner LB.
  • Ability to add nodes and nodepools when the cluster is running.
  • Possibility to toggle Longhorn and Hetzner CSI.
  • Encryption at rest fully functional in both Longhorn and Hetzner CSI.
  • Optional use of Floating IPs for use via Cilium's Egress Gateway.
  • Proper IPv6 support for inbound/outbound traffic.
  • Flexible configuration options via variables and an extra Kustomization option.

We use Terraform for deployment, as it's easy to use and Hetzner provides a great Hetzner Terraform Provider.

Getting Started

Follow these simple steps, and the world's cheapest Kubernetes cluster will be up and running.

โœ”๏ธ Prerequisites

First and foremost, you need to have a Hetzner Cloud account. You can sign up for free here.

Then you'll need terraform or tofu, packer (needed only for the initial snapshot creation), the kubectl CLI, and hcloud, the Hetzner CLI, for convenience. The easiest way to install them is with the Homebrew package manager (available on Linux, Mac, and Windows Subsystem for Linux).

brew tap hashicorp/tap
brew install hashicorp/tap/terraform # OR brew install opentofu
brew install packer
brew install kubectl
brew install hcloud

💡 [Do not skip] Creating your kube.tf file and the OpenSUSE MicroOS snapshot

  1. Create a project in your Hetzner Cloud Console, and go to Security > API Tokens of that project to grab the API key; it needs to be Read & Write. Take note of the key! ✅

  2. Generate a passphrase-less ed25519 SSH key pair for your cluster; take note of the respective paths of your private and public keys. Or, see our detailed SSH options. ✅

  3. Now navigate to where you want your project to live and execute the following command, which will get you started with a new folder containing the required files, and will offer to create the needed MicroOS snapshot. ✅

    tmp_script=$(mktemp) && curl -sSL -o "${tmp_script}" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/create.sh && chmod +x "${tmp_script}" && "${tmp_script}" && rm "${tmp_script}"

    Or for fish shell:

    set tmp_script (mktemp); curl -sSL -o "$tmp_script" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/create.sh; chmod +x "$tmp_script"; bash "$tmp_script"; rm "$tmp_script"

    Optionally, for future usage, save that command as an alias in your shell preferences, like so:

    alias createkh='tmp_script=$(mktemp) && curl -sSL -o "${tmp_script}" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/create.sh && chmod +x "${tmp_script}" && "${tmp_script}" && rm "${tmp_script}"'

    Or for fish shell:

    alias createkh='set tmp_script (mktemp); curl -sSL -o "$tmp_script" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/create.sh; chmod +x "$tmp_script"; bash "$tmp_script"; rm "$tmp_script"'

    For the curious, here is what the script does:

    mkdir /path/to/your/new/folder
    cd /path/to/your/new/folder
    curl -sL https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/kube.tf.example -o kube.tf
    curl -sL https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/packer-template/hcloud-microos-snapshots.pkr.hcl -o hcloud-microos-snapshots.pkr.hcl
    export HCLOUD_TOKEN="your_hcloud_token"
    packer init hcloud-microos-snapshots.pkr.hcl
    packer build hcloud-microos-snapshots.pkr.hcl
    hcloud context create <project-name>
  4. In the new project folder, you will find your kube.tf file; customize it to suit your needs. ✅

    A complete reference of all inputs, outputs, modules etc. can be found in the terraform.md file.

🎯 Installation

Now that you have your kube.tf file, along with the OS snapshot in your Hetzner project, you can start the installation process:

cd <your-project-folder>
terraform init --upgrade
terraform validate
terraform apply -auto-approve

It will take around 5 minutes to complete, and then you should see a green output confirming a successful deployment.

Once you start with Terraform, it's best not to change the state of the project manually via the Hetzner UI; otherwise, you may get an error when you try to run terraform again for that cluster (when trying to change the number of nodes for instance). If you want to inspect your Hetzner project, learn to use the hcloud cli.

Usage

When your brand-new cluster is up and running, the sky is the limit! 🎉

You can view all kinds of details about the cluster by running terraform output kubeconfig or terraform output -json kubeconfig | jq.

To manage your cluster with kubectl, you can either use SSH to connect to a control plane node or connect to the Kube API directly.

Connect via SSH

You can connect to one of the control-plane nodes via SSH with ssh root@<control-plane-ip> -i /path/to/private_key -o StrictHostKeyChecking=no. You can then use kubectl to manage your workloads right away. By default, the firewall allows SSH connections from everywhere; it's best to restrict that to your own IP by configuring firewall_ssh_source in your kube.tf file (don't worry, you can always change it later if your IP changes).

Connect via Kube API

If you have access to the Kube API (depending on the value of your firewall_kube_api_source variable; it's best to restrict it to your own IP rather than leaving it open to the world), you can kubectl into the cluster immediately, using the clustername_kubeconfig.yaml saved to the project's directory after installation, via kubectl --kubeconfig clustername_kubeconfig.yaml. For more convenience, either create a symlink from ~/.kube/config to clustername_kubeconfig.yaml, or add an export statement to your ~/.bashrc or ~/.zshrc file, as follows (you can get the path of clustername_kubeconfig.yaml by running pwd):

export KUBECONFIG=/<path-to>/clustername_kubeconfig.yaml

If you chose to set create_kubeconfig to false in your kube.tf (good practice), you can still create this file by running terraform output --raw kubeconfig > clustername_kubeconfig.yaml and then use it as described above.

You can also use it in an automated flow, in which case create_kubeconfig should be set to false, and you can use the kubeconfig output variable to get the kubeconfig file in a structured data format.
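For instance, assuming the module is declared as module "kube-hetzner" (adjust the name to match your kube.tf), re-exporting the kubeconfig as a root-level output could look like this sketch:

```hcl
# Sketch: expose the module's kubeconfig as a sensitive root-level output,
# so it can be consumed by `terraform output` or an automated pipeline.
output "kubeconfig" {
  value     = module.kube-hetzner.kubeconfig
  sensitive = true
}
```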

CNI

The default is Flannel, but you can also choose Calico or Cilium, by setting the cni_plugin variable in kube.tf to "calico" or "cilium".
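For example, switching the CNI is a one-line change in kube.tf:

```hcl
# In kube.tf: select the CNI plugin ("flannel" is the default).
cni_plugin = "cilium"
```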

Cilium

As Cilium has many interesting and powerful configuration possibilities, we give you the ability to configure it via the cilium_values helm variable (see the Cilium-specific helm values) before you deploy your cluster.

Cilium supports full kube-proxy replacement. Cilium runs by default in hybrid kube-proxy replacement mode. To achieve a completely kube-proxy-free cluster, set disable_kube_proxy = true.

It is also possible to enable Hubble using cilium_hubble_enabled = true. In order to access the Hubble UI, you need to port-forward the Hubble UI service to your local machine. By default, you can do this by running kubectl port-forward -n kube-system service/hubble-ui 12000:80 and then opening http://localhost:12000 in your browser. However, it is recommended to use the Cilium CLI and Hubble Client and run the cilium hubble ui command.

Scaling Nodes

Two things can be scaled: the number of nodepools or the number of nodes in these nodepools.

There are some limitations (mainly to scaling down) that you need to be aware of:

Once the cluster is up, you can change any nodepool count, and even set it to 0 (except for the first control-plane nodepool, where the minimum is 1). You can also rename a nodepool (provided its count is set to 0), but you should not remove a nodepool from the list once the cluster is up. That is due to how subnets and IPs get allocated. The only nodepools you can remove are those at the end of each list of nodepools.

However, you can freely add other nodepools at the end of each list. And for each nodepool, you can freely increase or decrease the node count. If you want to decrease a nodepool's node count, make sure you drain the nodes in question first (you can use terraform show to identify the node names at the end of the nodepool list); otherwise, removing undrained nodes could leave your cluster in a bad state. The only nodepool that must always have a count of at least 1 is the first control-plane nodepool.

An advanced use case is to replace the count of a nodepool with a map in which each key represents a single node. In this case, you can add and remove individual nodes from a pool by adding and removing their entries in this map, and you can set individual labels and other parameters on each node in the pool. See kube.tf.example for an example.

Autoscaling Node Pools

We support autoscaling node pools powered by the Kubernetes Cluster Autoscaler.

The feature is enabled by adding at least one map to the autoscaler_nodepools array. More on this in the corresponding section of kube.tf.example.
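As a sketch (the exact keys should be checked against kube.tf.example, which is the authoritative schema), an autoscaled pool might be declared like this:

```hcl
# Hypothetical example; verify key names against kube.tf.example.
autoscaler_nodepools = [
  {
    name        = "autoscaled-small"
    server_type = "cpx21"   # disk must be at least as large as the first control plane's
    location    = "fsn1"
    min_nodes   = 0
    max_nodes   = 5
  }
]
```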

Important to know: the autoscaled nodes are booted from a snapshot created from the initial control_plane. So please ensure that the disk of your chosen server type is at least as large as that of the first control_plane.

High Availability

By default, we have three control planes and three agents configured, with automatic upgrades and reboots of the nodes.

If you want to remain HA (no downtime), it's essential to keep at least three control-plane nodes (a minimum of two must remain up to maintain quorum when one goes down for automated upgrades and reboots); see Rancher's doc on HA.

Otherwise (with two or fewer control-plane nodes), it is essential to turn off automatic OS upgrades for the control-plane nodes (k3s can continue to update without issue) and do the maintenance yourself.

Automatic Upgrade

The Default Setting

By default, MicroOS gets upgraded automatically on each node, and nodes reboot safely via Kured, which is installed in the cluster.

As for k3s, it also upgrades automatically thanks to Rancher's system upgrade controller. By default, it follows the initial_k3s_channel, but you can also set it to stable, latest, or a more specific channel like v1.23 if needed, or specify a target version to upgrade to via the upgrade plan (this also allows for downgrades).

You can copy and modify the one in the templates for that! More on the subject in k3s upgrades.
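For example, pinning the channel is a single line in kube.tf:

```hcl
# Follow a specific minor-version channel instead of the default.
initial_k3s_channel = "v1.23"   # or "stable" / "latest"
```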

Configuring update timeframes

By default, a node that has installed updates will reboot within the next few minutes, and updates are installed roughly every 24 hours. Kured can be instructed with specific timeframes for rebooting, to prevent overly frequent drains and reboots. All options from the docs are available for modification.

โš ๏ธ Kured is also used to reboot nodes after configuration updates (registries.yaml, ...), so keep in mind that configuration changes can take some time to propagate!

Turning Off Automatic Upgrades

If you wish to turn off automatic MicroOS upgrades (important if you are not running an HA setup, which requires at least 3 control-plane nodes), you need to set:

automatically_upgrade_os = false

Alternatively, SSH into each node and issue the following command:

systemctl --now disable transactional-update.timer

If you wish to turn off automatic k3s upgrades, you need to set:

automatically_upgrade_k3s = false

Once disabled this way, you can selectively enable the upgrade by setting the node label k3s_upgrade=true, and later disable it by removing the label or setting it to false again.

# Enable upgrade for a node (use --all for all nodes)
kubectl label --overwrite node <node-name> k3s_upgrade=true

# Later disable upgrade by removing the label (use --all for all nodes)
kubectl label node <node-name> k3s_upgrade-

Alternatively, you can disable the k3s automatic upgrade without individually editing the labels on the nodes. Instead, you can just delete the two system controller upgrade plans with:

kubectl delete plan k3s-agent -n system-upgrade
kubectl delete plan k3s-server -n system-upgrade

Also, note that after turning off node upgrades, you will need to manually upgrade the nodes when needed. You can do so by SSH'ing into each node and running the following commands (and don't forget to drain the node before with kubectl drain <node-name>):

systemctl start transactional-update.service
reboot

Individual Components Upgrade

Rarely needed, but can be handy in the long run. During the installation, we automatically download a backup of the kustomization to a kustomization_backup.yaml file. You will find it next to your clustername_kubeconfig.yaml at the root of your project.

  1. First create a duplicate of that file and name it kustomization.yaml, keeping the original file intact, in case you need to restore the old config.
  2. Edit the kustomization.yaml file; you want to go to the very bottom where you have the links to the different source files; grab the latest versions for each on GitHub, and replace. If present, remove any local reference to traefik_config.yaml, as Traefik is updated automatically by the system upgrade controller.
  3. Apply the updated kustomization.yaml with kubectl apply -k ./.

Customizing the Cluster Components

Most cluster components of Kube-Hetzner are deployed with the Rancher Helm Chart yaml definition and managed by the Helm Controller inside k3s.

By default, we strive to give you optimal defaults, but if you wish, you can customize them.

For Traefik, Nginx, Rancher, Cilium, and Longhorn, for maximum flexibility, we give you the ability to configure them even further via helm values variables (e.g. cilium_values; see the advanced section in kube.tf.example for more).

Adding Extras

If you need to install additional Helm charts or Kubernetes manifests that are not provided by default, you can easily do so by using Kustomize. This is done by creating one or more extra-manifests/kustomization.yaml.tpl files beside your kube.tf.

These files need to be valid Kustomization manifests, and they additionally support Terraform templating! (The templating parameters can be passed to the module via the extra_kustomize_parameters variable, a map.)

All files in the extra-manifests directory, including the rendered versions of the *.yaml.tpl files, will be applied to k3s with kubectl apply -k (executed after, and independently of, the basic cluster configuration).
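A minimal sketch of such a template (here with a hypothetical namespace parameter supplied via extra_kustomize_parameters = { namespace = "my-apps" } in kube.tf):

```yaml
# extra-manifests/kustomization.yaml.tpl (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ${namespace}   # rendered by Terraform templating
resources:
  - my-deployment.yaml    # a plain manifest placed in the same directory
```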

See a working example in examples/kustomization_user_deploy.

You can use the above to pass all kinds of Kubernetes YAML configs, including HelmChart and/or HelmChartConfig definitions (see the previous section if you do not know what those are in the context of k3s).

That said, you can also use pure Terraform and import the kube-hetzner module as part of a larger project, and then use things like the Terraform helm provider to add additional stuff, all up to you!

Examples

Custom post-install actions

After the initial bootstrapping of your Kubernetes cluster, you might want to deploy applications using the same terraform mechanism. For many scenarios it is sufficient to create a kustomization.yaml.tpl file (see Adding Extras). All applied kustomizations will be applied at once by executing a single kubectl apply -k command.

However, some applications that provide custom CRDs (e.g. ArgoCD) need a different deployment strategy: one has to deploy the CRDs first and wait for that deployment to complete before installing the actual application. In the ArgoCD case, not waiting for the CRD setup to finish will cause failures. Therefore, an additional mechanism is available to support this kind of deployment. Specify extra_kustomize_deployment_commands in your kube.tf file, containing a series of commands to be executed after the Kustomization step has finished:

  extra_kustomize_deployment_commands = <<-EOT
    kubectl -n argocd wait --for condition=established --timeout=120s crd/appprojects.argoproj.io
    kubectl -n argocd wait --for condition=established --timeout=120s crd/applications.argoproj.io
    kubectl apply -f /var/user_kustomize/argocd-projects.yaml
    kubectl apply -f /var/user_kustomize/argocd-application-argocd.yaml
    ...
  EOT

Useful Cilium commands

With Kube-Hetzner, you have the possibility to use Cilium as a CNI. It's very powerful and has great observability features. Below you will find a few useful commands.

  • Check the status of cilium with the following commands (get the cilium pod name first and replace it in the command):
kubectl -n kube-system exec --stdin --tty cilium-xxxx -- cilium status
kubectl -n kube-system exec --stdin --tty cilium-xxxx -- cilium status --verbose
  • Monitor cluster traffic with:
kubectl -n kube-system exec --stdin --tty cilium-xxxx -- cilium monitor
  • See the list of kube services with:
kubectl -n kube-system exec --stdin --tty cilium-xxxx -- cilium service list

For more cilium commands, please refer to their corresponding Documentation.

Cilium Egress Gateway (via Floating IPs)

Cilium Egress Gateway provides the ability to control outgoing traffic from pods.

Using Floating IPs makes it possible to avoid the problem of primary IPs changing when a node in the cluster is recreated.

To implement the Cilium Egress Gateway feature, you need to define a separate nodepool with the setting floating_ip = true in the nodepool configuration parameter block.

Example nodepool configuration:

{
  name        = "egress",
  server_type = "cpx11",
  location    = "fsn1",
  labels = [
    "node.kubernetes.io/role=egress"
  ],
  taints = [
    "node.kubernetes.io/role=egress:NoSchedule"
  ],
  floating_ip = true
  count = 1
},

Configure Cilium:

locals {
  cluster_ipv4_cidr = "10.42.0.0/16"
}

cluster_ipv4_cidr = local.cluster_ipv4_cidr

cilium_values = <<EOT
ipam:
  mode: kubernetes
k8s:
  requireIPv4PodCIDR: true
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: "10.0.0.0/8"
endpointRoutes:
  enabled: true
loadBalancer:
  acceleration: native
bpf:
  masquerade: true
egressGateway:
  enabled: true
MTU: 1450
EOT

Deploy the K8S cluster infrastructure.

See the Cilium documentation for further steps (policy writing and testing): Writing egress gateway policies

There are three different ways to define egress policies in relation to the gateway node. You can specify the interface, the egress IP (Floating IP), or nothing, in which case the first IPv4 address of the interface of the default route is picked.

CiliumEgressGatewayPolicy example:

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: egress-sample
spec:
  selectors:
    - podSelector:
        matchLabels:
          org: empire
          class: mediabot
          io.kubernetes.pod.namespace: default

  destinationCIDRs:
    - "0.0.0.0/0"
  excludedCIDRs:
    - "10.0.0.0/8"

  egressGateway:
    nodeSelector:
      matchLabels:
        node.kubernetes.io/role: egress

    # Specify the IP address used to SNAT traffic matched by the policy.
    # It must exist as an IP associated with a network interface on the instance.
    egressIP: { FLOATING_IP }

Ingress with TLS

We advise you to use Cert-Manager, as it supports HA setups without requiring the enterprise version of Traefik. The reason is that, according to Traefik themselves, Traefik CE (Community Edition) is stateless, and it's not possible to run multiple instances of Traefik CE with Let's Encrypt enabled. That means you cannot have an HA ingress with Traefik if you use the Community Edition with the Let's Encrypt resolver activated. You could, however, use Traefik EE (Enterprise Edition) to achieve that. Long story short, if you are going to use Traefik CE (like most of us), you should use Cert-Manager to generate the certificates. Source here.

Via Cert-Manager (recommended)

Create your issuers as described here https://cert-manager.io/docs/configuration/acme/.

Then in your Ingress definition, just mentioning the issuer as an annotation and giving a secret name will take care of instructing Cert-Manager to generate a certificate for it! You just have to configure your issuer(s) first with the method of your choice. Detailed instructions on how to configure Cert-manager with Traefik can be found at https://traefik.io/blog/secure-web-applications-with-traefik-proxy-cert-manager-and-lets-encrypt/.

Ingress example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  tls:
    - hosts:
        - "*.example.com"
      secretName: example-com-letsencrypt-tls
  rules:
    - host: "*.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80

โš ๏ธ In case of using Ingress-Nginx as an ingress controller if you choose to use the HTTP challenge method you need to do an additional step of adding variable lb_hostname = "cluster.example.org" to your kube.tf. You must set it to an FQDN that points to your LB address.

This is to circumvent this known issue cert-manager/cert-manager/issues/466. Otherwise, you can just use the DNS challenge, which does not require any additional tweaks to work.

Create or delete a snapshot

Apart from the installation script, you can always create or delete the OS snapshot manually.

To create a snapshot, run the following command:

export HCLOUD_TOKEN=<your-token>
packer build ./packer-template/hcloud-microos-snapshots.pkr.hcl

To delete a snapshot, first find it with:

hcloud image list

Then delete it with:

hcloud image delete <image-id>

Single-node cluster

Running a development cluster on a single node without any high availability is also possible.

When doing so, automatically_upgrade_os should be set to false, especially if you have attached volumes, since the automatic reboots won't work properly. In this case, we don't deploy an external load balancer but use the default k3s service load balancer on the host itself, and open up ports 80 & 443 in the firewall (done automatically).
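A single-node setup might be sketched in kube.tf roughly as follows (nodepool keys as in kube.tf.example; verify against that file):

```hcl
# Hypothetical single-node sketch: one control plane, no agents, no OS auto-upgrade.
automatically_upgrade_os = false

control_plane_nodepools = [
  {
    name        = "control-plane",
    server_type = "cpx21",
    location    = "fsn1",
    labels      = [],
    taints      = [],
    count       = 1
  }
]

agent_nodepools = []
```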

Use in Terraform Cloud

You can use Kube-Hetzner on Terraform Cloud just as you would from a local deployment:

  1. Make sure you have the OS snapshot already created in your project (follow the installation script to achieve this).

  2. Use the contents of your public and private keys to configure ssh_public_key and ssh_private_key. Make sure the private key is not password protected. Since your private key is sensitive, it is recommended to add the keys as variables (make sure to mark the private key as a sensitive variable in Terraform Cloud!) and assign them in your kube.tf:

    ssh_public_key = var.ssh_public_key
    ssh_private_key = var.ssh_private_key

    Note: If you want to use a password-protected private key, you will have to point ssh_private_key to a file containing this key. You must host this file in an environment that you control, with an ssh-agent to decipher it for you. Hence, on Terraform Cloud, change the execution mode to local and run your own Terraform agent in this environment.
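The two variables referenced above could be declared like this (a sketch; assign their values in the Terraform Cloud workspace and mark the private key as sensitive there):

```hcl
variable "ssh_public_key" {
  type        = string
  description = "Public SSH key content for the cluster nodes."
}

variable "ssh_private_key" {
  type        = string
  sensitive   = true
  description = "Private SSH key content; must not be password protected."
}
```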

Configure add-ons with HelmChartConfig

For instance, to customize the Rancher install, if you choose to enable it, you can create and apply the following HelmChartConfig:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher
  namespace: kube-system
spec:
  valuesContent: |-
    **values.yaml content you want to customize**

The helm options for Rancher can be seen here https://github.com/rancher/rancher/blob/release/v2.6/chart/values.yaml.

The same goes for all add-ons, like Longhorn, Cert-manager, and Traefik.

Encryption at rest with HCloud CSI

The easiest way to get encrypted volumes working is to use the encryption functionality of the hcloud CSI driver itself; see hetznercloud/csi-driver.

For this, you just need to create a secret containing the encryption key:

apiVersion: v1
kind: Secret
metadata:
  name: encryption-secret
  namespace: kube-system
stringData:
  encryption-passphrase: foobar

And to create a new storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hcloud-volumes-encrypted
provisioner: csi.hetzner.cloud
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  csi.storage.k8s.io/node-publish-secret-name: encryption-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
Encryption at rest with Longhorn

To get started, use a cluster-wide key for all volumes like this:

apiVersion: v1
kind: Secret
metadata:
  name: longhorn-crypto
  namespace: longhorn-system
stringData:
  CRYPTO_KEY_VALUE: "I have nothing to hide."
  CRYPTO_KEY_PROVIDER: "secret"
  CRYPTO_KEY_CIPHER: "aes-xts-plain64"
  CRYPTO_KEY_HASH: "sha256"
  CRYPTO_KEY_SIZE: "256"
  CRYPTO_PBKDF: "argon2i"

And create a new storage class:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-crypto-global
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  nodeSelector: "node-storage"
  numberOfReplicas: "1"
  staleReplicaTimeout: "2880" # 48 hours in minutes
  fromBackup: ""
  fsType: ext4
  encrypted: "true"
  # global secret that contains the encryption key that will be used for all volumes
  csi.storage.k8s.io/provisioner-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/provisioner-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-publish-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-publish-secret-namespace: "longhorn-system"
  csi.storage.k8s.io/node-stage-secret-name: "longhorn-crypto"
  csi.storage.k8s.io/node-stage-secret-namespace: "longhorn-system"

For more details, see Longhorn's documentation.

Assign all pods in a namespace to either arm64 or amd64 nodes with admission controllers

To enable the PodNodeSelector and optionally the PodTolerationRestriction api modules, set the following value:

k3s_exec_server_args = "--kube-apiserver-arg enable-admission-plugins=PodTolerationRestriction,PodNodeSelector"

Next, you can set default nodeSelector values per namespace. This lets you assign namespaces to specific nodes. Note, though, that this acts as the default as well as the whitelist, so if a pod sets its own nodeSelector value, it must be a subset of the default; otherwise, the pod will not be scheduled.

Then set the according annotations on your namespaces:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: kubernetes.io/arch=amd64
  name: this-runs-on-amd64

or with taints and tolerations:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/node-selector: kubernetes.io/arch=arm64
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{ "operator" : "Equal", "effect" : "NoSchedule", "key" : "workload-type", "value" : "machine-learning" }]'
  name: this-runs-on-arm64

This can be helpful when you set up a mixed-architecture cluster, and there are many other use cases.

Backup and restore a cluster

K3s allows for automated etcd backups to S3. Etcd is the default storage backend in Kube-Hetzner, even for a single control-plane cluster, hence this should work for all cluster deployments.

For backup do:

  1. Fill in the etcd_s3_backup config in your kube.tf; it will trigger regular automated backups to S3.
  2. Add the k3s_token as an output to your kube.tf:
output "k3s_token" {
  value     = module.kube-hetzner.k3s_token
  sensitive = true
}
  3. Make sure you can access the k3s_token via terraform output k3s_token.
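The etcd_s3_backup map mentioned in step 1 could be sketched like this (the keys mirror k3s's etcd-s3 flags; check kube.tf.example for the exact supported keys):

```hcl
# Hypothetical sketch of the S3 backup configuration.
etcd_s3_backup = {
  "etcd-s3-endpoint"   = "s3.example.com"
  "etcd-s3-bucket"     = "my-cluster-backups"
  "etcd-s3-region"     = "eu-central-1"
  "etcd-s3-access-key" = "your-s3-access-key"
  "etcd-s3-secret-key" = var.etcd_s3_secret_key   # pass as a sensitive variable
}
```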

For restoration do:

  1. Before cluster creation, add the following to your kube.tf. Replace the local variables to match your values.
locals {
  # ...

  k3s_token = var.k3s_token  # this is secret information, hence it is passed as an environment variable

  # to get the corresponding etcd_version for a k3s version you need to
  # - start k3s or have it running
  # - run `curl -L --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --cert /var/lib/rancher/k3s/server/tls/etcd/server-client.crt --key /var/lib/rancher/k3s/server/tls/etcd/server-client.key https://127.0.0.1:2379/version`
  # for details see https://gist.github.com/superseb/0c06164eef5a097c66e810fe91a9d408
  etcd_version = "v3.5.9"

  etcd_snapshot_name = "name-of-the-snapshot(no-path,just-the-name)"
  etcd_s3_endpoint = "your-s3-endpoint(without-https://)"
  etcd_s3_bucket = "your-s3-bucket"
  etcd_s3_access_key = "your-s3-access-key"
  etcd_s3_secret_key = var.etcd_s3_secret_key  # this is secret information, hence it is passed as an environment variable

  # ...
}

variable "k3s_token" {
  sensitive = true
  type      = string
}

variable "etcd_s3_secret_key" {
  sensitive = true
  type      = string
}

module "kube-hetzner" {
  # ...

  k3s_token = local.k3s_token

  # ...

  postinstall_exec = [
    (
      local.etcd_snapshot_name == "" ? "" :
      <<-EOF
      export CLUSTERINIT=$(cat /etc/rancher/k3s/config.yaml | grep -i '"cluster-init": true')
      if [ -n "$CLUSTERINIT" ]; then
        echo indeed this is the first control plane node > /tmp/restorenotes
        k3s server \
          --cluster-reset \
          --etcd-s3 \
          --cluster-reset-restore-path=${local.etcd_snapshot_name} \
          --etcd-s3-endpoint=${local.etcd_s3_endpoint} \
          --etcd-s3-bucket=${local.etcd_s3_bucket} \
          --etcd-s3-access-key=${local.etcd_s3_access_key} \
          --etcd-s3-secret-key=${local.etcd_s3_secret_key}
        # renaming the k3s.yaml because it is used as a trigger for further downstream
        # changes. Better to let `k3s server` create it as expected.
        mv /etc/rancher/k3s/k3s.yaml /etc/rancher/k3s/k3s.backup.yaml

        # download etcd/etcdctl for adapting the kubernetes config before starting k3s
        ETCD_VER=${local.etcd_version}
        case "$(uname -m)" in
            aarch64) ETCD_ARCH="arm64" ;;
            x86_64) ETCD_ARCH="amd64" ;;
        esac;
        DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        curl -L $DOWNLOAD_URL/$ETCD_VER/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -o /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz
        tar xzvf /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz -C /usr/local/bin --strip-components=1
        rm -f /tmp/etcd-$ETCD_VER-linux-$ETCD_ARCH.tar.gz

        etcd --version
        etcdctl version

        # start etcd server in the background
        nohup etcd --data-dir /var/lib/rancher/k3s/server/db/etcd &
        echo $! > save_pid.txt

        # delete traefik service so that no load-balancer is accidentally changed
        etcdctl del /registry/services/specs/traefik/traefik
        etcdctl del /registry/services/endpoints/traefik/traefik

        # delete old nodes (they interfere with load balancer)
        # minions is the old name for "nodes"
        OLD_NODES=$(etcdctl get "" --prefix --keys-only | grep /registry/minions/ | cut -c 19-)
        for NODE in $OLD_NODES; do
          for KEY in $(etcdctl get "" --prefix --keys-only | grep $NODE); do
            etcdctl del $KEY
          done
        done

        kill -9 `cat save_pid.txt`
        rm save_pid.txt
      else
        echo this is not the first control plane node > /tmp/restorenotes
      fi
      EOF
    )
  ]
  # ...
}
  2. Set the following sensitive environment variables:

    • export TF_VAR_k3s_token="..." (Be careful, this token is like an admin password to the entire cluster. You need to use the same k3s_token which you saved when creating the backup.)
    • export TF_VAR_etcd_s3_secret_key="..."
  3. Create the cluster as usual. You can also change the cluster name and deploy it next to the original backed-up cluster.
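Putting the last two steps together, a restore run looks roughly like this (the token and key values are placeholders; the k3s_token must be the one saved from the backed-up cluster):

```shell
# Both variables are picked up by Terraform via the TF_VAR_ prefix.
export TF_VAR_k3s_token="token-saved-from-the-backed-up-cluster"
export TF_VAR_etcd_s3_secret_key="your-s3-secret-key"

# Then create the cluster as usual:
# terraform init && terraform apply
```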

Awesome! You restored a whole cluster from a backup.

Deploy in a pre-constructed private network (for proxies etc.)

If you want to deploy other machines on the private network before deploying the k3s cluster, you can. One use case is setting up a proxy or a NAT router on the private network, which the k3s cluster already needs at the time of construction.

It is important to get all the address ranges right in this case, although the number of changes needed is minimal. If your network is created with 10.0.0.0/8, and you use subnet 10.128.0.0/9 for your non-k3s business, then adapting network_ipv4_cidr = "10.0.0.0/9" should be all you need.

For example:

resource "hcloud_network" "k3s_proxied" {
  name     = "k3s-proxied"
  ip_range = "10.0.0.0/8"
}

resource "hcloud_network_subnet" "k3s_proxy" {
  network_id   = hcloud_network.k3s_proxied.id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.128.0.0/9"
}
resource "hcloud_server" "your_proxy_server" {
  ...
}
resource "hcloud_server_network" "your_proxy_server" {
  depends_on = [
    hcloud_server.your_proxy_server
  ]
  server_id  = hcloud_server.your_proxy_server.id
  network_id = hcloud_network.k3s_proxied.id
  ip         = "10.128.0.1"
}
module "kube-hetzner" {
  ...
  existing_network_id = [hcloud_network.k3s_proxied.id]
  network_ipv4_cidr = "10.0.0.0/9"
  additional_k3s_environment = {
    "http_proxy" : "http://10.128.0.1:3128",
    "HTTP_PROXY" : "http://10.128.0.1:3128",
    "HTTPS_PROXY" : "http://10.128.0.1:3128",
    "CONTAINERD_HTTP_PROXY" : "http://10.128.0.1:3128",
    "CONTAINERD_HTTPS_PROXY" : "http://10.128.0.1:3128",
    "NO_PROXY" : "127.0.0.0/8,10.0.0.0/8,",
  }
}

NOTE the square brackets in existing_network_id: it must be a list of length 1.

Placement groups

Up until release v2.11.8, there was an implementation error in the placement group logic.

If you have fewer than 10 agent nodes and fewer than 10 control-plane nodes, you can continue using the code as is.

If you have a single pool with a count >= 10, you could previously only work around this with the global setting in kube.tf:

placement_group_disable = true

Now you can assign each nodepool to its own placement group, preferably using named groups:

  agent_nodepools = [
    {
      ...
      placement_group = "special"
    },
  ]

You can also continue using the previous code-base like this:

  agent_nodepools = [
    {
      ...
      placement_group_compat_idx = 1
    },
  ]

Finally, if you want to have a node-pool with more than 10 nodes, you have to use the map-based node definition and assign individual nodes to groups:

  agent_nodepools = [
    {
      ...
      nodes = {
        "0" : {
          placement_group = "pg-1",
        },
        ...
        "30" : {
          placement_group = "pg-2",
        },
      }
    },
  ]
Migrating from count-based nodepools to map-based

Migrating from count to map-based nodes is easy, but it is crucial that you set append_index_to_node_name to false, otherwise the nodes get replaced. The default for newly added nodes is true, so you can easily map between your nodes and your kube.tf file.

  agent_nodepools = [
    {
      name        = "agent-large",
      server_type = "cpx21",
      location    = "nbg1",
      labels      = [],
      taints      = [],
      # count       = 2
      nodes = {
        "0" : {
          append_index_to_node_name = false,
          labels = ["my.extra.label=special"],
          placement_group = "agent-large-pg-1",
        },
        "1" : {
          append_index_to_node_name = false,
          server_type = "cpx31",
          labels = ["my.extra.label=slightlybiggernode"]
          placement_group = "agent-large-pg-2",
        },
      }
    },
  ]
Use of delete protection

The delete protection feature in Hetzner Cloud can be used to protect resources from deletion by putting a "lock" on them.

Please note that this does not protect against deletion by Terraform itself, as the provider will lift the lock in that case. The resources are only protected from deletion via the Hetzner Cloud Console or API.

The following resources support delete protection, which is set to false by default:

  • Floating IPs
  • Load Balancers
  • Volumes (used by Longhorn)

An example scenario: you want to make sure you keep a floating IP that is whitelisted in some firewall, so you don't lose access to certain resources or have to wait for a new IP to be whitelisted. This is how you can enable delete protection for floating IPs in terraform.tfvars:

enable_delete_protection = {
  floating_ip = true
}
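If you want to lock all supported resource types at once, the same map presumably accepts one key per type. The key names below are a sketch based on the resource list above; verify them against your module version's variables:

```hcl
enable_delete_protection = {
  floating_ip   = true
  load_balancer = true
  volume        = true
}
```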

Debugging

First and foremost, it depends, but it's always good to be able to take a quick look at Hetzner without logging in to the UI. That is where the hcloud CLI comes in.

  • Activate it with hcloud context create Kube-hetzner; it will prompt for your Hetzner API token, paste that, and hit enter.
  • To check the nodes, if they are running, use hcloud server list.
  • To check the network, use hcloud network describe k3s.
  • To look at the LB, use hcloud loadbalancer describe k3s-traefik.

Then, for the rest, you'll often need to log in to your cluster via SSH. To do that, use:

ssh root@<control-plane-ip> -i /path/to/private_key -o StrictHostKeyChecking=no

Then, for control-plane nodes, use journalctl -u k3s to see the k3s logs, and for agents, use journalctl -u k3s-agent instead.

Inspect the value of the k3s config.yaml file with: cat /etc/rancher/k3s/config.yaml, see if it looks kosher.

Last but not least, to see when the previous reboot took place, you can use both last reboot and uptime.

Takedown

If you want to take down the cluster, you can proceed as follows:

terraform destroy -auto-approve

If you see the destroy hanging, it's probably because of the Hetzner LB and the autoscaled nodes. You can use the following command to delete everything (a dry-run option is available, don't worry, and it will only delete resources specific to your cluster):

tmp_script=$(mktemp) && curl -sSL -o "${tmp_script}" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/cleanup.sh && chmod +x "${tmp_script}" && "${tmp_script}" && rm "${tmp_script}"

As a one-time setup, for convenience, you can also save it as an alias in your shell config file, like so:

alias cleanupkh='tmp_script=$(mktemp) && curl -sSL -o "${tmp_script}" https://raw.githubusercontent.com/kube-hetzner/terraform-hcloud-kube-hetzner/master/scripts/cleanup.sh && chmod +x "${tmp_script}" && "${tmp_script}" && rm "${tmp_script}"'

Careful: the above commands will delete everything, including volumes, in your project. You can always try a dry run first; the script will give you that option.

Upgrading the Module

Usually, you will want to upgrade the module in your project to the latest version. Just change the version attribute in your kube.tf and terraform apply. This will upgrade the module to the latest version.
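For example, assuming you consume the module from the Terraform registry (the version number below is illustrative):

```hcl
module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.12.0" # bump this, then run `terraform init -upgrade && terraform apply`
  # ...
}
```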

When moving from 1.x to 2.x:

  • Within your project folder, run the createkh installation command (see the Do Not Skip section above). This will create the snapshot for you. Don't worry, it's non-destructive and will leave your kube.tf and terraform state alone, but it will download the additional required Packer file.
  • Then modify your kube.tf to use version >= 2.0, and remove the extra_packages_to_install and opensuse_microos_mirror_link variables if used. This functionality has moved to the Packer snapshot definition; see packer-template/hcloud-microos-snapshots.pkr.hcl.
  • Then run terraform init -upgrade && terraform apply.

Contributing

🌱 This project currently installs openSUSE MicroOS via the Hetzner rescue mode, which makes things a few minutes slower. To help with that, you could take a few minutes to send a support request to Hetzner, asking them to please add openSUSE MicroOS as a default image, not just an ISO. The more requests they receive, the likelier they are to add support for it, and if they do, that will cut the deployment time in half. The official link to openSUSE MicroOS is https://get.opensuse.org/microos, and their OpenStack Cloud image has full support for cloud-init, which would probably suit the Hetzner Ops team very well!

Code contributions are very much welcome.

  1. Fork the Project

  2. Create your Branch (git checkout -b AmazingFeature)

  3. Develop your feature

    In your kube.tf, point the source of module to your local clone of the repo.

    Useful commands:

    # To cleanup a Hetzner project
    ../kube-hetzner/scripts/cleanup.sh
    
    # To build the Packer image
    packer build ../kube-hetzner/packer-template/hcloud-microos-snapshots.pkr.hcl
  4. Update examples in kube.tf.example if required.

  5. Commit your Changes (git commit -m 'Add some AmazingFeature')

  6. Push to the Branch (git push origin AmazingFeature)

  7. Open a Pull Request targeting the staging branch.

Acknowledgements

  • k-andy was the starting point for this project. It wouldn't have been possible without it.
  • Best-README-Template made writing this readme a lot easier.
  • Hetzner Cloud for providing a solid infrastructure and terraform package.
  • Hashicorp for the amazing terraform framework that makes all the magic happen.
  • Rancher for k3s, an amazing Kube distribution that is the core engine of this project.
  • openSUSE for MicroOS, which is just next-level Container OS technology.

terraform-hcloud-kube-hetzner's People

Contributors

aleksasiriski, batthebee, dependabot[bot], dhoppe, github-actions[bot], ianwesleyarmstrong, ifeulner, jodhi, karbowiak, m4t7e, michaelsstuff, mnbro, mnencia, mysticaltech, nupplaphil, otavio, oujonny, phaer, purplebooth, relativesure, ricristian, schlichtanders, silvest89, strowi, thomasprade, timheckel, trivoallan, vafgoettlich, valkenburg-prevue-ch, wuppie007


terraform-hcloud-kube-hetzner's Issues

First control plane is not responding after post microOS_install_commands reboot

I've set up terraform.tfvars with my hcloud_token, public_key, private_key, and:

location                  = "nbg1" # change to `ash` for us-east Ashburn, Virginia location
network_region            = "eu-central" # change to `us-east` if location is ash
agent_server_type         = "cx11"
control_plane_server_type = "cpx11"
lb_server_type            = "lb11"

And as a result of terraform apply I can see:

hcloud_server.first_control_plane (local-exec): Executing: ["/bin/sh" "-c" "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /Users/....user_name..../.ssh/id_rsa root@...IP... '(sleep 2; reboot)&'; sleep 3"]
hcloud_server.first_control_plane (local-exec): Warning: Permanently added '...IP...' (ED25519) to the list of known hosts.
hcloud_server.first_control_plane (local-exec): Connection to ...IP... closed by remote host.
hcloud_server.first_control_plane: Still creating... [2m10s elapsed]
hcloud_server.first_control_plane: Provisioning with 'local-exec'...
hcloud_server.first_control_plane (local-exec): Executing: ["/bin/sh" "-c" "until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /Users/....user_name..../.ssh/id_rsa -o ConnectTimeout=2 root@...IP... true 2> /dev/null\ndo\n  echo \"Waiting for MicroOS to reboot and become available...\"\n  sleep 3\ndone\n"]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
[... the same "Waiting for MicroOS..." and "Still creating..." lines repeat until 3m30s elapsed ...]
hcloud_server.first_control_plane: Provisioning with 'file'...
hcloud_server.first_control_plane: Still creating... [3m40s elapsed]
[... "Still creating..." lines repeat every 10 seconds until 8m30s elapsed ...]
โ•ท
โ”‚ Error: file provisioner error
โ”‚
โ”‚   with hcloud_server.first_control_plane,
โ”‚   on master.tf line 55, in resource "hcloud_server" "first_control_plane":
โ”‚   55:   provisioner "file" {
โ”‚
โ”‚ timeout - last error: dial tcp ...IP...:22: connect: operation timed out
โ•ต

I've tried to ssh to this machine:

$ ssh root@...IP... -o StrictHostKeyChecking=no
ssh: connect to host ...IP... port 22: Operation timed out

and tried again after a manual restart of the machine from the Hetzner web console; unfortunately, it looks like the machine is not responding.

Do you have an idea how to diagnose it or what could be the problem?

Thanks :)

How do NodePort Services work?

Hi there,

thank you very much for your awesome help in my last ticket regarding the TLS configuration. It works now :)

Now im wondering, how exposing a service with a NodePort works. If I understand it correctly the firewall and loadbalancer should get configured to forward the port once I create a NodePort Service in my cluster, but that does not happen.

Is my assumption correct? Is NodePort not possible with kube-hetzner?

Thank you very much

Decide which versioning scheme to use

Recently, the master branch has been quite unstable and required users to either stay on an unsupported, older commit of kube-hetzner or to re-provision their whole cluster.

I believe that the time would be right to agree on a versioning scheme and implement at least a minimum of a release process to communicate breaking changes more clearly.

My proposal would be to just use https://semver.org/ and start to define a process after which we could release a 1.0.0 (or 0.1.0 if you prefer ;). Ideally we would end up with a git tag, an auto-generated github release & a ready-to-use module published on registry.terraform.io.

I also started a GitHub project regarding the whole thing, you can find it linked in the sidebar of this issue or at https://github.com/orgs/kube-hetzner/projects/1

eager to hear what @mysticaltech, @mnencia and others are thinking!

Control-plane-0 should be tainted with 'node-role.kubernetes.io/master=true:NoSchedule'

I suggest tainting the master node (control-plane-0) with node-role.kubernetes.io/master=true:NoSchedule, as it is currently not tainted.
I.e. adding

  # Taint Control Plane as master node to avoid scheduling of workloads here
  provisioner "local-exec" {
    command = <<-EOT
      kubectl taint nodes "${self.name}" node-role.kubernetes.io/master=true:NoSchedule
    EOT
  }

to master.tf

Because it is not tainted, workload pods can get placed on control-plane-0 and disrupt the cluster
(e.g. a Cassandra or HDFS pod placed on control-plane-0 is almost guaranteed to disrupt the cluster).

terraform initialization stops after Install MicroOS and restart

I've set up terraform.tfvars with my hcloud_token, public_key, private_key, and:

location                  = "nbg1" # change to `ash` for us-east Ashburn, Virginia location
network_region            = "eu-central" # change to `us-east` if location is ash
agent_server_type         = "cx11"
control_plane_server_type = "cpx11"
lb_server_type            = "lb11"

servers_num               = 1
agents_num                = 0
# only one server is chosen because all servers have the same issue, so I tried to focus on only one

Just after the successful steps in host/main.tf:

  # Install MicroOS
  # Issue a reboot command

I can see:

module.first_control_plane.hcloud_server.server (local-exec): Executing: ["/bin/sh" "-c" "until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /Users/drackowski/.ssh/id_rsa -o ConnectTimeout=2 root@<IP> true 2> /dev/null\ndo\n  echo \"Waiting for MicroOS to reboot and become available...\"\n  sleep 3\ndone\n"]
module.first_control_plane.hcloud_server.server (local-exec): Waiting for MicroOS to reboot and become available...
module.first_control_plane.hcloud_server.server: Still creating... [2m10s elapsed]
module.first_control_plane.hcloud_server.server (local-exec): Waiting for MicroOS to reboot and become available...
...
...
...
module.first_control_plane.hcloud_server.server: Provisioning with 'remote-exec'...
module.first_control_plane.hcloud_server.server (remote-exec): Connecting to remote host via SSH...
module.first_control_plane.hcloud_server.server (remote-exec):   Host: .......
module.first_control_plane.hcloud_server.server (remote-exec):   User: root
module.first_control_plane.hcloud_server.server (remote-exec):   Password: false
module.first_control_plane.hcloud_server.server (remote-exec):   Private key: true
module.first_control_plane.hcloud_server.server (remote-exec):   Certificate: false
module.first_control_plane.hcloud_server.server (remote-exec):   SSH Agent: true
module.first_control_plane.hcloud_server.server (remote-exec):   Checking Host Key: false
module.first_control_plane.hcloud_server.server (remote-exec):   Target Platform: unix

Error response is:

โ•ท
โ”‚ Error: remote-exec provisioner error
โ”‚
โ”‚   with module.first_control_plane.hcloud_server.server,
โ”‚   on modules/host/main.tf line 60, in resource "hcloud_server" "server":
โ”‚   60:   provisioner "remote-exec" {
โ”‚
โ”‚ timeout - last error: dial tcp ........:22: i/o timeout
โ•ต

When I try to ssh to this machine, it does not respond ("timed out" after a long time).

On server console I can see welcome message from openSUSE with few SSH host keys and "static login: "
Screenshot 2022-02-21 at 21 32 47

Do you know what else I can check to diagnose what happened? 😅

The full terraform apply console output is attached: full console output.txt

Disable Traefik ingress controller & lb

Hi guys
Thanks to all contributors to this amazing project!

Is it possible to disable the Traefik ingress controller entirely? I would prefer using nginx-ingress or istio-gateway as the ingress solution. In that case, I wouldn't need the load balancer or the Traefik installation.

I'm also open to contributing such a toggle function but would need some inputs on how to implement it the best way. :)

cheers,
Johann Schley

When using allow_scheduling_on_control_plane=true, lb does not point to control planes

What I did:

git clone https://github.com/kube-hetzner/kube-hetzner.git
git checkout staging

export TF_VAR_hcloud_token="my-hcloud-token"

Create a terraform.tfvars file:

# You need to replace these
public_key   = "~/.ssh/id_ed25519.pub"
# Must be "private_key = null" when you want to use ssh-agent, for a Yubikey like device auth or an SSH key-pair with passphrase
private_key  = "~/.ssh/id_ed25519"

# These can be customized, or left with the default values
# For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
# For Hetzner server types see https://www.hetzner.com/cloud
location                  = "nbg1" # change to `ash` for us-east Ashburn, Virginia location
network_region            = "eu-central" # change to `us-east` if location is ash
agent_server_type         = "cpx21"
control_plane_server_type = "cpx31"
lb_server_type            = "lb11"
servers_num               = 3
agents_num                = 0

# If you want to use a specific Hetzner CCM and CSI version, set them below, otherwise leave as is for the latest versions
# hetzner_ccm_version = ""
# hetzner_csi_version = ""

# If you want to kustomize the Hetzner CCM and CSI containers with the "latest" tags and imagePullPolicy Always, 
# to have them automatically update when the nodes themselves get updated via the Rancher system upgrade controller, the default is "false".
# If you choose to keep the default of "false", you can always use ArgoCD to monitor the CSI and CCM manifest for new releases,
# that is probably the more "vanilla" option to keep these components always updated. 
# hetzner_ccm_containers_latest = true
# hetzner_csi_containers_latest = true

# If you want to use letsencrypt with tls Challenge, the email address is used to send you certificates expiration notices
traefik_acme_tls = true
traefik_acme_email = "<your-email>"

# If you want to allow non-control-plane workloads to run on the control-plane nodes set "true" below. The default is "false".
allow_scheduling_on_control_plane = true

What happened:

All resources are properly scheduled, but the load balancer does not point to the control planes.

(screenshot: load balancer overview)

What I expect:

I expect a reference from the load balancer to the control planes.

4 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}

Hello, after a fresh terraform apply at Hetzner, all my pods are stuck in a not-ready state with the following error message:

0/4 nodes are available: 4 node(s) had taint {node.cloudprovider.kubernetes.io/uninitialized: true}, that the pod didn't tolerate.

Running the command:
kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized-
fixed the problem, but the load balancer is not created.

Do you know how I can fix this, please?

Request: enable IPv6 on Loadbalancer

Hi,

It would be great if we could enable IPv6 on the load balancer via a variable and get the v6 address just like the v4 one (hcloud_load_balancer.traefik.ipv6).

Thanks
and great project!

feature request: custom firewall

Just a suggestion (but a really nice to have):

Feature Custom Firewalls

In order to keep variables where they should be and never touch the .tf files,
it would be nice if there were a place where custom ports can be managed.

Currently I do this in main.tf:

resource "hcloud_firewall" "k3s" {
  name = "k3s-firewall"

  ## My custom firewall rule
  # Postgres
  rule {
    direction = "out"
    protocol  = "tcp"
    port      = "5432"
    destination_ips = [
      "0.0.0.0/0"
    ]
  }
}

Maybe that part could be "outsourced" into a firewall.tf file or a nested array in the variables.

edit: removed "multiple feature" like mentioned in the first comment

timeout - last error: dial tcp IP:22: i/o timeout

Hello, when I run the following command:
terraform apply -auto-approve,
I have this error:
(error screenshot)
This is my configuration:

# Only the first values starting with a * are obligatory, the rest can remain with their default values, or you
# could adapt them to your needs.
#
# Note that some values, notably "location" and "public_key", have no effect after the initial cluster has been set up.
# This is in order to keep terraform from re-provisioning all nodes at once, which would lose data. If you want to update
# those, you should instead change the value here and then manually re-provision each node one-by-one. Grep for "lifecycle".

# * Your Hetzner project API token 
hcloud_token = "<redacted>"
# * Your public key
public_key = "id_rsa.pub"
# * Your private key, must be "private_key = null" when you want to use ssh-agent, for a Yubikey like device auth or an SSH key-pair with passphrase
private_key = "id_rsa"

# These can be customized, or left with the default values
# For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
# For Hetzner server types see https://www.hetzner.com/cloud
location       = "fsn1"       # change to `ash` for us-east Ashburn, Virginia location
network_region = "eu-central" # change to `us-east` if location is ash

# You can have up to as many subnets as you want (preferably if the form of 10.X.0.0/16),
# their primary use is to logically separate the nodes.
# The control_plane network is mandatory.
network_ipv4_subnets = {
  control_plane = "10.1.0.0/16"
  agent_big     = "10.2.0.0/16"
  agent_small   = "10.3.0.0/16"
}

# At least 3 server nodes are recommended for HA; otherwise you need to turn off automatic upgrades (see README).
# As per the Rancher docs, it must always be an odd number, never even! See https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/
# For instance, 1 is ok (non-HA), 2 is not ok, 3 is ok (becomes HA).
control_plane_count = 3

# The type of control plane nodes, see https://www.hetzner.com/cloud, the minimum instance supported is cpx11 (just a few cents more than cx11)
control_plane_server_type = "cpx11"

# As for the agent nodepools, below is just an example. If you do not want multiple nodepools, just use one
# and change its name to whatever you want; it need not be "agent-big" or "agent-small". Also give each the subnet you prefer.
# For single-node clusters, set this equal to {}.
agent_nodepools = {
  # agent-big = {
  #   server_type = "cpx21",
  #   count       = 1,
  #   subnet      = "agent_big",
  # }
  agent-small = {
    server_type = "cpx11",
    count       = 2,
    subnet      = "agent_small",
  }
}

# That will depend on how much load you want it to handle, see https://www.hetzner.com/cloud/load-balancer
load_balancer_type = "lb11"

### The following values are fully optional

# It's best to leave the network range as is, unless you know what you are doing. The default is "10.0.0.0/8".
# network_ipv4_range = "10.0.0.0/8"

# If you want to use a specific Hetzner CCM and CSI version, set them below, otherwise leave as is for the latest versions
# hetzner_ccm_version = ""
# hetzner_csi_version = ""

# If you want to use Let's Encrypt with the TLS challenge; the email address is used to send you certificate expiration notices.
traefik_acme_tls   = true
traefik_acme_email = "xxx"

# If you want to allow non-control-plane workloads to run on the control-plane nodes set "true" below. The default is "false".
# Also good for single node clusters.
/* allow_scheduling_on_control_plane = true */

# If you want to disable automatic upgrade of k3s, you can set this to false, default is "true".
# automatically_upgrade_k3s = false

# Allows you to specify either stable, latest, or testing (defaults to stable), see https://rancher.com/docs/k3s/latest/en/upgrades/basic/
# initial_k3s_channel = "latest"

# Adding extra firewall rules, like opening a port
# In this example we allow TCP port 5432 for a Postgres service that we will expose via a NodePort.
# More info on the format here: https://registry.terraform.io/providers/hetznercloud/hcloud/latest/docs/resources/firewall
# extra_firewall_rules = [
#   {
#     direction = "in"
#     protocol  = "tcp"
#     port      = "5432"
#     source_ips = [
#       "0.0.0.0/0"
#     ]
#   },
# ]

Do you know how I can fix this problem, please?
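When `terraform apply` times out on port 22, it usually helps to first confirm plain SSH reachability outside of Terraform. A debugging sketch (the IP placeholder is whichever node the provisioner is stuck on; this is generic SSH troubleshooting, not a fix specific to this module):

```shell
# Verbose SSH attempt against the node the provisioner cannot reach:
ssh -vvv -i id_rsa -o ConnectTimeout=10 root@<node-ip>
# If this also times out, check the Hetzner firewall rules for inbound TCP 22
# and whether the node finished its first reboot.
```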

Why are servers not put into a placement group?

When I set up a cluster, it seems the servers are not created within a placement group of type "spread".
This should be common practice, though, to maximize availability should a host machine fail.

There is a fairly recent tutorial on Hetzner Community that mentions this.
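For reference, a spread placement group in the hcloud provider looks roughly like the sketch below (resource and server names are illustrative, not this module's actual code):

```hcl
# A "spread" group asks Hetzner to keep members on distinct physical hosts.
resource "hcloud_placement_group" "k3s" {
  name = "k3s-placement-group"
  type = "spread"
}

resource "hcloud_server" "node" {
  name               = "k3s-agent-0"
  server_type        = "cpx11"
  image              = "ubuntu-20.04"
  placement_group_id = hcloud_placement_group.k3s.id
}
```

Note that spread groups have a member limit, so very large clusters would need one group per nodepool.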

Error: hcloud/inlineAttachServerToNetwork

One of the nodes was tainted. I tried to reapply Terraform via tf apply.
That didn't work, so I deleted the node (agent-3) completely in the Hetzner UI
and tried a tf plan and tf apply --auto-approve


hcloud_server.agents[3]: Creating...
╷
│ Error: hcloud/inlineAttachServerToNetwork: attach server to network: provided IP is not available (ip_not_available)
│
│   with hcloud_server.agents[3],
│   on agents.tf line 1, in resource "hcloud_server" "agents":
│    1: resource "hcloud_server" "agents" {

The agent-3 server was created but not attached to Kubernetes.

After that I tried to increase the number of agent nodes from 3 to 5.
Node agent-4 was created and attached, but agent-3 was still not able to attach:

complete output:

 tf apply --auto-approve
random_password.k3s_token: Refreshing state... [id=none]
local_file.traefik_config: Refreshing state... [id=25ba84696ee16d68f5b98f6ea6b70bb14c3c530c]
hcloud_placement_group.k3s_placement_group: Refreshing state... [id=19653]
hcloud_ssh_key.default: Refreshing state... [id=5492430]
hcloud_network.k3s: Refreshing state... [id=1352333]
hcloud_firewall.k3s: Refreshing state... [id=290151]
hcloud_network_subnet.k3s: Refreshing state... [id=1352333-10.0.0.0/16]
local_file.hetzner_csi_config: Refreshing state... [id=aa232912bcf86722e32b698e1e077522c7f02a9d]
local_file.hetzner_ccm_config: Refreshing state... [id=f5ec6cb5689cb5830d04857365d567edae562174]
hcloud_server.first_control_plane: Refreshing state... [id=17736249]
hcloud_server.control_planes[0]: Refreshing state... [id=17736377]
hcloud_server.control_planes[1]: Refreshing state... [id=17736378]
hcloud_server.agents[5]: Refreshing state... [id=17861319]
hcloud_server.agents[3]: Refreshing state... [id=17869801]
hcloud_server.agents[0]: Refreshing state... [id=17736379]
hcloud_server.agents[1]: Refreshing state... [id=17736385]
hcloud_server.agents[4]: Refreshing state... [id=17858945]
hcloud_server.agents[2]: Refreshing state... [id=17736383]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # hcloud_placement_group.k3s_placement_group has been changed
  ~ resource "hcloud_placement_group" "k3s_placement_group" {
        id      = "19653"
        name    = "k3s-placement-group"
      ~ servers = [
          + 17869801,
            # (8 unchanged elements hidden)
        ]
        # (2 unchanged attributes hidden)
    }
  # hcloud_server.agents[3] has been changed
  ~ resource "hcloud_server" "agents" {
      + datacenter         = "fsn1-dc14"
        id                 = "17869801"
      + ipv4_address       = "78.47.82.149"
      + ipv6_address       = "2a01:4f8:c17:8d4a::1"
      + ipv6_network       = "2a01:4f8:c17:8d4a::/64"
        name               = "k3s-agent-3"
      + status             = "running"
        # (12 unchanged attributes hidden)

      - network {
          - alias_ips  = [] -> null
          - ip         = "10.0.0.8" -> null
          - network_id = 1352333 -> null
        }
    }
  # hcloud_firewall.k3s has been changed
  ~ resource "hcloud_firewall" "k3s" {
        id     = "290151"
        name   = "k3s-firewall"
        # (1 unchanged attribute hidden)

      + apply_to {
          + server = 17869801
        }

        # (21 unchanged blocks hidden)
    }

Unless you have made equivalent changes to your configuration, or ignored the relevant attributes using ignore_changes, the following plan may
include actions to undo or respond to these changes.

─────────────────────────────────────────────────────────────────────────────────────────────────────

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # hcloud_server.agents[3] is tainted, so must be replaced
-/+ resource "hcloud_server" "agents" {
      + backup_window      = (known after apply)
      ~ datacenter         = "fsn1-dc14" -> (known after apply)
      ~ id                 = "17869801" -> (known after apply)
      ~ ipv4_address       = "78.47.82.xxx" -> (known after apply)
      ~ ipv6_address       = "2a01:4f8:c17:xxxx::1" -> (known after apply)
      ~ ipv6_network       = "2a01:4f8:c17:xxxx::/64" -> (known after apply)
        name               = "k3s-agent-3"
      ~ status             = "running" -> (known after apply)
        # (12 unchanged attributes hidden)

      + network {
          + alias_ips   = []
          + ip          = "10.0.0.8"
          + mac_address = (known after apply)
          + network_id  = 1352333
        }
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  ~ agents_public_ip = [
        # (2 unchanged elements hidden)
        "138.201.246.xxx",
      + (known after apply),
      + "78.46.163.xxx",
      + "49.12.100.xxx",
    ]
hcloud_server.agents[3]: Destroying... [id=17869801]
hcloud_server.agents[3]: Destruction complete after 2s
hcloud_server.agents[3]: Creating...
hcloud_server.agents[3]: Still creating... [10s elapsed]
╷
│ Error: hcloud/inlineAttachServerToNetwork: attach server to network: provided IP is not available (ip_not_available)
│
│   with hcloud_server.agents[3],
│   on agents.tf line 1, in resource "hcloud_server" "agents":
│    1: resource "hcloud_server" "agents" {
│
╵

How can I get agent-3 working again?

thank you in advance
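One recovery path that is often suggested for `ip_not_available` (a sketch, untested against this exact state): remove the stale server both from Hetzner and from the Terraform state, so the private IP is actually freed before Terraform tries to re-claim it.

```shell
# Hypothetical recovery sketch for a stuck agent-3:
hcloud server delete k3s-agent-3              # free the server and its attached private IP
terraform state rm 'hcloud_server.agents[3]'  # forget the tainted resource in state
terraform apply                               # re-create agent-3 from scratch
```

The IP release on Hetzner's side can take a short while, so waiting a minute between delete and apply may be necessary.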

Unable to deploy Kubernetes cluster. SSH connection fails

Hi,
I'm trying to deploy a cluster with the same config as the template, but when the servers reboot, the deployment is unable to connect to them via SSH.

module.control_planes[1].hcloud_server.server (remote-exec): Connecting to remote host via SSH...
module.control_planes[1].hcloud_server.server (remote-exec):   Host: XXXXX
module.control_planes[1].hcloud_server.server (remote-exec):   User: root
module.control_planes[1].hcloud_server.server (remote-exec):   Password: false
module.control_planes[1].hcloud_server.server (remote-exec):   Private key: true
module.control_planes[1].hcloud_server.server (remote-exec):   Certificate: false
module.control_planes[1].hcloud_server.server (remote-exec):   SSH Agent: true
module.control_planes[1].hcloud_server.server (remote-exec):   Checking Host Key: false
module.control_planes[1].hcloud_server.server (remote-exec):   Target Platform: unix
module.agents["agent-small-0"].hcloud_server.server: Still creating... [4m30s elapsed]
module.agents["agent-big-0"].hcloud_server.server: Still creating... [4m30s elapsed]

thank you - and quick question about tls

Hi - first, THANK YOU so much for all the effort you've put into this repository @mysticaltech. I am very new to Kubernetes and the whole k3s ecosystem, and your effort to help others is really wonderful. Kudos to you, sir :)

Second - I have a small issue with TLS that I cannot get to work. I simply want blog.domain.com to be secured, and though there are lots of ways to get a cert, I'm trying to use a wildcard certificate of my own. I've successfully created the secret using:

kubectl create secret tls domain-tls --cert ./domain.crt --key ./domain.key

And I have the ingress set up like this (bound to a ghost deployment):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: domain-ingress-blog
spec:
  tls:
    - hosts:
        - blog.domain.com
      secretName: domain-tls
  rules:
  - host: blog.domain.com
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: domain-blog-ghost
            port:
              number: 80

And I've created an A record with my DNS provider pointing blog.domain.com to the Traefik load balancer IP.

I am missing something, though, because the default Traefik certificate is always shown when I hit blog.domain.com:

(screenshot of the default Traefik certificate)

My rendered Traefik yaml template, resulting from the terraform apply, is below:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    service:
      enabled: true
      type: LoadBalancer
      annotations:
        "load-balancer.hetzner.cloud/name": "traefik"
        # make hetzners load-balancer connect to our nodes via our private k3s-net.
        "load-balancer.hetzner.cloud/use-private-ip": "true"
        # keep hetzner-ccm from exposing our private ingress ip, which in general isn't routeable from the public internet.
        "load-balancer.hetzner.cloud/disable-private-ingress": "true"
        # disable ipv6 by default, because external-dns doesn't support AAAA for hcloud yet https://github.com/kubernetes-sigs/external-dns/issues/2044
        "load-balancer.hetzner.cloud/ipv6-disabled": "false"
        "load-balancer.hetzner.cloud/location": "ash"
        "load-balancer.hetzner.cloud/type": "lb11"
        "load-balancer.hetzner.cloud/uses-proxyprotocol": "true"
        # "load-balancer.hetzner.cloud/http-redirect-http": "true"
        "load-balancer.hetzner.cloud/http-sticky-sessions": "true"
    additionalArguments:
      - "--entryPoints.web.proxyProtocol.trustedIPs=127.0.0.1/32,10.0.0.0/8"
      - "--entryPoints.websecure.proxyProtocol.trustedIPs=127.0.0.1/32,10.0.0.0/8"
      - "--entryPoints.web.forwardedHeaders.trustedIPs=127.0.0.1/32,10.0.0.0/8"
      - "--entryPoints.websecure.forwardedHeaders.trustedIPs=127.0.0.1/32,10.0.0.0/8"

Any help would be greatly appreciated. Thanks again :)
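A frequent cause of Traefik falling back to its default certificate is the TLS secret living in a different namespace than the Ingress, since Kubernetes TLS secrets are namespaced. A quick check, sketched below (resource names taken from the manifests above; run both against the namespace the Ingress actually lives in):

```shell
# The secret must exist in the same namespace as the Ingress that references it:
kubectl get ingress domain-ingress-blog -o jsonpath='{.metadata.namespace}'
kubectl get secret domain-tls -o jsonpath='{.type}'   # expected: kubernetes.io/tls
```

If the secret type is not `kubernetes.io/tls`, or the namespaces differ, Traefik silently serves its default certificate instead.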

Where to terminate TLS connections?

Hi,

I really appreciate your work and have successfully created my own k8s cluster on the Hetzner cloud :)

Now I wanted to add TLS/HTTPS support and have the TLS connection terminate at the load balancer. Automatically retrieved certificates from Let's Encrypt seem fine. However, the load balancer does not seem to work when I change its service

from
"[tcp] 443 -> 31028"
to
"[https] 443 -> 30468"

I have completely removed the TCP service for port 80, because I think I will not need it.
The load balancer shows 'unhealthy' for this service, and I cannot access any ingress anymore.

Can someone please advise me on how to achieve TLS support with the Hetzner load balancer and Traefik ingress? :) Thanks!
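If the Hetzner load balancer itself should terminate TLS, the hcloud cloud controller manager exposes that through service annotations rather than manual edits in the console (manual changes get reverted by the CCM). A sketch, assuming the `protocol` and `http-certificates` annotations documented for hcloud-ccm; the certificate ID "12345" and the NodePort are placeholders:

```yaml
# Sketch: let the Hetzner LB terminate TLS via hcloud-ccm annotations.
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: kube-system
  annotations:
    load-balancer.hetzner.cloud/protocol: "https"
    # ID (or name) of a certificate created/uploaded in the Hetzner console:
    load-balancer.hetzner.cloud/http-certificates: "12345"
spec:
  type: LoadBalancer
  ports:
    - port: 443
      targetPort: 30468  # placeholder NodePort from the question above
```

With termination at the LB, traffic reaches Traefik as plain HTTP, so the health check and entrypoint must expect HTTP rather than TLS.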

Creating a cluster with 3 control planes, the 3rd one fails to join

So I spun up a cluster with 3 control planes and 2 agents, then noticed that only 2 servers and 2 agents were present, even though hcloud server list listed them all.

(screenshot)

Turns out it failed to join because of "too many learner" errors, as follows:

(screenshot)

So I issued systemctl start k3s-server another time and it worked. This means we have to wait and make sure the other servers have started first, retrying if necessary, before returning success.

(screenshot)
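A bounded retry wrapper along those lines could look like this (plain POSIX shell; a sketch, not the repo's actual provisioner code — in the provisioner you would substitute `systemctl start k3s-server` for the placeholder command):

```shell
# Retry a command until it succeeds, giving up after a fixed number of attempts.
retry() {
  max=$1; shift
  n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $max attempts" >&2
      return 1
    fi
    sleep 5
  done
}

# In the provisioner this would be e.g.: retry 10 systemctl start k3s-server
retry 3 true && echo "service started"
```

This keeps the "too many learner" case self-healing: etcd refuses the join until the earlier learners are promoted, and the loop simply tries again.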

Fix for agents launched between Feb 10 and Feb 15 2022

Hello folks, there was an error in the definition of agents launched from Feb 10 to Feb 15. Here's the fix; you have two options.

1/ Scale down to 0 agents and apply, then scale back up and apply again.

2/ Log in via SSH to each agent and issue a few commands to fix it.

  • Get the agent IP with hcloud server list
  • Login via ssh root@IP -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no
  • Issue the following commands:
    systemctl disable k3s-server
    systemctl stop k3s-server
    systemctl --now enable k3s-agent
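The per-agent commands above could also be scripted across all agents (a sketch only: it assumes the hcloud CLI, a uniform SSH key, and that every listed server is an agent — adjust the filter to your naming, and note the column name may differ between hcloud CLI versions):

```shell
# Apply the k3s-server -> k3s-agent fix to every server in the project.
for ip in $(hcloud server list -o noheader -o columns=ipv4); do
  ssh -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no "root@$ip" \
    'systemctl disable k3s-server; systemctl stop k3s-server; systemctl enable --now k3s-agent'
done
```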
    

local-exec provisioner error

I followed the instructions from the readme file, and the error
local-exec provisioner error Error running command 'kubectl -n kube-system create secret generic hcloud exit status 1. Output: Unable to connect to the server: dial tcp
pops up.
Am I missing anything?

Stuck on `remote-exec`

Hi @mysticaltech !

Finally got time to test this beast repo this weekend ;)
This might sound a bit silly, but I'm kind of stuck on the remote-exec for the initialization of the first_control_plane.

I cloned the repo and created a new terraform.tfvars with the corresponding token, public key, and private key.
The server spins up fine, but it seems I just can't connect to it.

However, if I manually connect from the host via ssh root@IP, this works!

Let me know what I'm missing - and great stuff! thank you for sharing this repo.

(screenshot)

staging not found

commit e7f016f

staging is not available for me:

hcloud_server.first_control_plane (remote-exec): 02/10 12:20:52 [ERROR] CUID#7 - Download aborted. URI=https://raw.githubusercontent.com/kube-hetzner/kube-hetzner/staging/.files/openSUSE-MicroOS.x86_64-k3s-kvm-and-xen.qcow2.meta4
hcloud_server.first_control_plane (remote-exec): Exception: [AbstractCommand.cc:351] errorCode=3 URI=https://raw.githubusercontent.com/kube-hetzner/kube-hetzner/staging/.files/openSUSE-MicroOS.x86_64-k3s-kvm-and-xen.qcow2.meta4
hcloud_server.first_control_plane (remote-exec):   -> [HttpSkipResponseCommand.cc:218] errorCode=3 Resource not found

I replaced staging with master, and now it is running.

hcloud_network_subnet.k3s: Still destroying

Somehow terraform destroy keeps hanging on destroying the network:

hcloud_placement_group.k3s_placement_group: Destruction complete after 0s
hcloud_firewall.k3s: Destruction complete after 0s
...
..
.
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 10m40s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 10m50s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 11m0s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 11m10s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 11m20s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 11m30s elapsed]
hcloud_network_subnet.k3s: Still destroying... [id=1352246-10.0.0.0/16, 11m40s elapsed]

I also tried to rerun terraform destroy.
When I manually delete the network in the UI, it finishes a few seconds later.
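The usual culprit for a hanging subnet destroy is a resource created outside of Terraform (typically the load balancer that the Hetzner CCM manages for Traefik) still being attached to the network; Terraform cannot delete the subnet until it is gone. A sketch with the hcloud CLI ("traefik" is the default LB name used in the config above; adjust if yours differs):

```shell
# Remove the CCM-managed LB that is still attached to the network, then retry:
hcloud load-balancer delete traefik
terraform destroy
```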

Error when provisioning

I followed the README and am getting this error. It seems to have created the 3 control planes, the network, and the firewall, but not the nodepool/nodes or the load balancer.

 Error: invalid input in field 'name' (invalid_input): [name => [Name must be a valid hostname.]]
│
│   with module.agents["myname_nodes-1"].hcloud_server.server,
│   on modules/host/main.tf line 1, in resource "hcloud_server" "server":
│    1: resource "hcloud_server" "server" {
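The `invalid_input` on `name` comes from the underscore in the nodepool name ("myname_nodes"): the nodepool name becomes part of the server's hostname, and hostnames allow letters, digits, and hyphens, but not underscores. A sketch of a valid definition (pool name illustrative; note that subnet map keys are not hostnames, so underscores are fine there):

```hcl
agent_nodepools = {
  myname-nodes = {            # hyphens are fine; underscores are not valid in hostnames
    server_type = "cpx21",
    count       = 2,
    subnet      = "agent_small",
  }
}
```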

New "MicroOS" version does not deploy, stuck on control-plane-1

(screen recording from 2022-02-13)

Can someone please tell me what I am doing wrong?

โฏ tf apply --auto-approve

Terraform used the selected providers to generate the following execution plan. Resource actions are
indicated with the following symbols:
  + create
 <= read (data resources)

Terraform will perform the following actions:

  # data.remote_file.kubeconfig will be read during apply
  # (config refers to values not yet known)
 <= data "remote_file" "kubeconfig"  {
      + content = (known after apply)
      + id      = (known after apply)
      + path    = "/etc/rancher/k3s/k3s.yaml"

      + conn {
          + agent       = false
          + host        = (known after apply)
          + port        = 22
          + private_key = (sensitive value)
          + user        = "root"
        }
    }

  # hcloud_firewall.k3s will be created
  + resource "hcloud_firewall" "k3s" {
      + id     = (known after apply)
      + labels = (known after apply)
      + name   = "k3s"

      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + protocol        = "icmp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + port            = "123"
          + protocol        = "udp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + port            = "443"
          + protocol        = "tcp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + port            = "53"
          + protocol        = "tcp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + port            = "53"
          + protocol        = "udp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = [
              + "0.0.0.0/0",
            ]
          + direction       = "out"
          + port            = "80"
          + protocol        = "tcp"
          + source_ips      = []
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + protocol        = "icmp"
          + source_ips      = [
              + "0.0.0.0/0",
            ]
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + protocol        = "icmp"
          + source_ips      = [
              + "10.0.0.0/8",
              + "127.0.0.1/32",
              + "169.254.169.254/32",
              + "213.239.246.1/32",
            ]
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + port            = "22"
          + protocol        = "tcp"
          + source_ips      = [
              + "0.0.0.0/0",
            ]
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + port            = "6443"
          + protocol        = "tcp"
          + source_ips      = [
              + "0.0.0.0/0",
            ]
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + port            = "any"
          + protocol        = "tcp"
          + source_ips      = [
              + "10.0.0.0/8",
              + "127.0.0.1/32",
              + "169.254.169.254/32",
              + "213.239.246.1/32",
            ]
        }
      + rule {
          + destination_ips = []
          + direction       = "in"
          + port            = "any"
          + protocol        = "udp"
          + source_ips      = [
              + "10.0.0.0/8",
              + "127.0.0.1/32",
              + "169.254.169.254/32",
              + "213.239.246.1/32",
            ]
        }
    }

  # hcloud_network.k3s will be created
  + resource "hcloud_network" "k3s" {
      + delete_protection = false
      + id                = (known after apply)
      + ip_range          = "10.0.0.0/8"
      + name              = "k3s"
    }

  # hcloud_network_subnet.k3s will be created
  + resource "hcloud_network_subnet" "k3s" {
      + gateway      = (known after apply)
      + id           = (known after apply)
      + ip_range     = "10.0.0.0/16"
      + network_id   = (known after apply)
      + network_zone = "eu-central"
      + type         = "cloud"
    }

  # hcloud_placement_group.k3s will be created
  + resource "hcloud_placement_group" "k3s" {
      + id      = (known after apply)
      + labels  = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + name    = "k3s"
      + servers = (known after apply)
      + type    = "spread"
    }

  # hcloud_server.agents[0] will be created
  + resource "hcloud_server" "agents" {
      + backup_window      = (known after apply)
      + backups            = false
      + datacenter         = (known after apply)
      + delete_protection  = false
      + firewall_ids       = (known after apply)
      + id                 = (known after apply)
      + image              = "ubuntu-20.04"
      + ipv4_address       = (known after apply)
      + ipv6_address       = (known after apply)
      + ipv6_network       = (known after apply)
      + keep_disk          = false
      + labels             = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + location           = "fsn1"
      + name               = "k3s-agent-0"
      + placement_group_id = (known after apply)
      + rebuild_protection = false
      + rescue             = "linux64"
      + server_type        = "cpx21"
      + ssh_keys           = (known after apply)
      + status             = (known after apply)

      + network {
          + alias_ips   = []
          + ip          = "10.0.1.1"
          + mac_address = (known after apply)
          + network_id  = (known after apply)
        }
    }

  # hcloud_server.agents[1] will be created
  + resource "hcloud_server" "agents" {
      + backup_window      = (known after apply)
      + backups            = false
      + datacenter         = (known after apply)
      + delete_protection  = false
      + firewall_ids       = (known after apply)
      + id                 = (known after apply)
      + image              = "ubuntu-20.04"
      + ipv4_address       = (known after apply)
      + ipv6_address       = (known after apply)
      + ipv6_network       = (known after apply)
      + keep_disk          = false
      + labels             = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + location           = "fsn1"
      + name               = "k3s-agent-1"
      + placement_group_id = (known after apply)
      + rebuild_protection = false
      + rescue             = "linux64"
      + server_type        = "cpx21"
      + ssh_keys           = (known after apply)
      + status             = (known after apply)

      + network {
          + alias_ips   = []
          + ip          = "10.0.1.2"
          + mac_address = (known after apply)
          + network_id  = (known after apply)
        }
    }

  # hcloud_server.control_planes[0] will be created
  + resource "hcloud_server" "control_planes" {
      + backup_window      = (known after apply)
      + backups            = false
      + datacenter         = (known after apply)
      + delete_protection  = false
      + firewall_ids       = (known after apply)
      + id                 = (known after apply)
      + image              = "ubuntu-20.04"
      + ipv4_address       = (known after apply)
      + ipv6_address       = (known after apply)
      + ipv6_network       = (known after apply)
      + keep_disk          = false
      + labels             = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + location           = "fsn1"
      + name               = "k3s-control-plane-1"
      + placement_group_id = (known after apply)
      + rebuild_protection = false
      + rescue             = "linux64"
      + server_type        = "cpx11"
      + ssh_keys           = (known after apply)
      + status             = (known after apply)

      + network {
          + alias_ips   = []
          + ip          = "10.0.0.3"
          + mac_address = (known after apply)
          + network_id  = (known after apply)
        }
    }

  # hcloud_server.control_planes[1] will be created
  + resource "hcloud_server" "control_planes" {
      + backup_window      = (known after apply)
      + backups            = false
      + datacenter         = (known after apply)
      + delete_protection  = false
      + firewall_ids       = (known after apply)
      + id                 = (known after apply)
      + image              = "ubuntu-20.04"
      + ipv4_address       = (known after apply)
      + ipv6_address       = (known after apply)
      + ipv6_network       = (known after apply)
      + keep_disk          = false
      + labels             = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + location           = "fsn1"
      + name               = "k3s-control-plane-2"
      + placement_group_id = (known after apply)
      + rebuild_protection = false
      + rescue             = "linux64"
      + server_type        = "cpx11"
      + ssh_keys           = (known after apply)
      + status             = (known after apply)

      + network {
          + alias_ips   = []
          + ip          = "10.0.0.4"
          + mac_address = (known after apply)
          + network_id  = (known after apply)
        }
    }

  # hcloud_server.first_control_plane will be created
  + resource "hcloud_server" "first_control_plane" {
      + backup_window      = (known after apply)
      + backups            = false
      + datacenter         = (known after apply)
      + delete_protection  = false
      + firewall_ids       = (known after apply)
      + id                 = (known after apply)
      + image              = "ubuntu-20.04"
      + ipv4_address       = (known after apply)
      + ipv6_address       = (known after apply)
      + ipv6_network       = (known after apply)
      + keep_disk          = false
      + labels             = {
          + "engine"      = "k3s"
          + "provisioner" = "terraform"
        }
      + location           = "fsn1"
      + name               = "k3s-control-plane-0"
      + placement_group_id = (known after apply)
      + rebuild_protection = false
      + rescue             = "linux64"
      + server_type        = "cpx11"
      + ssh_keys           = (known after apply)
      + status             = (known after apply)

      + network {
          + alias_ips   = []
          + ip          = "10.0.0.2"
          + mac_address = (known after apply)
          + network_id  = (known after apply)
        }
    }

  # hcloud_ssh_key.k3s will be created
  + resource "hcloud_ssh_key" "k3s" {
      + fingerprint = (known after apply)
      + id          = (known after apply)
      + name        = "k3s"
      + public_key  = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5mH6iwpbJY+ssGIJUVsClE5LO/e9/YhA2k+oOP6VzxK2f9GutJu6wYNd6re5Ma1BRZL1ld95QKs/k1F1HWq75y1VJMawD+72+7OR6eT1nwJyrFDVk801UgCuOPJtLGAjNXx9uT2AMKZ08crnRGap3XzjLynVxoeETndINMew3LKnaL3zGkrDRRZnysrIoB3c8ywS9WlQxB5M3zdMICQ6aqsonIHChDybHnKb+wEKFUbND5ga/V1VG2GUR18uNGu01Zpxxof566C+26owSfrnA9R7KllUI/+/zYTqFRt5a2F3B/k0I+5WhSsAuRbI/eundl1oTP4sAtJ8qKBt20VYL [email protected]"
    }

  # local_file.kubeconfig will be created
  + resource "local_file" "kubeconfig" {
      + directory_permission = "0777"
      + file_permission      = "600"
      + filename             = "kubeconfig.yaml"
      + id                   = (known after apply)
      + sensitive_content    = (sensitive value)
    }

  # random_password.k3s_token will be created
  + resource "random_password" "k3s_token" {
      + id          = (known after apply)
      + length      = 48
      + lower       = true
      + min_lower   = 0
      + min_numeric = 0
      + min_special = 0
      + min_upper   = 0
      + number      = true
      + result      = (sensitive value)
      + special     = false
      + upper       = true
    }

Plan: 12 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + agents_public_ip        = [
      + (known after apply),
      + (known after apply),
    ]
  + controlplanes_public_ip = [
      + (known after apply),
      + (known after apply),
      + (known after apply),
    ]
  + kubeconfig              = (sensitive value)
  + kubeconfig_file         = (sensitive value)
random_password.k3s_token: Creating...
random_password.k3s_token: Creation complete after 0s [id=none]
hcloud_network.k3s: Creating...
hcloud_placement_group.k3s: Creating...
hcloud_ssh_key.k3s: Creating...
hcloud_firewall.k3s: Creating...
hcloud_placement_group.k3s: Creation complete after 1s [id=21532]
hcloud_ssh_key.k3s: Creation complete after 1s [id=5557860]
hcloud_network.k3s: Creation complete after 1s [id=1370757]
hcloud_network_subnet.k3s: Creating...
hcloud_firewall.k3s: Creation complete after 1s [id=300569]
hcloud_network_subnet.k3s: Creation complete after 1s [id=1370757-10.0.0.0/16]
hcloud_server.first_control_plane: Creating...
hcloud_server.first_control_plane: Still creating... [10s elapsed]
hcloud_server.first_control_plane: Provisioning with 'file'...
hcloud_server.first_control_plane: Still creating... [20s elapsed]
hcloud_server.first_control_plane: Still creating... [30s elapsed]
hcloud_server.first_control_plane: Provisioning with 'remote-exec'...
hcloud_server.first_control_plane (remote-exec): Connecting to remote host via SSH...
hcloud_server.first_control_plane (remote-exec):   Host: 49.12.221.176
hcloud_server.first_control_plane (remote-exec):   User: root
hcloud_server.first_control_plane (remote-exec):   Password: false
hcloud_server.first_control_plane (remote-exec):   Private key: true
hcloud_server.first_control_plane (remote-exec):   Certificate: false
hcloud_server.first_control_plane (remote-exec):   SSH Agent: true
hcloud_server.first_control_plane (remote-exec):   Checking Host Key: false
hcloud_server.first_control_plane (remote-exec):   Target Platform: unix
hcloud_server.first_control_plane (remote-exec): Connected!
hcloud_server.first_control_plane (remote-exec): + apt-get install -y aria2
hcloud_server.first_control_plane: Still creating... [40s elapsed]
hcloud_server.first_control_plane (remote-exec): Reading package lists... 0%
hcloud_server.first_control_plane (remote-exec): Reading package lists... 0%
hcloud_server.first_control_plane (remote-exec): Reading package lists... 16%
hcloud_server.first_control_plane (remote-exec): Reading package lists... Done
hcloud_server.first_control_plane (remote-exec): Building dependency tree... 0%
hcloud_server.first_control_plane (remote-exec): Building dependency tree... 0%
hcloud_server.first_control_plane (remote-exec): Building dependency tree... 50%
hcloud_server.first_control_plane (remote-exec): Building dependency tree... 50%
hcloud_server.first_control_plane (remote-exec): Building dependency tree... Done
hcloud_server.first_control_plane (remote-exec): Reading state information... 0%
hcloud_server.first_control_plane (remote-exec): Reading state information... 0%
hcloud_server.first_control_plane (remote-exec): Reading state information... Done
hcloud_server.first_control_plane (remote-exec): The following additional packages will be installed:
hcloud_server.first_control_plane (remote-exec):   libaria2-0 libc-ares2
hcloud_server.first_control_plane (remote-exec): The following NEW packages will be installed:
hcloud_server.first_control_plane (remote-exec):   aria2 libaria2-0 libc-ares2
hcloud_server.first_control_plane (remote-exec): 0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
hcloud_server.first_control_plane (remote-exec): Need to get 1,571 kB of archives.
hcloud_server.first_control_plane (remote-exec): After this operation, 6,225 kB of additional disk space will be used.
hcloud_server.first_control_plane (remote-exec): 0% [Working]
hcloud_server.first_control_plane (remote-exec): Get:1 http://mirror.hetzner.com/debian/packages bullseye/main amd64 libc-ares2 amd64 1.17.1-1+deb11u1 [102 kB]
hcloud_server.first_control_plane (remote-exec): 1% [1 libc-ares2 14.2 kB/102 kB 14%]
hcloud_server.first_control_plane (remote-exec): 12% [Working]
hcloud_server.first_control_plane (remote-exec): Get:2 http://mirror.hetzner.com/debian/packages bullseye/main amd64 libaria2-0 amd64 1.35.0-3 [1,107 kB]
hcloud_server.first_control_plane (remote-exec): 13% [2 libaria2-0 28.6 kB/1,107 kB 3%]
hcloud_server.first_control_plane (remote-exec): 75% [Waiting for headers]
hcloud_server.first_control_plane (remote-exec): Get:3 http://mirror.hetzner.com/debian/packages bullseye/main amd64 aria2 amd64 1.35.0-3 [362 kB]
hcloud_server.first_control_plane (remote-exec): 77% [3 aria2 35.8 kB/362 kB 10%]
hcloud_server.first_control_plane (remote-exec): 100% [Working]
hcloud_server.first_control_plane (remote-exec): Fetched 1,571 kB in 0s (4,481 kB/s)
hcloud_server.first_control_plane (remote-exec): Selecting previously unselected package libc-ares2:amd64.
hcloud_server.first_control_plane (remote-exec): (Reading database ... 62163 files and directories currently installed.)
hcloud_server.first_control_plane (remote-exec): Preparing to unpack .../libc-ares2_1.17.1-1+deb11u1_amd64.deb ...
hcloud_server.first_control_plane (remote-exec): Unpacking libc-ares2:amd64 (1.17.1-1+deb11u1) ...
hcloud_server.first_control_plane (remote-exec): Selecting previously unselected package libaria2-0:amd64.
hcloud_server.first_control_plane (remote-exec): Preparing to unpack .../libaria2-0_1.35.0-3_amd64.deb ...
hcloud_server.first_control_plane (remote-exec): Unpacking libaria2-0:amd64 (1.35.0-3) ...
hcloud_server.first_control_plane (remote-exec): Selecting previously unselected package aria2.
hcloud_server.first_control_plane (remote-exec): Preparing to unpack .../aria2_1.35.0-3_amd64.deb ...
hcloud_server.first_control_plane (remote-exec): Unpacking aria2 (1.35.0-3) ...
hcloud_server.first_control_plane (remote-exec): Setting up libc-ares2:amd64 (1.17.1-1+deb11u1) ...
hcloud_server.first_control_plane (remote-exec): Setting up libaria2-0:amd64 (1.35.0-3) ...
hcloud_server.first_control_plane (remote-exec): Setting up aria2 (1.35.0-3) ...
hcloud_server.first_control_plane (remote-exec): Processing triggers for man-db (2.9.4-2) ...
hcloud_server.first_control_plane (remote-exec): Processing triggers for libc-bin (2.31-13+deb11u2) ...
hcloud_server.first_control_plane (remote-exec): + aria2c --follow-metalink=mem https://download.opensuse.org/tumbleweed/appliances/openSUSE-MicroOS.x86_64-k3s-kvm-and-xen.qcow2.meta4

hcloud_server.first_control_plane (remote-exec): 02/13 07:16:42 [NOTICE] Downloading 1 item(s)
hcloud_server.first_control_plane (remote-exec): [#19122d 0B/0B CN:1 DL:0B]
hcloud_server.first_control_plane (remote-exec): 02/13 07:16:43 [NOTICE] Download complete: [MEMORY]openSUSE-MicroOS.x86_64-16.0.0-k3s-kvm-and-xen-Snapshot20220210.qcow2.meta4
hcloud_server.first_control_plane (remote-exec): [#3aa26e 15MiB/601MiB(2%) CN:5 DL:20MiB
hcloud_server.first_control_plane (remote-exec): [#3aa26e 407MiB/601MiB(67%) CN:5 DL:232
hcloud_server.first_control_plane (remote-exec): [#3aa26e 549MiB/601MiB(91%) CN:2 DL:200
hcloud_server.first_control_plane (remote-exec): [#3aa26e 563MiB/601MiB(93%) CN:2 DL:150
hcloud_server.first_control_plane: Still creating... [50s elapsed]
hcloud_server.first_control_plane (remote-exec): [#3aa26e 573MiB/601MiB(95%) CN:1 DL:121
hcloud_server.first_control_plane (remote-exec): [#3aa26e 576MiB/601MiB(95%) CN:1 DL:100
hcloud_server.first_control_plane (remote-exec): [#3aa26e 578MiB/601MiB(96%) CN:1 DL:85M
hcloud_server.first_control_plane (remote-exec): [#3aa26e 580MiB/601MiB(96%) CN:1 DL:74M
hcloud_server.first_control_plane (remote-exec): [#3aa26e 583MiB/601MiB(97%) CN:1 DL:66M
hcloud_server.first_control_plane (remote-exec): [#3aa26e 586MiB/601MiB(97%) CN:1 DL:60M
hcloud_server.first_control_plane (remote-exec): [#3aa26e 590MiB/601MiB(98%) CN:1 DL:54M

hcloud_server.first_control_plane (remote-exec): 02/13 07:16:54 [NOTICE] Download complete: /root/openSUSE-MicroOS.x86_64-16.0.0-k3s-kvm-and-xen-Snapshot20220210.qcow2

hcloud_server.first_control_plane (remote-exec): Download Results:
hcloud_server.first_control_plane (remote-exec): gid   |stat|avg speed  |path/URI
hcloud_server.first_control_plane (remote-exec): ======+====+===========+=======================================================
hcloud_server.first_control_plane (remote-exec): 19122d|OK  |   141KiB/s|[MEMORY]openSUSE-MicroOS.x86_64-16.0.0-k3s-kvm-and-xen-Snapshot20220210.qcow2.meta4
hcloud_server.first_control_plane (remote-exec): 3aa26e|OK  |    53MiB/s|/root/openSUSE-MicroOS.x86_64-16.0.0-k3s-kvm-and-xen-Snapshot20220210.qcow2

hcloud_server.first_control_plane (remote-exec): Status Legend:
hcloud_server.first_control_plane (remote-exec): (OK):download completed.
hcloud_server.first_control_plane (remote-exec): + + grep -ie ^opensuse.*microos.*k3s.*qcow2$
hcloud_server.first_control_plane (remote-exec): ls -a
hcloud_server.first_control_plane (remote-exec): + qemu-img convert -p -f qcow2 -O host_device openSUSE-MicroOS.x86_64-16.0.0-k3s-kvm-and-xen-Snapshot20220210.qcow2 /dev/sda
hcloud_server.first_control_plane (remote-exec):     (0.00/100%)
hcloud_server.first_control_plane (remote-exec):     (100.00/100%)
hcloud_server.first_control_plane (remote-exec): + sgdisk -e /dev/sda
hcloud_server.first_control_plane (remote-exec): The operation has completed successfully.
hcloud_server.first_control_plane (remote-exec): + parted -s /dev/sda resizepart 4 99%
hcloud_server.first_control_plane (remote-exec): + parted -s /dev/sda mkpart primary ext2 99% 100%
hcloud_server.first_control_plane (remote-exec): + partprobe /dev/sda
hcloud_server.first_control_plane (remote-exec): + udevadm settle
hcloud_server.first_control_plane (remote-exec): + fdisk -l /dev/sda
hcloud_server.first_control_plane (remote-exec): Disk /dev/sda: 38.15 GiB, 40961572864 bytes, 80003072 sectors
hcloud_server.first_control_plane (remote-exec): Disk model: QEMU HARDDISK
hcloud_server.first_control_plane (remote-exec): Units: sectors of 1 * 512 = 512 bytes
hcloud_server.first_control_plane (remote-exec): Sector size (logical/physical): 512 bytes / 512 bytes
hcloud_server.first_control_plane (remote-exec): I/O size (minimum/optimal): 512 bytes / 512 bytes
hcloud_server.first_control_plane (remote-exec): Disklabel type: gpt
hcloud_server.first_control_plane (remote-exec): Disk identifier: EC33AA26-C0DC-4B6C-AF09-4CA8108C7753

hcloud_server.first_control_plane (remote-exec): Device        Start      End  Sectors  Size Type
hcloud_server.first_control_plane (remote-exec): /dev/sda1      2048     6143     4096    2M BIOS
hcloud_server.first_control_plane (remote-exec): /dev/sda2      6144    47103    40960   20M EFI
hcloud_server.first_control_plane (remote-exec): /dev/sda3     47104 31438847 31391744   15G Linu
hcloud_server.first_control_plane (remote-exec): /dev/sda4  31438848 79203041 47764194 22.8G Linu
hcloud_server.first_control_plane (remote-exec): /dev/sda5  79204352 80001023   796672  389M Linu
hcloud_server.first_control_plane (remote-exec): + mount /dev/sda4 /mnt/
hcloud_server.first_control_plane (remote-exec): + btrfs filesystem resize max /mnt
hcloud_server.first_control_plane (remote-exec): Resize '/mnt' of 'max'
hcloud_server.first_control_plane (remote-exec): + umount /mnt
hcloud_server.first_control_plane (remote-exec): + mke2fs -L ignition /dev/sda5
hcloud_server.first_control_plane (remote-exec): mke2fs 1.46.2 (28-Feb-2021)
hcloud_server.first_control_plane (remote-exec): Discarding device blocks: done
hcloud_server.first_control_plane (remote-exec): Creating filesystem with 398336 1k blocks and 99960 inodes
hcloud_server.first_control_plane (remote-exec): Filesystem UUID: 8a3cd038-472e-4812-abe5-ad2f7a5980ef
hcloud_server.first_control_plane (remote-exec): Superblock backups stored on blocks:
hcloud_server.first_control_plane (remote-exec): 	8193, 24577, 40961, 57345, 73729, 204801, 221185

hcloud_server.first_control_plane (remote-exec): Allocating group tables: done
hcloud_server.first_control_plane (remote-exec): Writing inode tables: done
hcloud_server.first_control_plane (remote-exec): Writing superblocks and filesystem accounting information: done

hcloud_server.first_control_plane (remote-exec): + mount /dev/sda5 /mnt
hcloud_server.first_control_plane (remote-exec): + mkdir /mnt/ignition
hcloud_server.first_control_plane (remote-exec): + cp /root/config.ign /mnt/ignition/config.ign
hcloud_server.first_control_plane (remote-exec): + umount /mnt
hcloud_server.first_control_plane: Provisioning with 'local-exec'...
hcloud_server.first_control_plane (local-exec): Executing: ["/bin/sh" "-c" "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa root@49.12.221.176 '(sleep 2; reboot)&'; sleep 3"]
hcloud_server.first_control_plane: Still creating... [1m10s elapsed]
hcloud_server.first_control_plane (local-exec): Warning: Permanently added '49.12.221.176' (ECDSA) to the list of known hosts.
hcloud_server.first_control_plane (local-exec): Connection to 49.12.221.176 closed by remote host.
hcloud_server.first_control_plane: Provisioning with 'local-exec'...
hcloud_server.first_control_plane (local-exec): Executing: ["/bin/sh" "-c" "until ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/id_rsa -o ConnectTimeout=2 root@49.12.221.176 true 2> /dev/null\ndo\n  echo \"Waiting for MicroOS to reboot and become available...\"\n  sleep 2\ndone\n"]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane: Still creating... [1m20s elapsed]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane: Still creating... [1m30s elapsed]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane: Still creating... [1m40s elapsed]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane: Still creating... [1m50s elapsed]
hcloud_server.first_control_plane (local-exec): Waiting for MicroOS to reboot and become available...
hcloud_server.first_control_plane: Provisioning with 'file'...
hcloud_server.first_control_plane: Still creating... [2m0s elapsed]
hcloud_server.first_control_plane: Still creating... [6m50s elapsed]
╷
│ Error: file provisioner error
│
│   with hcloud_server.first_control_plane,
│   on master.tf line 54, in resource "hcloud_server" "first_control_plane":
│   54:   provisioner "file" {
│
│ timeout - last error: dial tcp 49.12.221.176:22: connect: operation timed out
╵

traefik ingress: Skipping service: no endpoints found

Hi again :) So... I thought I understood, but I continue to have issues setting up simple ingress routes. I understand if this is out of scope for the kube-hetzner project, as it's likely just a Traefik configuration detail that I don't understand. In this scenario, I have installed the whoami helm chart:

helm repo add cowboysysop https://cowboysysop.github.io/charts/
helm install my-release cowboysysop/whoami

And I can see that the ingress, service, and pod all deployed correctly into the default namespace, and port-forwarding to the service or pod displays the whoami information. But the external route (https://whoami.site.com) returns an empty reply, as if it was never 'caught' by the ingress, and I keep getting these messages in the Traefik pod's log:

level=error msg="Skipping service: no endpoints found" ingress=whoami-1645104801 namespace=default serviceName=whoami-1645104801 servicePort="&ServiceBackendPort{Name:,Number:80,}" providerName=kubernetes

As always, any help is much appreciated.
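For context, Traefik logs "Skipping service: no endpoints found" when the Service's selector matches no Ready pods, so the Endpoints object behind the Ingress is empty. A minimal sketch of the relationship that has to line up (names and image are illustrative, not taken from the cowboysysop chart):

```yaml
# Hypothetical Service/Deployment pair: the Service selector must exactly
# match the pod template labels, or the Endpoints object stays empty and
# Traefik skips the service.
apiVersion: v1
kind: Service
metadata:
  name: whoami
spec:
  selector:
    app: whoami          # must match the pod labels below
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
spec:
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami      # matched by the Service selector above
    spec:
      containers:
        - name: whoami
          image: traefik/whoami
          ports:
            - containerPort: 80
```

With that in place, `kubectl get endpoints whoami` should list pod IPs; if it shows `<none>`, a selector mismatch or a failing readiness probe is the usual culprit.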

Server IPs blacklisted by opensuse.org

@mnencia @phaer At some point I was getting these errors:

[screenshot: ksnip_20220210-030402]

It wouldn't even let me curl the links. nslookup would give the IP, and that IP works from my personal machine, it responds to HTTPS, but on the node, total silence. Meaning, the IP had downloaded so much that it was blacklisted.

Had to temporarily "host" the meta4 file, over at https://raw.githubusercontent.com/kube-hetzner/kube-hetzner/staging/.files/openSUSE-MicroOS.x86_64-k3s-kvm-and-xen.qcow2.meta4

It works like a charm and is 10x faster to download, I have no idea why 🤯. The only thing is that it would be hard to keep the images up to date that way.

feature request: Nodepools

Feature Nodepools

In its current form, it's only possible to create a cluster with equally sized nodes.
It would be great to be able to have differently sized nodes, like this:

  pools:
    - id: "memory-pool"
      count: 3
      size: CX51
    - id: "worker-pool"
      count: 8
      size: CX11

If we want to go further, it might be possible to spread the cluster across different physical locations too. But I think that would be a really hard nut to crack, because of private networks and so on.

  pools:
    - id: "memory-pool"
      location: "fsn1"
      count: 3
      size: CX51
    - id: "worker-pool"
      location: "hel1"
      count: 8
      size: CX11
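For illustration, the same idea expressed in Terraform's own syntax could look something like the following. This is only a sketch of the proposal; the variable name and attribute names are hypothetical, not the module's actual interface:

```hcl
# Hypothetical nodepool variable, mirroring the YAML proposal above.
variable "agent_nodepools" {
  type = list(object({
    name        = string
    server_type = string
    location    = string
    count       = number
  }))
  default = [
    { name = "memory-pool", server_type = "cx51", location = "fsn1", count = 3 },
    { name = "worker-pool", server_type = "cx11", location = "hel1", count = 8 },
  ]
}
```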

Validate node reboots healthy

@mnencia Like you, I have created a cluster of 3 control planes and 2 agents. To accelerate the process, I issued touch /var/run/reboot-required on all five of them, to simulate a post-update scenario.

I will report back on what happens after that. Please do not hesitate to share your findings here too.

The connection to the server x.x.x.x:6443 was refused - did you specify the right host or port?

Hi all - interestingly, I've left the k3s cluster running for some time, and twice now the cluster has become completely unreachable. This happens after a couple of hours, but I am not sure how many. I think it's somehow tied to the auto-rebooting nature of kured, but that's a guess. If I restart the servers one by one via the Hetzner UI, the cluster comes back online.

This is the error I'm getting:

The connection to the server 5.161.69.37:6443 was refused - did you specify the right host or port?

And when I ssh into that box, this is the status I see for the k3s service:

static:~ # systemctl status k3s-server.service
× k3s-server.service - Lightweight Kubernetes
     Loaded: loaded (/usr/lib/systemd/system/k3s-server.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Fri 2022-02-18 00:27:29 UTC; 2h 16min ago
       Docs: https://k3s.io
    Process: 1478 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 1484 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 1485 ExecStart=/usr/bin/k3s server ${SERVER_OPTS} (code=exited, status=2)
   Main PID: 1485 (code=exited, status=2)
      Tasks: 82
        CPU: 1h 2min 31.811s
     CGroup: /system.slice/k3s-server.service
             ├─2215 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id e579d5aad3973a0ce14cbb971f1415469a925718cc2ed07f3556c70688f631f9 -address /run/k3s/containerd/containerd.sock
             ├─2218 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id a8b619e61d6205f9eb9ae4b7714dc9e5ed68042dcdac8230072682e2f984e972 -address /run/k3s/containerd/containerd.sock
             ├─2432 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id a31940dafe0a541b55356cdaf11845155b4261ea756ecc253eb4f3d17bacb571 -address /run/k3s/containerd/containerd.sock
             ├─2513 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id 6b6eb08633e07d89b0485d57bff93bdfd4f768e8da5d8e87edb8bcb58d7c7086 -address /run/k3s/containerd/containerd.sock
             ├─2656 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id 49b259381dae18c8113ff298eeab50130d6c6836aca4c2a42d57bb577bd0687d -address /run/k3s/containerd/containerd.sock
             └─2856 /usr/sbin/containerd-shim-runc-v2 -namespace k8s.io -id a2b2f399521c8270574ab6f5b505806a4775ff4a76a44af53d87b534503a2088 -address /run/k3s/containerd/containerd.sock

Feb 18 00:27:29 static k3s[1485]:         /home/abuild/rpmbuild/BUILD/k3s-1.22.3-k3s1/vendor/k8s.io/kubernetes/cmd/kube-controller-manager/app/controllermanager.go:272 +0x745
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Failed with result 'exit-code'.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2215 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2218 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2432 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2513 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2656 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Unit process 2856 (containerd-shim) remains running after unit stopped.
Feb 18 00:27:29 static systemd[1]: k3s-server.service: Consumed 1h 2min 29.664s CPU time.

Happy to dig in and find more detail, but I am pretty new to k3s (and k8s). I would suggest leaving a cluster running for a bit and seeing if this also happens to you. My terraform.tfvars is pretty simple:

# You need to replace these
hcloud_token = "xxxx"
public_key   = "./xxx-kube.pub"
# Must be "private_key = null" when you want to use ssh-agent, for a Yubikey like device auth or an SSH key-pair with passphrase
private_key  = "./xxx-kube"


# These can be customized, or left with the default values
# For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
# For Hetzner server types see https://www.hetzner.com/cloud
location                  = "ash" # change to `ash` for us-east Ashburn, Virginia location
network_region            = "us-east" # change to `us-east` if location is ash
agent_server_type         = "cpx31"
control_plane_server_type = "cpx11"
lb_server_type            = "lb11"

# At least 3 server nodes is recommended for HA, otherwise you need to turn off automatic upgrade (see ReadMe).
servers_num               = 3

# For agent nodes, at least 2 is recommended for HA, but you can keep automatic upgrades.
agents_num                = 3

# If you want to use a specific Hetzner CCM and CSI version, set them below, otherwise leave as is for the latest versions
# hetzner_ccm_version = ""
# hetzner_csi_version = ""

# If you want to allow non-control-plane workloads to run on the control-plane nodes set "true" below. The default is "false".
# allow_scheduling_on_control_plane = true

As always, thanks for the help! I hope this is just something with my setup, and not universal, but I thought I should report it now that I've seen it happen twice.

k3s failed to start, see journalctl -u k3s

[Fixed] The error k3s failed to start, see journalctl -u k3s sometimes happens on first_control_plane when the eth1 network interface is not present. The bug is rare enough, and we believe it originates on Hetzner's side, at random.

If it happens, destroy and re-apply with terraform.
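If you want to detect the symptom before blaming k3s itself, a minimal sketch of an interface check is below. The interface name (eth1) comes from the description above; the timeout and the idea of polling `/sys/class/net` are assumptions, not part of the module.

```shell
#!/bin/sh
# Sketch: wait for a network interface to appear, as in the missing-eth1
# case described above. Timeout and polling approach are assumptions.
wait_for_iface() {
  iface="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # /sys/class/net/<iface> exists once the kernel has registered the interface
    [ -e "/sys/class/net/$iface" ] && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# On a node you would check eth1; "lo" is used here only as a demo.
wait_for_iface lo 5 && echo "interface present"
```

If `wait_for_iface eth1 30` fails on first_control_plane, the destroy-and-re-apply advice above applies.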

Remove SSH password auth

We need to remove SSH password auth, ideally through ignition or, if that is not possible, through combustion. We should also do some basic hardening of the service if needed.

First and foremost, we need to find the location of the SSH config file.
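As a sketch of what the combustion/ignition step could write, assuming the image's sshd honors an `/etc/ssh/sshd_config.d` drop-in directory (an assumption until the config-file investigation above is done):

```shell
#!/bin/sh
# Sketch: write an sshd drop-in that disables password authentication.
# The drop-in directory is an assumption; adjust to wherever the SSH
# config investigation lands.
write_sshd_hardening() {
  dir="$1"
  mkdir -p "$dir"
  printf '%s\n' \
    'PasswordAuthentication no' \
    'KbdInteractiveAuthentication no' \
    > "$dir/50-disable-password-auth.conf"
}

# In a real combustion script: write_sshd_hardening /etc/ssh/sshd_config.d
write_sshd_hardening "$(mktemp -d)"
```

A drop-in file keeps the change atomic and easy to roll back, which fits MicroOS's transactional-update model better than editing sshd_config in place.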

Error: hcloud/setRescue

I recently contacted Hetzner to increase my limits so I could deploy 3 master nodes and 3 worker nodes. After the limit increase I ran terraform, but the script exited with an unknown error:

╷
│ Error: hcloud/setRescue: hcclient/WaitForActions: action 382332309 failed: Unknown Error (unknown_error)
│ 
│   with hcloud_server.control_planes[0],
│   on servers.tf line 1, in resource "hcloud_server" "control_planes":
│    1: resource "hcloud_server" "control_planes" {
│ 

here is my terraform.tfvars

# You need to replace these
hcloud_token = "my-token"
public_key   = "/home/user/.ssh/id_ed25519.pub"
# Must be "private_key = null" when you want to use ssh-agent, for a Yubikey like device auth or an SSH key-pair with passphrase
private_key  = "/home/user/.ssh/id_ed25519"

# These can be customized, or left with the default values
# For Hetzner locations see https://docs.hetzner.com/general/others/data-centers-and-connection/
# For Hetzner server types see https://www.hetzner.com/cloud
location                  = "fsn1" # change to `ash` for us-east Ashburn, Virginia location
network_region            = "eu-central" # change to `us-east` if location is ash
agent_server_type         = "cx41"
control_plane_server_type = "cx21"
lb_server_type            = "lb21"

# At least 3 server nodes are recommended for HA; otherwise you need to turn off automatic upgrades (see README).
servers_num               = 3

# For agent nodes, at least 2 is recommended for HA, but you can keep automatic upgrades.
agents_num                = 3

# If you want to use a specific Hetzner CCM and CSI version, set them below, otherwise leave as is for the latest versions
# hetzner_ccm_version = ""
# hetzner_csi_version = ""

# If you want to kustomize the Hetzner CCM and CSI containers with the "latest" tags and imagePullPolicy Always,
# so that they update automatically when the nodes themselves get updated via the Rancher system upgrade controller; the default is "false".
# If you keep the default of "false", you can always use ArgoCD to monitor the CSI and CCM manifests for new releases;
# that is probably the more "vanilla" option to keep these components always updated.
# hetzner_ccm_containers_latest = true
# hetzner_csi_containers_latest = true

# If you want to use Let's Encrypt with the TLS challenge, set the values below; the email address is used to send you certificate expiration notices
traefik_acme_tls = true
traefik_acme_email = "my-email"

# If you want to allow non-control-plane workloads to run on the control-plane nodes set "true" below. The default is "false".
# allow_scheduling_on_control_plane = true

I have not edited any other file.

Latest changes break updates

Running terraform apply on an already existing cluster fails because the private IPs change:
command:
terraform apply -var-file=prod.tfvars -var hcloud_token="REDACTED"

error:

module.kubernetes.hcloud_server.first_control_plane: Modifying... [id=18032205]
╷
│ Error: hcloud/updateServerInlineNetworkAttachments: hcloud/inlineAttachServerToNetwork: attach server to network: provided IP is not available (ip_not_available)
│ 
│   with module.kubernetes.hcloud_server.first_control_plane,
│   on .terraform/modules/kubernetes/master.tf line 1, in resource "hcloud_server" "first_control_plane":
│    1: resource "hcloud_server" "first_control_plane" {
│ 

plan output:

$ terraform plan -var-file=prod.tfvars -var hcloud_token="REDACTED"
module.kubernetes.hcloud_ssh_key.k3s: Refreshing state... [id=5585976]
module.kubernetes.random_password.k3s_token: Refreshing state... [id=none]
module.kubernetes.hcloud_network.k3s: Refreshing state... [id=REDACTED]
module.kubernetes.hcloud_placement_group.k3s: Refreshing state... [id=22200]
module.kubernetes.hcloud_firewall.k3s: Refreshing state... [id=303856]
module.kubernetes.hcloud_network_subnet.k3s: Refreshing state... [id=REDACTED-10.0.0.0/16]
module.kubernetes.hcloud_server.first_control_plane: Refreshing state... [id=18032205]
module.kubernetes.hcloud_server.control_planes[0]: Refreshing state... [id=18036735]
module.kubernetes.hcloud_server.agents[2]: Refreshing state... [id=18076430]
module.kubernetes.hcloud_server.agents[0]: Refreshing state... [id=18032235]
module.kubernetes.hcloud_server.agents[1]: Refreshing state... [id=18036736]
module.kubernetes.hcloud_server.control_planes[1]: Refreshing state... [id=18032231]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place
 <= read (data resources)

Terraform will perform the following actions:

  # module.kubernetes.data.remote_file.kubeconfig will be read during apply
  # (config refers to values not yet known)
 <= data "remote_file" "kubeconfig"  {
      + content = (known after apply)
      + id      = (known after apply)
      + path    = "/etc/rancher/k3s/k3s.yaml"

      + conn {
          + agent = (sensitive)
          + host  = "REDACTED"
          + port  = 22
          + user  = "root"
        }
    }

  # module.kubernetes.hcloud_server.agents[0] will be updated in-place
  ~ resource "hcloud_server" "agents" {
        id                 = "18032235"
        name               = "k3s-agent-0"
        # (17 unchanged attributes hidden)

      - network {
          - alias_ips   = [] -> null
          - ip          = "10.0.1.1" -> null
          - mac_address = "86:00:00:04:6a:7d" -> null
          - network_id  = REDACTED -> null
        }
      + network {
          + alias_ips   = []
          + ip          = "10.0.2.1"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.hcloud_server.agents[1] will be updated in-place
  ~ resource "hcloud_server" "agents" {
        id                 = "18036736"
        name               = "k3s-agent-1"
        # (17 unchanged attributes hidden)

      - network {
          - alias_ips   = [] -> null
          - ip          = "10.0.1.2" -> null
          - mac_address = "86:00:00:04:70:79" -> null
          - network_id  = REDACTED -> null
        }
      + network {
          + alias_ips   = []
          + ip          = "10.0.2.2"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.hcloud_server.agents[2] will be updated in-place
  ~ resource "hcloud_server" "agents" {
        id                 = "18076430"
        name               = "k3s-agent-2"
        # (17 unchanged attributes hidden)

      - network {
          - alias_ips   = [] -> null
          - ip          = "10.0.1.3" -> null
          - mac_address = "86:00:00:04:9c:3f" -> null
          - network_id  = REDACTED -> null
        }
      + network {
          + alias_ips   = []
          + ip          = "10.0.2.3"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.hcloud_server.control_planes[0] will be updated in-place
  ~ resource "hcloud_server" "control_planes" {
        id                 = "18036735"
        name               = "k3s-control-plane-1"
        # (17 unchanged attributes hidden)

      - network {
          - alias_ips   = [] -> null
          - ip          = "10.0.0.3" -> null
          - mac_address = "86:00:00:04:70:78" -> null
          - network_id  = REDACTED -> null
        }
      + network {
          + alias_ips   = []
          + ip          = "10.0.1.2"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.hcloud_server.control_planes[1] will be updated in-place
  ~ resource "hcloud_server" "control_planes" {
        id                 = "18032231"
        name               = "k3s-control-plane-2"
        # (17 unchanged attributes hidden)

      - network {
          - alias_ips   = [] -> null
          - ip          = "10.0.0.4" -> null
          - mac_address = "86:00:00:04:6a:7a" -> null
          - network_id  = REDACTED -> null
        }
      + network {
          + alias_ips   = []
          + ip          = "10.0.1.3"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.hcloud_server.first_control_plane will be updated in-place
  ~ resource "hcloud_server" "first_control_plane" {
        id                 = "18032205"
        name               = "k3s-control-plane-0"
        # (17 unchanged attributes hidden)

      + network {
          + alias_ips   = []
          + ip          = "10.0.1.1"
          + mac_address = (known after apply)
          + network_id  = REDACTED
        }
    }

  # module.kubernetes.local_file.kubeconfig will be created
  + resource "local_file" "kubeconfig" {
      + directory_permission = "0777"
      + file_permission      = "600"
      + filename             = "kubeconfig.yaml"
      + id                   = (known after apply)
      + sensitive_content    = (sensitive value)
    }
    

Metrics-Server unable to scrape metrics

Hi,

the metrics-server in my cluster is unable to scrape metrics from nodes:

I0305 08:15:26.352996       1 serving.go:341] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0305 08:15:26.719418       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0305 08:15:26.719485       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0305 08:15:26.719421       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0305 08:15:26.719520       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0305 08:15:26.719496       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0305 08:15:26.719646       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0305 08:15:26.720159       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0305 08:15:26.720267       1 secure_serving.go:202] Serving securely on :4443
I0305 08:15:26.720356       1 tlsconfig.go:240] Starting DynamicServingCertificateController
E0305 08:15:26.723158       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.2.0.1:10250/stats/summary?only_cpu_and_memory=true\": x509: certificate is valid for 127.0.0.1, 88.198.105.71, not 10.2.0.1" node="agent-big-0"
I0305 08:15:26.820176       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController 
I0305 08:15:26.820185       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file 
I0305 08:15:26.820191       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
I0305 08:15:27.267832       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:27.636114       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:28.269638       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:29.635527       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:31.635288       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:33.636699       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:35.635678       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:37.636936       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:39.635200       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
I0305 08:15:41.635422       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
E0305 08:15:41.711552       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.2.0.1:10250/stats/summary?only_cpu_and_memory=true\": x509: certificate is valid for 127.0.0.1, 88.198.105.71, not 10.2.0.1" node="agent-big-0"
E0305 08:15:56.722408       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.2.0.1:10250/stats/summary?only_cpu_and_memory=true\": x509: certificate is valid for 127.0.0.1, 88.198.105.71, not 10.2.0.1" node="agent-big-0"
E0305 08:16:11.698713       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.2.0.1:10250/stats/summary?only_cpu_and_memory=true\": x509: certificate is valid for 127.0.0.1, 88.198.105.71, not 10.2.0.1" node="agent-big-0"
E0305 08:16:26.707787       1 scraper.go:139] "Failed to scrape node" err="Get \"https://10.2.0.1:10250/stats/summary?only_cpu_and_memory=true\": x509: certificate is valid for 127.0.0.1, 88.198.105.71, not 10.2.0.1" node="agent-big-0"

My cluster only consists of that one big agent and 3 control nodes. Any idea what's happening here?
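The x509 error above means the kubelet's serving certificate only lists 127.0.0.1 and the public IP in its subjectAltName, not the private 10.2.0.1 that metrics-server dials. You can reproduce the SAN check locally with a throwaway certificate whose SANs mirror the error message (this is a diagnostic sketch, not the fix):

```shell
#!/bin/sh
# Generate a throwaway cert with the same SANs as in the error message,
# then list them - the private node IP is absent, which is exactly what
# metrics-server complains about.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" \
  -subj "/CN=agent-big-0" \
  -addext "subjectAltName=IP:127.0.0.1,IP:88.198.105.71" 2>/dev/null
openssl x509 -in "$tmp/cert.pem" -noout -ext subjectAltName
```

Against a live node, the equivalent check would be `echo | openssl s_client -connect <node-ip>:10250 | openssl x509 -noout -ext subjectAltName`, which shows whether the kubelet cert covers the IP the scraper is using.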

Move to k3s in binary form

@mnencia @phaer Had a very interesting conversation with Richard Brown; he says no RPM is needed and that the btrfs sub-volumes are writable, so we can just swap the binary, and voilà!

So we can go back to vanilla MicroOS and just use the k3s binaries from https://github.com/k3s-io/k3s/releases, as is. Maybe have a timer that checks for a new release; if there is one, touch /var/run/reboot-required, Kured drains the node and reboots it, and on reboot a small script does the swap :)
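The timer logic sketched above could look roughly like this; the version strings and sentinel path are illustrative, and fetching the latest release is deliberately stubbed out (in practice it would come from the k3s release channel, and the current version from `k3s --version`):

```shell
#!/bin/sh
# Sketch of the release-check timer: compare the installed k3s version with
# the latest release and touch the Kured sentinel when they differ.
# Both versions are passed in as arguments so the logic stays testable.
maybe_request_reboot() {
  current="$1"
  latest="$2"
  sentinel="$3"
  if [ -n "$latest" ] && [ "$current" != "$latest" ]; then
    touch "$sentinel"   # Kured watches this file, then drains and reboots the node
  fi
}

demo_dir=$(mktemp -d)
maybe_request_reboot "v1.22.2+k3s1" "v1.22.3+k3s1" "$demo_dir/reboot-required"
[ -f "$demo_dir/reboot-required" ] && echo "reboot requested"
```

Keeping the comparison pure (no network calls inside the function) makes it easy to drive from a systemd timer that fetches the latest version separately.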

Let's see - I will try to give it a shot this weekend, but please do not hesitate if you feel inspired.

Also, welcome to the team if you'll accept; I just sent you the invitation :) 🍾

Error: hcloud/inlineAttachServerToNetwork: attach server to network: no subnet or IP available (service_error)

Hi - thanks so much for this project. When I attempt to deploy to the ash region following the instructions, I get this after running the terraform apply -auto-approve command:

Error: hcloud/inlineAttachServerToNetwork: attach server to network: no subnet or IP available (service_error)
   with hcloud_server.first_control_plane,
   on master.tf line 1, in resource "hcloud_server" "first_control_plane":
    1: resource "hcloud_server" "first_control_plane" {

Stuck on Waiting for load-balancer to get an IP

I am trying to create a small cluster with 1 control plane and 2 agents. I already increased the timeout of the bash script to 500 seconds and still hit the issue. I also tried creating a new project and generating a new API token, with the same result.

Here's the loop output:

null_resource.first_control_plane: Still creating... [10m50s elapsed]
null_resource.first_control_plane (remote-exec): Waiting for load-balancer to get an IP...
null_resource.first_control_plane (remote-exec): Waiting for load-balancer to get an IP... 
null_resource.first_control_plane (remote-exec): Waiting for load-balancer to get an IP...

Here is my vars file:

location                  = "fsn1"
network_region            = "eu-central" 
agent_server_type         = "cpx21"
control_plane_server_type = "cpx11"
lb_server_type            = "lb11"
servers_num               = 1
agents_num                = 2

'sleep' is not recognized as an internal or external command

After running terraform apply on the cluster, I get this:

╷
│ Error: local-exec provisioner error
│
│ with hcloud_server.first_control_plane,
│ on master.tf line 44, in resource "hcloud_server" "first_control_plane":
│ 44: provisioner "local-exec" {
│
│ Error running command 'sleep 60 && ping 138.201.89.68 | grep --line-buffered "bytes from" | head -1 && sleep 100 &&
│ scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ./keys/id_rsa
│ [email protected]:/etc/rancher/k3s/k3s.yaml ./kubeconfig.yaml
│ sed -i -e 's/127.0.0.1/138.201.89.68/g' ./kubeconfig.yaml
│ ': exit status 1. Output: 'sleep' is not recognized as an internal or external command,
│ operable program or batch file.

I'm trying to access the server, but I have another problem: I generated keys using PuTTYgen and I can't connect to the control plane. Does anyone know how to export a key from PuTTYgen in the proper format? I'm using Windows 10.
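Two common ways out on Windows 10, hedged as suggestions rather than project-endorsed steps: convert the PuTTY-format key to OpenSSH format with puttygen's command line, or generate an OpenSSH-format pair directly with ssh-keygen (which ships with the Windows 10 OpenSSH client and with Git Bash) and point `public_key`/`private_key` at it. The `xxx-kube` file names are just placeholders.

```shell
# Option A (puttygen CLI, where available): convert an existing .ppk
# to OpenSSH private-key format.
#   puttygen xxx-kube.ppk -O private-openssh -o xxx-kube

# Option B: generate a fresh OpenSSH key pair; a temp dir is used here
# purely as a demo location.
demo_dir=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -f "$demo_dir/xxx-kube" -q
ls "$demo_dir"   # xxx-kube (private key) and xxx-kube.pub (public key)
```

Either way, the private key must be in OpenSSH format (not .ppk) for the terraform SSH provisioners to use it.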

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses.

This repository currently has no open or pending branches.



Fix for hcloud csi crashloopbackoff

Just apply the following command:

kubectl apply -f https://raw.githubusercontent.com/hetznercloud/csi-driver/v1.6.0/deploy/kubernetes/hcloud-csi.yml

Migration from k3os to openSUSE MicroOS

Recently Rancher, the creators of k3s and k3os, has been bought by SUSE. In doing so, they've dropped official support for k3os (k3s, on the other hand, is thriving and has been separated from Rancher).

I went on to contact Jacob Blain Christen, the lead maintainer of k3os, and he told me that he'll continue to do releases on the weekends and that the project could live on if the community maintains it.

However, that is not a stable backing for this project, so I did my own research and concluded that openSUSE MicroOS has HUGE backing, as it piggybacks on Tumbleweed, a major openSUSE distro, and has stable, automated transactional updates. As such, it's now the best OS to replace k3os.
