mbert / kubeadm2ha

A set of scripts and documentation for adding redundancy (etcd cluster, multiple masters) to a cluster set up with kubeadm 1.8 and above

License: Apache License 2.0

Shell 54.06% Jinja 45.94%

kubeadm2ha's Issues

20-etcd-service-manager.conf -> missing cgroup-driver

20-etcd-service-manager.conf is missing the "--cgroup-driver=systemd" flag.
FYI: starting with 1.11, kubeadm handles this setting automatically, but this step runs before kubeadm is called.

"kubeadm now detects the Docker cgroup driver and starts the kubelet with the matching driver. This eliminates a common error experienced by new users in when the Docker cgroup driver is not the same as the one set for the kubelet due to different Linux distributions setting different cgroup drivers for Docker, making it hard to start the kubelet properly. (#64347, @neolit123)"

The prepare-nodes cgroup driver part does not apply to this because:
a) 20-etcd-service-manager.conf overrides the 10-kubeadm.conf
b) the prepare-nodes code won't do anything any longer, as 10-kubeadm.conf no longer holds the "cgroup-driver" string.
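
A minimal sketch of what the fix could look like, assuming the drop-in follows the layout used in the kubeadm external-etcd guide. The function name is hypothetical, and the kubelet flags other than --cgroup-driver are assumptions rather than a copy of the repository's template:

```shell
#!/bin/sh
# Sketch only: prints a kubelet drop-in that pins the cgroup driver.
# ASSUMPTION: all flags except --cgroup-driver are illustrative and should be
# replaced with whatever the role's 20-etcd-service-manager.conf already uses.
render_etcd_service_manager_conf() {
    driver="${1:-systemd}"   # defaults to systemd, as suggested in the issue
    cat <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/kubelet --address=127.0.0.1 --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=${driver}
Restart=always
EOF
}

# Example (as root): use the driver Docker reports, then install the drop-in.
#   driver="$(docker info --format '{{.CgroupDriver}}')"
#   render_etcd_service_manager_conf "$driver" \
#     > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
#   systemctl daemon-reload && systemctl restart kubelet
```

Passing the driver as an argument instead of hard-coding it keeps the drop-in consistent with Docker on distributions that use cgroupfs.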

bridge-nf-call-iptables - suggestion

"/root/join-worker-node.sh" - FATAL ERROR
(ignorable, as it should be done as a prereq maybe)
[ERROR FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1
echo "1" > /proc/sys/net/bridge/bridge-nf-call-iptables
This fixes the running system, but the setting should also be made persistent across reboots.
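
One way to persist the setting could look like the following sketch. The function name and the file name 99-bridge-nf.conf are hypothetical, and it assumes a distribution that reads /etc/sysctl.d at boot:

```shell
#!/bin/sh
# Sketch only: writes a sysctl drop-in so the bridge setting survives reboots.
# ASSUMPTION: the target directory and file name are illustrative.
persist_bridge_nf() {
    conf_dir="${1:-/etc/sysctl.d}"
    printf '%s\n' 'net.bridge.bridge-nf-call-iptables = 1' \
        > "${conf_dir}/99-bridge-nf.conf"
}

# Example (as root):
#   modprobe br_netfilter    # the /proc entry only exists once this module is loaded
#   persist_bridge_nf
#   sysctl --system          # reload all sysctl configuration files
```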

No package matching 'nginx-1.12.2' found available, installed or updated

I am testing your scripts out and getting the following error:

TASK [nginx : Install nginx via package manager] ***********************************************************************************************************************************************************
fatal: [my-cluster-master-1]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'nginx-1.12.2' found available, installed or updated", "rc": 126, "results": ["No package matching 'nginx-1.12.2' found available, installed or updated"]}
fatal: [my-cluster-master-2]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'nginx-1.12.2' found available, installed or updated", "rc": 126, "results": ["No package matching 'nginx-1.12.2' found available, installed or updated"]}
fatal: [my-cluster-master-3]: FAILED! => {"changed": false, "failed": true, "msg": "No package matching 'nginx-1.12.2' found available, installed or updated", "rc": 126, "results": ["No package matching 'nginx-1.12.2' found available, installed or updated"]}

Role: ha-settings improvements

You are doing some steps twice, which is not necessary (it looks like a copy-paste error to me):

ha-settings/tasks/main.yaml

Get current kube-proxy settings
Edit current kube-proxy settings to use the virtual IP instead of the host IP
Apply edited kube-proxy settings
Force restart of all kube-proxy pods

etcd? yaml to json?

● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2018-03-01 18:56:36 UTC; 26min ago
  Process: 13849 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/bin/etcd --name="${ETCD_NAME}" --data-dir="${ETCD_DATA_DIR}" --listen-client-urls="${ETCD_LISTEN_CLIENT_URLS}" (code=exited, status=1/FAILURE)
 Main PID: 13849 (code=exited, status=1/FAILURE)

Mar 01 18:56:35 localhost.localdomain systemd[1]: etcd.service: main process exited, code=exited, status...LURE
Mar 01 18:56:35 localhost.localdomain systemd[1]: Failed to start Etcd Server.
Mar 01 18:56:35 localhost.localdomain systemd[1]: Unit etcd.service entered failed state.
Mar 01 18:56:35 localhost.localdomain systemd[1]: etcd.service failed.
Mar 01 18:56:36 localhost.localdomain systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 01 18:56:36 localhost.localdomain systemd[1]: start request repeated too quickly for etcd.service
Mar 01 18:56:36 localhost.localdomain systemd[1]: Failed to start Etcd Server.
Mar 01 18:56:36 localhost.localdomain systemd[1]: Unit etcd.service entered failed state.
Mar 01 18:56:36 localhost.localdomain systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

but if I log in directly to the host and su to the etcd user:

-bash-4.2$ etcd --config-file=/etc/etcd/etcd.conf
2018-03-01 19:22:21.671961 I | etcdmain: Loading server configuration from "/etc/etcd/etcd.conf"
2018-03-01 19:22:21.672261 E | etcdmain: error verifying flags, error converting YAML to JSON: yaml: line 7: did not find expected <document start>. See 'etcd --help'.

etcd is still 3.2.7

etcd --version
etcd Version: 3.2.7
Git SHA: bb66589
Go Version: go1.8.3
Go OS/Arch: linux/amd64

I see this as an example:
https://github.com/coreos/etcd/blob/master/etcd.conf.yml.sample

Am I missing something? Should this be an environment file instead?

kubeadm token create --print-join-command - suggestion

I noticed there is a nice option, "--print-join-command", which prints the complete join command:
kubeadm token create --print-join-command
I0914 14:11:14.396695 3948 feature_gate.go:230] feature gates: &{map[]}
kubeadm join 10.1.3.2:6443 --token aaaaaa.bbbbbbbbbbbbbbbb --discovery-token-ca-cert-hash sha256:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

You may want to look at it for the join-token role, which is quite complicated now with the openssl option.
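
As a rough illustration of how the join-token role might use this, the sketch below filters the command's output down to the join line. The helper name is hypothetical, and the target path in the usage comment is only borrowed from the other issues here:

```shell
#!/bin/sh
# Sketch only: reads the output of "kubeadm token create --print-join-command"
# on stdin and keeps the first line that starts with "kubeadm join", dropping
# log lines such as "I0914 ... feature_gate.go:230] ...".
extract_join_command() {
    grep -E '^kubeadm join ' | head -n 1
}

# Example (on a master, as root):
#   kubeadm token create --print-join-command 2>/dev/null \
#     | extract_join_command > /root/join-worker-node.sh
```

This would replace the separate token and openssl hash steps with a single command, at the cost of requiring a kubeadm version that supports the option.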

SSL for etcd works only if etcd instances run on the master hosts

The SSL certificates are copied to the etcd hosts, which can, but need not, be the same as the master hosts. If the etcd hosts are separate, the client certificates will therefore be unavailable to the K8s cluster.

inventory -> missing "," after NGINX

inventory -> there is no var NGINX_TAG, only NGINX_VERSION

etcd role works only from root

If the playbook is run without root, the etcd role fails due to a lack of permissions when it tries to copy the certs from localhost to all etcd machines:

kubeadm2ha/ansible/roles/etcd/tasks/main.yaml

Line 64 in a89c719

- name: Copy certs to all etcd nodes

The reason: after unarchiving, all files are owned by root (as they were on the primary etcd host), so a non-root user on the control machine (local_action) cannot read them.

Ugly workaround: just add mode=0755 to the unarchive local_action:

kubeadm2ha/ansible/roles/etcd/tasks/main.yaml

Line 60 in a89c719

- name: Unarchive certificates on localhost...
Ideally, split the certs.tar.gz archive into two archives: one for kubeadmcfg.yaml and one for the certificates.
With two archives there is no need to unarchive on the control machine; Ansible can transfer the archive from the control machine to the destination and unarchive it there with the right permissions.

modprobe ip_vs - suggestion

"/root/join-worker-node.sh" - WARNING
(ignorable, as it should be done as a prereq maybe)
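[WARNING RequiredIPVSKernelModulesAvailable]: the IPVS proxier will not be used, because the following required kernel modules are not loaded: [ip_vs_wrr ip_vs_sh ip_vs ip_vs_rr]
Add:
modprobe -a ip_vs_wrr ip_vs_sh ip_vs ip_vs_rr
(-a is needed to load several modules in one call; without it, modprobe treats the extra names as module parameters.)
The modules should also be made persistent: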

- name: load ip_vs kernel modules
  modprobe: name={{ item }} state=present
  with_items:
  - ip_vs_wrr
  - ip_vs_rr
  - ip_vs_sh
  - ip_vs

- name: persist ip_vs kernel modules
  copy:
    dest: /etc/modules-load.d/ip_vs.conf
    content: |
      ip_vs_wrr
      ip_vs_rr
      ip_vs_sh
      ip_vs

swapoff - suggestion

prepare-nodes does not run swapoff or remove the swap entry from /etc/fstab (ignorable, as it should perhaps be done as a prerequisite).
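
A minimal sketch of such a step, assuming a conventional fstab layout where the filesystem type appears as a whitespace-delimited "swap" field. The function name is hypothetical, and the entry is commented out rather than deleted so the change stays reversible:

```shell
#!/bin/sh
# Sketch only: comments out every active fstab entry that mentions a
# whitespace-delimited "swap" field. A backup is kept as <file>.bak.
disable_swap_in_fstab() {
    fstab="${1:-/etc/fstab}"
    sed -E -i.bak \
        '/^[[:space:]]*[^#[:space:]].*[[:space:]]swap[[:space:]]/ s/^/#/' \
        "$fstab"
}

# Example (as root):
#   swapoff -a               # disable swap for the running system
#   disable_swap_in_fstab    # keep it disabled after a reboot
```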

kube component can not do `Watch` when apiserver is set to VIP:PORT?

hi (:
The default load-balancing strategy of nginx is round-robin (rr), so when a pod (something like kube-proxy) performs a Watch, it prints a lot of warning log messages like:

W0301 02:10:52.929987       1 reflector.go:341] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: watch of *core.Service ended with: very short watch: k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Unexpected watch close - watch lasted less than a second and no items received

How should this issue be dealt with? Or can it just be ignored?

Different kubeadm-init Configs

Hi,

Why are you using different kubeadm-init.yaml.j2 template files for the master and the secondaries? I just ran your playbook, and "endpoint-reconciler-type" should always be set for the apiserver (not only on the secondaries but also on the master):

apiServerExtraArgs:
  {% if KUBERNETES_VERSION | match('^1\.8') %}apiserver-count: "{{ groups['masters'] | length }}"{% else %}endpoint-reconciler-type: "lease"{% endif %}

You should always use the "global" templates/kubeadm-init.yaml.j2 file:
template/kubeadm-init.yaml.j2

Join worker nodes MASTER_VIP

You should join all worker nodes (minions) via the MASTER_VIP and not via the primary master IP:

join-token/templates/join-worker-node.sh.j2

kubeadm join --token {{ TOKEN.stdout }} {{ MASTER_VIP }}:6443 --discovery-token-ca-cert-hash sha256:{{ HASH.stdout }}