banzaicloud / pke
PKE is an extremely simple CNCF certified Kubernetes installer and distribution, designed to work on any cloud, VM or bare metal.
License: Apache License 2.0
Describe the bug
The Cilium components and CoreDNS don't start after spinning up the Vagrant box with the CentOS 8 operating system.
Steps to reproduce the issue:
I slightly reworked the centos8-single.sh script:
#!/bin/bash -e
# build latest pke tool
#GOOS=linux make pke
KUBERNETES_VERSION="${1:-v1.18.9}"
export VAGRANT_VAGRANTFILE=Vagrantfile-centos8
vagrant up centos1
vagrant ssh centos1 -c "sudo yum install -y ca-certificates"
vagrant ssh centos1 -c "sudo update-ca-trust enable"
vagrant ssh centos1 -c "sudo cp /vagrant/crt/* /etc/pki/ca-trust/source/anchors/"
vagrant ssh centos1 -c "sudo update-ca-trust extract"
vagrant ssh centos1 -c "sudo curl -vL https://banzaicloud.com/downloads/pke/latest -o /usr/local/bin/pke"
vagrant ssh centos1 -c "sudo chmod +x /usr/local/bin/pke"
vagrant ssh centos1 -c "sudo /scripts/pke-single.sh '$KUBERNETES_VERSION' '192.168.64.11:6443' containerd cilium"
vagrant ssh centos1 -c 'sudo cat /etc/kubernetes/admin.conf' > pke-single-config.yaml
export KUBECONFIG=$PWD/pke-single-config.yaml
echo ""
echo "You can access your PKE cluster either:"
echo "- from your host machine accessing the cluster via kubectl. Please run:"
echo "export KUBECONFIG=$PWD/pke-single-config.yaml"
echo ""
echo "- or starting a shell in the virtual machine. Please run:"
echo "vagrant ssh centos1 -c 'sudo -s'"
I commented out the build of the pke utility and instead downloaded it directly from the latest release.
After that I ran the script and waited for all Kubernetes components to be installed.
[root@centos1 ~]# kubectl get all --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/auto-approver-6cdf7bb44f-p886f 0/1 ContainerCreating 0 4m4s
kube-system pod/cilium-gt6ck 0/1 CrashLoopBackOff 4 4m4s
kube-system pod/cilium-operator-65d6bd4cf9-vrmh4 1/1 Running 0 4m4s
kube-system pod/coredns-66985b8c8d-prhmz 0/1 ContainerCreating 0 4m4s
kube-system pod/coredns-66985b8c8d-wxqnz 0/1 ContainerCreating 0 4m4s
kube-system pod/etcd-centos1 1/1 Running 0 4m19s
kube-system pod/kube-apiserver-centos1 1/1 Running 0 4m19s
kube-system pod/kube-controller-manager-centos1 1/1 Running 0 4m18s
kube-system pod/kube-proxy-7qjds 1/1 Running 0 4m4s
kube-system pod/kube-scheduler-centos1 1/1 Running 0 4m18s
kube-system pod/local-path-provisioner-7b69d654d-k8fl4 0/1 ContainerCreating 0 4m4s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.10.0.1 <none> 443/TCP 4m21s
kube-system service/kube-dns ClusterIP 10.10.0.10 <none> 53/UDP,53/TCP,9153/TCP 4m19s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/cilium 1 1 0 1 0 <none> 4m17s
kube-system daemonset.apps/kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 4m19s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/auto-approver 0/1 1 0 4m18s
kube-system deployment.apps/cilium-operator 1/1 1 1 4m17s
kube-system deployment.apps/coredns 0/2 2 0 4m19s
kube-system deployment.apps/local-path-provisioner 0/1 1 0 4m18s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/auto-approver-6cdf7bb44f 1 1 0 4m4s
kube-system replicaset.apps/cilium-operator-65d6bd4cf9 1 1 1 4m4s
kube-system replicaset.apps/coredns-66985b8c8d 2 2 0 4m4s
kube-system replicaset.apps/local-path-provisioner-7b69d654d 1 1 0 4m4s
[root@centos1 ~]# kubectl logs -n kube-system pod/cilium-gt6ck
Error from server: Get https://192.168.64.11:10250/containerLogs/kube-system/cilium-gt6ck/cilium-agent: remote error: tls: internal error
[root@centos1 containers]# cat cilium-gt6ck_kube-system_cilium-agent-514b0e54b301c8db3a37fbd497d2869d3c301c0e003508d4aa781563dbfeea33.log
...65
2020-10-19T17:00:32.431721961Z stderr F level=info msg="Annotating k8s node" subsys=daemon v4CiliumHostIP.IPv4=10.20.0.69 v4Prefix=10.20.0.0/24 v4healthIP.IPv4=10.20.0.210 v6CiliumHostIP.IPv6="<nil>" v6Prefix="<nil>" v6healthIP.IPv6="<nil>"
2020-10-19T17:00:32.431728564Z stderr F level=info msg="Initializing identity allocator" subsys=identity-cache
2020-10-19T17:00:32.431730942Z stderr F level=info msg="Cluster-ID is not specified, skipping ClusterMesh initialization" subsys=daemon
2020-10-19T17:00:32.431956193Z stderr F level=info msg="Adding local node to cluster" subsys=nodediscovery
2020-10-19T17:00:32.441482266Z stderr F level=info msg="Setting up base BPF datapath" subsys=daemon
2020-10-19T17:00:34.398110106Z stderr F level=info msg="Blacklisting local route as no-alloc" route=10.0.2.0/24 subsys=ipam
2020-10-19T17:00:34.398170967Z stderr F level=info msg="Blacklisting local route as no-alloc" route=192.168.64.0/24 subsys=ipam
2020-10-19T17:00:34.400694344Z stderr F level=error msg="Command execution failed" cmd="[iptables -w 5 -t filter -S]" error="exit status 3" subsys=iptables
2020-10-19T17:00:34.400705923Z stderr F level=warning msg="modprobe: ERROR: could not insert 'ip_tables': Exec format error" subsys=iptables
2020-10-19T17:00:34.400709297Z stderr F level=warning msg="iptables v1.6.1: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)" subsys=iptables
2020-10-19T17:00:34.40071334Z stderr F level=warning msg="Perhaps iptables or your kernel needs to be upgraded." subsys=iptables
2020-10-19T17:00:34.406168382Z stderr F level=error msg="Command execution failed" cmd="[iptables -w 5 -t filter -N CILIUM_TRANSIENT_FORWARD]" error="exit status 3" subsys=iptables
2020-10-19T17:00:34.406180964Z stderr F level=warning msg="modprobe: ERROR: could not insert 'ip_tables': Exec format error" subsys=iptables
2020-10-19T17:00:34.406183959Z stderr F level=warning msg="iptables v1.6.1: can't initialize iptables table `filter': Table does not exist (do you need to insmod?)" subsys=iptables
2020-10-19T17:00:34.406192082Z stderr F level=warning msg="Perhaps iptables or your kernel needs to be upgraded." subsys=iptables
2020-10-19T17:00:34.406194308Z stderr F level=error msg="Error while initializing daemon" error="cannot add custom chain CILIUM_TRANSIENT_FORWARD: exit status 3" subsys=daemon
2020-10-19T17:00:34.406196664Z stderr F level=fatal msg="Error while creating daemon" error="cannot add custom chain CILIUM_TRANSIENT_FORWARD: exit status 3" subsys=daemon
If I run yum update first, everything works fine.
Expected behavior
Everything works out of the box.
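As a workaround until the base box ships matching kernel modules, the script could update the guest before running pke. The modprobe "Exec format error" suggests the installed iptables kernel modules do not match the running kernel, which a yum update plus reboot would resolve. A rough sketch of the extra steps, assuming the same centos1 box and environment as the script above:
# Hypothetical workaround: update the guest and reboot so the kernel and
# its iptables modules match, before invoking pke-single.sh.
vagrant ssh centos1 -c "sudo yum update -y"
vagrant reload centos1
vagrant ssh centos1 -c "sudo modprobe ip_tables && lsmod | grep ip_tables"
vagrant ssh centos1 -c "sudo /scripts/pke-single.sh '$KUBERNETES_VERSION' '192.168.64.11:6443' containerd cilium"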
Create the following commands:
Is your feature request related to a problem? Please describe.
I want PKE to install Docker for me.
Describe the solution you'd like to see
Install Docker if it's not already installed on the machine.
Describe alternatives you've considered
The current alternative is to use images with pre-installed docker.
Additional context
While we prefer containerd, Docker is still a requirement in some cases (e.g. for GPU workloads).
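A rough sketch of what such a check could look like on an RPM-based host (the repo setup and package names are illustrative, not necessarily what pke would implement):
# Hypothetical detection/install step; Docker's upstream repo and package
# names are assumptions, not the proposed pke behaviour.
if ! command -v dockerd >/dev/null 2>&1; then
  yum install -y yum-utils
  yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
  yum install -y docker-ce docker-ce-cli containerd.io
  systemctl enable --now docker
fi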
Kubeadm does not retry on connection refused.
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: Get https://pke-x-nlb-3dfa0a01002de621.elb.eu-central-1.amazonaws.com:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: dial tcp 52.58.170.154:6443: connect: connection refused
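Until kubeadm itself retries, a wrapper could wait for the load balancer endpoint to answer before starting the join. A minimal sketch, assuming the endpoint from the error above; any HTTP response (even 403) is enough to prove reachability:
# Hypothetical pre-check: wait until the API endpoint accepts connections
# before letting kubeadm run its check-etcd phase.
ENDPOINT="https://pke-x-nlb-3dfa0a01002de621.elb.eu-central-1.amazonaws.com:6443"
for i in $(seq 1 30); do
  curl -sk --max-time 5 "$ENDPOINT/healthz" && break
  echo "API endpoint not reachable yet, retrying ($i/30)..."
  sleep 10
done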
Is your feature request related to a problem? Please describe.
Currently we install PKE "manually" by downloading the binary.
Describe the solution you'd like to see
Let's build RPM/DEB packages.
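One possible approach, sketched with the fpm packaging tool (illustrative only; the version, license metadata and paths are placeholders, not a decided packaging setup):
# Hypothetical packaging step using fpm; version and paths are placeholders.
VERSION=0.9.0
fpm -s dir -t rpm -n pke -v "$VERSION" --license "Apache-2.0" \
    ./build/pke=/usr/local/bin/pke
fpm -s dir -t deb -n pke -v "$VERSION" --license "Apache-2.0" \
    ./build/pke=/usr/local/bin/pke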
The ComponentStatus resource is deprecated as of Kubernetes 1.19: kubernetes/kubernetes#93570
We use it to check if the API server is available: https://github.com/banzaicloud/pke/blob/master/cmd/pke/app/phases/kubeadm/controlplane/controlplane.go#L1018
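A possible replacement is the API server's own health/readiness endpoints, which are not deprecated. A sketch of the equivalent check, not necessarily how pke will implement it:
# Check API server availability without ComponentStatus:
kubectl get --raw='/readyz?verbose'
# or, for older servers, the simpler health endpoint:
kubectl get --raw='/healthz'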
Support automatic keepalived installation/configuration
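For context, the kind of configuration this would automate is a VRRP virtual IP in front of the control plane. A minimal illustrative keepalived setup; the interface, router ID and VIP are placeholders:
# Illustrative only: a minimal keepalived VRRP config for an API server VIP.
cat <<'EOF' > /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.64.100
    }
}
EOF
systemctl enable --now keepalived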
Describe the bug
Using the ubuntu-single-docker.sh script, the installation fails on Ubuntu 20.04.
Steps to reproduce the issue:
Run ./ubuntu-single-docker.sh
Expected behavior
The installation should complete successfully.
Additional context
ubuntu-docker: E: Version '1.2.13-1' for 'containerd.io' was not found
ubuntu-docker: E: Version '5:19.03.8~3-0~ubuntu-focal' for 'docker-ce' was not found
ubuntu-docker: E: Version '5:19.03.8~3-0~ubuntu-focal' for 'docker-ce-cli' was not found
ubuntu-docker: /tmp/vagrant-shell: line 20: /etc/docker/daemon.json: No such file or directory
ubuntu-docker: Failed to restart docker.service: Unit docker.service not found.
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.
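The pinned versions appear not to exist in Docker's focal repository. A quick way to see which versions are actually available (illustrative commands, run inside the Ubuntu 20.04 guest; the version string below is only an example, pick one from the madison output):
# List the versions Docker's apt repo actually provides for focal,
# then install one of them instead of the hard-coded values.
apt-cache madison docker-ce docker-ce-cli containerd.io
sudo apt-get install -y \
    docker-ce=5:19.03.12~3-0~ubuntu-focal \
    docker-ce-cli=5:19.03.12~3-0~ubuntu-focal \
    containerd.io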
Currently only the built-in kubeadm versions are supported.
Support latest Ubuntu LTS (18.04) as a base operating system for PKE.
--platform=auto|ubuntu|rhel
Support for external etcd configuration
--etcd-prefix string
Describe the bug
Upgrade fails with:
err> [preflight] Some fatal errors occurred:
err> [ERROR CoreDNSUnsupportedPlugins]: there are unsupported plugins in the CoreDNS Corefile
Steps to reproduce the issue:
pke upgrade master --kubernetes-version=1.16.4
Expected behavior
Upgrade succeeds.
Additional context
See related issue in Kubernetes: kubernetes/kubernetes#82889
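Per the linked Kubernetes issue, the check typically trips on plugins that newer CoreDNS releases dropped (for example the old proxy plugin, replaced by forward). A way to inspect the Corefile before upgrading; the exact edit depends on what the preflight check reports:
# Show the Corefile kubeadm will migrate and look for deprecated plugins.
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# If it still uses the removed "proxy" plugin, edit the ConfigMap and
# replace it with "forward" before re-running the upgrade.
kubectl -n kube-system edit configmap coredns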
Skip installing Kubernetes and the container runtime based on the PKE config
See #98
Is your feature request related to a problem? Please describe.
I would like to be able to collect audit logs using the logging-operator.
Describe the solution you'd like to see
I would like to see the audit log written to a hostPath, so that the fluent-bit daemonset could read those logs from there as usual.
Describe alternatives you've considered
We could push the audit logs to a webhook that writes them to stdout, so they could be collected the usual way, but I believe that would be overkill.
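For reference, the upstream kube-apiserver flags that write the audit log to a file on the host look roughly like this; the paths and rotation settings are illustrative, and how pke wires them into the static pod manifest is the open question:
# Illustrative kube-apiserver audit flags (paths are examples):
#   --audit-policy-file=/etc/kubernetes/audit-policy.yaml
#   --audit-log-path=/var/log/kubernetes/audit/audit.log
#   --audit-log-maxage=30
#   --audit-log-maxbackup=10
#   --audit-log-maxsize=100
# A minimal policy file that logs request metadata only:
cat <<'EOF' > /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
EOF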
Update PKE to send the error output and the result of the bootstrapping process. Depends on the API endpoint added in banzaicloud/pipeline#3174
fixed in #52
Is your feature request related to a problem? Please describe.
When creating a cluster using a pre-cached image with PKE already installed, I don't want:
Describe the solution you'd like to see
Store some PKE configuration in images that PKE can use to configure the node (e.g. container runtime). This configuration could be created by a user data script on base images without PKE.
Is your feature request related to a problem? Please describe.
Installing PKE on CentOS 8/RHEL 8
This message indicates a configuration state and should not be logged as a warning:
[WARNING][pipeline-ready] Skipping phase due to missing Pipeline API endpoint credentials. missing pipeline-cluster-id: validation failed
Improvement idea (easier maintenance):
Looking at the code there's a bunch of inlined text, like this:
https://github.com/banzaicloud/pke/blob/master/cmd/pke/app/phases/kubeadm/node/kubeadm.go#L112
For easier maintenance you could consider embedding static assets instead of defining them inline as strings:
https://github.com/avelino/awesome-go#resource-embedding
Hyperkube was deprecated and no official images are shipped as of 1.19.
We don't seem to use it since 1.18, but it's worth noting, as it will be removed in the kubeadm v1beta3 API.
Some references:
If no cloud provider is specified, no default storage class is defined.
Describe the bug
The advertise address is wrong after the new kubeadm config has been generated.
Steps to reproduce the issue:
Run pke upgrade on a cluster that uses different addresses for advertise-address and control-plane-endpoint.
Expected behavior
Pke upgrade should run without error.
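A quick way to see which address the regenerated config ended up with (a sketch; on the affected versions the per-node advertise address is recorded in the kubeadm-config ConfigMap's ClusterStatus section):
# Inspect the advertise address recorded for this control plane node:
kubectl -n kube-system get configmap kubeadm-config -o yaml | grep -A2 -i advertise
# Compare it with the address the API server actually binds to:
grep advertise-address /etc/kubernetes/manifests/kube-apiserver.yaml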
Describe the bug
When you run banzai cp up --provider pke on a Linux box without Docker or containerd, the installer exits with an error message about ctr missing from the path.
Expected behavior
PKE installation starts, and all steps requiring ctr run only after it has been installed.
log:
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
[WARNING HTTPProxy]: Connection to "https://192.168.150.50" uses proxy "http://172.18.24.90:22222". If that is not intended, adjust your proxy settings
[WARNING HTTPProxyCIDR]: connection to "10.10.0.0/16" uses proxy "http://172.18.24.90:22222". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
[WARNING HTTPProxyCIDR]: connection to "10.20.0.0/16" uses proxy "http://172.18.24.90:22222". This may lead to malfunctional cluster setup. Make sure that Pod and Services IP ranges specified correctly as exceptions in proxy configuration
[WARNING Service-Kubelet]: kubelet service is not enabled, please run 'systemctl enable kubelet.service'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
/bin/kubeadm [init --config=/etc/kubernetes/kubeadm.conf] err: exit status 1 4m26.70353459s
Resetting kubeadm...
/bin/kubeadm [reset --force --cri-socket=unix:///run/containerd/containerd.sock]
[preflight] running pre-flight checks
[reset] Reading configuration from the cluster...
[reset] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[reset] stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] deleting contents of stateful directories: [/var/lib/etcd /var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually.
For example:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.
W0319 18:46:33.298246 6055 reset.go:213] [reset] Unable to fetch the kubeadm-config ConfigMap, using etcd pod spec as fallback: failed to get config map: Get https://192.168.150.50:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: net/http: TLS handshake timeout
/bin/kubeadm [reset --force --cri-socket=unix:///run/containerd/containerd.sock] err: <nil> 10.096034565s
Error: exit status 1
[root@kubesphere kubernetes]# systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─0-containerd.conf, 10-kubeadm.conf
Active: active (running) since Tue 2019-03-19 18:58:20 CST; 1s ago
Docs: https://kubernetes.io/docs/
Main PID: 7246 (kubelet)
Tasks: 18
Memory: 26.0M
CGroup: /system.slice/kubelet.service
└─7246 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.457124 7246 kubelet.go:2266] node "kubesphere" not found
Mar 19 18:58:21 kubesphere kubelet[7246]: I0319 18:58:21.459905 7246 kubelet_node_status.go:278] Setting node annotation to enable volume controller attach/detach
Mar 19 18:58:21 kubesphere kubelet[7246]: I0319 18:58:21.461490 7246 kubelet_node_status.go:72] Attempting to register node kubesphere
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.461995 7246 kubelet_node_status.go:94] Unable to register node "kubesphere" with API server: Post https://192.168.150.50:6443/api/v1/nodes: dial tcp 192.168.150.50:6443: connect: connection refused
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.557409 7246 kubelet.go:2266] node "kubesphere" not found
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.657547 7246 kubelet.go:2266] node "kubesphere" not found
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.746162 7246 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://192.168.150.50:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 192.168.150.50:6443: connect: connection refused
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.746904 7246 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://192.168.150.50:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dkubesphere&limit=500&resourceVersion=0: dial tcp 192.168.150.50:6443: connect: connection refused
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.747999 7246 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://192.168.150.50:6443/api/v1/nodes?fieldSelector=metadata.name%3Dkubesphere&limit=500&resourceVersion=0: dial tcp 192.168.150.50:6443: connect: connection refused
Mar 19 18:58:21 kubesphere kubelet[7246]: E0319 18:58:21.757721 7246 kubelet.go:2266] node "kubesphere" not found
Is your feature request related to a problem? Please describe.
Sometimes we need to start a K8s cluster without the DenyEscalatingExec admission plugin.
Describe the solution you'd like to see
Add a --deny-escalating-exec flag to control whether the admission plugin is enabled during install.
Describe alternatives you've considered
We can SSH into the master node and edit the API server manifest.
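For reference, that manual alternative boils down to adjusting the admission plugin list in the API server's static pod manifest, roughly like this (the plugin list shown is illustrative):
# On the master node, edit the static pod manifest and adjust the flag,
# e.g. to add (or drop) DenyEscalatingExec; the kubelet restarts the pod.
sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml
#   --enable-admission-plugins=NodeRestriction,DenyEscalatingExec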
Do we still use it?
Is your feature request related to a problem? Please describe.
Support Kubernetes version 1.20
The Kubernetes API server fails to start on additional master nodes because the admission-control configuration is missing.
Is your feature request related to a problem? Please describe.
Implement a test for the K8s upgrade flow.
Describe the solution you'd like to see
It should run as a CircleCI job.
This issue was automatically created by Allstar.
Security Policy Violation
Dismiss stale reviews not configured for branch master
This issue will auto resolve when the policy is in compliance.
Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.