
Comments (20)

gyliu513 avatar gyliu513 commented on June 17, 2024

@hemanthavs this may be a bug in kubeadm.

@morvencao I recall you also encountered this issue before; any comments on this? If this is a bug in kubeadm, we can open an issue there.

from cluster-api-provider-ibmcloud.

morvencao avatar morvencao commented on June 17, 2024

@gyliu513 Yes, I have encountered that, let me try to reproduce this and find the root cause.


jichenjc avatar jichenjc commented on June 17, 2024

projectcalico/calico#2040 seems to point to a kube-dns issue...


gyliu513 avatar gyliu513 commented on June 17, 2024

/assign @morvencao


jichenjc avatar jichenjc commented on June 17, 2024

I hit this again today...

root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig get pods --all-namespaces
NAMESPACE                  NAME                                          READY   STATUS             RESTARTS   AGE
ibmcloud-provider-system   ibmcloud-machine-controller-0                 1/1     Running            1          12m
kube-system                calico-node-9qmtb                             1/1     Running            0          12m
kube-system                calico-node-w96kf                             0/1     CrashLoopBackOff   5          7m42s
kube-system                coredns-fb8b8dccf-n2zkv                       1/1     Running            0          12m
kube-system                coredns-fb8b8dccf-sz85p                       1/1     Running            0          12m
kube-system                etcd-jichen-master-g2x5d                      1/1     Running            0          11m
kube-system                kube-apiserver-jichen-master-g2x5d            1/1     Running            0          11m
kube-system                kube-controller-manager-jichen-master-g2x5d   1/1     Running            0          11m
kube-system                kube-proxy-j9ll5                              1/1     Running            0          7m42s
kube-system                kube-proxy-zzlfn                              1/1     Running            0          12m
kube-system                kube-scheduler-jichen-master-g2x5d            1/1     Running            0          11m
kube-system                kubernetes-dashboard-5f7b999d65-ntrxb         1/1     Running            0          12m
system                     controller-manager-0                          1/1     Running            0          12m
root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig log calico-node-w96kf -n kube-system
Threshold time for bird readiness check:  30s
2019-05-23 09:53:34.707 [INFO][10] startup.go 256: Early log level set to info
2019-05-23 09:53:34.707 [INFO][10] startup.go 272: Using NODENAME environment for node name
2019-05-23 09:53:34.707 [INFO][10] startup.go 284: Determined node name: jichen-node-s9s2b
2019-05-23 09:53:34.709 [INFO][10] startup.go 316: Checking datastore connection
2019-05-23 09:54:04.709 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2019-05-23 09:54:35.710 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout

Maybe DNS has an issue: perhaps our create step didn't inject the user-data correctly, so cloud-init didn't set it up properly?


jichenjc avatar jichenjc commented on June 17, 2024

@hemanth-avs @xunpan @gyliu513 we need to give this high priority...

root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig  get nodes
NAME                  STATUS   ROLES    AGE     VERSION
jichen-master-g2x5d   Ready    master   22h     v1.14.0
jichen-node-12345     Ready    <none>   3h42m   v1.14.0
jichen-node-67890     Ready    <none>   44m     v1.14.0
jichen-node-s9s2b     Ready    <none>   22h     v1.14.0
name1                 Ready    <none>   5h48m   v1.14.0

I have the 1+4 cluster above (1 master plus 4 nodes), and

root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig  get pods -n kube-system
NAME                                          READY   STATUS             RESTARTS   AGE
calico-node-7zhdp                             0/1     CrashLoopBackOff   63         3h42m
calico-node-9qmtb                             1/1     Running            0          22h
calico-node-ggspb                             0/1     CrashLoopBackOff   97         5h48m
calico-node-w96kf                             0/1     Running            368        22h
calico-node-whb9c                             0/1     CrashLoopBackOff   15         45m

Apparently all the nodes have an issue with their calico settings.

I am not an expert in networking, especially on the calico side, but it looks like the new nodes have trouble connecting to the service IP 10.96.0.1, which is by default the kube-apiserver.
I logged on to one node and found it has this route:

root@jichen-node-67890:~# ip route
10.0.0.0/8 via 10.109.100.193 dev eth0 proto static

So all 10.x.x.x IPs use 10.109.100.193 as the gateway, and I don't know whether that causes a problem.

We will need to check the following potential causes:

  1. whether 10.x.x.x is reserved by IBM Cloud as private IP space, so that the route leads somewhere else
  2. whether a security group blocks the internal IP (less likely, because kubeadm join ... works)
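
To illustrate the first point, here is a minimal Python sketch of a longest-prefix route lookup. Only the 10.0.0.0/8 entry comes from the `ip route` output above; the default route, its gateway name, and the `pick_route` helper are hypothetical. It models a plain routing lookup only and ignores any NAT rules (such as those kube-proxy normally installs), so it shows where traffic to 10.96.0.1 would go if no rewrite applied.

```python
import ipaddress

# Route table as (destination, gateway) pairs. Only the 10.0.0.0/8 entry is
# taken from the node's `ip route` output; the default route and its gateway
# are hypothetical placeholders for illustration.
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "10.109.100.193"),
    (ipaddress.ip_network("0.0.0.0/0"), "hypothetical-default-gw"),
]


def pick_route(dest, routes=ROUTES):
    """Return the (network, gateway) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, gw) for net, gw in routes if addr in net]
    if not matches:
        raise LookupError(f"no route to {dest}")
    # The most specific route is the one with the largest prefix length.
    return max(matches, key=lambda item: item[0].prefixlen)


if __name__ == "__main__":
    net, gw = pick_route("10.96.0.1")
    print(f"10.96.0.1 -> {net} via {gw}")
```

Under these assumptions the service IP matches the 10.0.0.0/8 route and would be handed to the 10.109.100.193 gateway rather than reaching the API server.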


gyliu513 avatar gyliu513 commented on June 17, 2024

What is the log of those CrashLoopBackOff pods? Is this a bug in kubeadm? Does OpenStack work fine?


jichenjc avatar jichenjc commented on June 17, 2024

The log is pasted above. I don't think it's a kubeadm bug, because OpenStack seems to work fine based on my previous test results... @morvencao have you recently used OpenStack to create a cluster and hit this issue?


jichenjc avatar jichenjc commented on June 17, 2024

I ran this on the node:

iptables -t nat -A PREROUTING -d 10.96.0.1 -j DNAT --to-destination 10.109.100.204

calico-node-9qmtb 1/1 Running 2 3d18h
....
calico-node-whb9c 1/1 Running 1112 2d20h

That fixed the problem pod (whb9c), so maybe kube-proxy has a problem? The configuration needs further checking...


morvencao avatar morvencao commented on June 17, 2024

@jichenjc Sorry for the late response, I didn't have time to try this on provider openstack.

If you check the logs of the failed pod, you can see that the pod can't connect to the api-server:

root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kc -n kube-system logs -f calico-node-smfx8
Threshold time for bird readiness check:  30s
2019-05-27 11:14:48.306 [INFO][10] startup.go 256: Early log level set to info
2019-05-27 11:14:48.306 [INFO][10] startup.go 272: Using NODENAME environment for node name
2019-05-27 11:14:48.306 [INFO][10] startup.go 284: Determined node name: ibmcloud-node-jkb9p
2019-05-27 11:14:48.307 [INFO][10] startup.go 316: Checking datastore connection
2019-05-27 11:15:18.308 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2019-05-27 11:15:49.309 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout

It may be caused by Calico selecting the wrong interface, since each node has multiple interfaces.
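
The timeout in the log can be checked independently of Calico. The following is a minimal Python sketch, assuming Python 3 is available on the affected node; the `can_connect` helper is hypothetical and only tests whether a TCP handshake completes, not whether TLS or the API itself works.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Calling `can_connect("10.96.0.1", 443)` on a failing node and on the master would show whether the service IP is unreachable only from the workers.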


morvencao avatar morvencao commented on June 17, 2024

The calico node crashes appear to be caused by an overlap between the service CIDR and the host CIDR: the default service CIDR is 10.96.0.0/12, while the host IPs are 10.xx.xxx.xxx.
So I suggest making sure they do not overlap by changing our default service CIDR from 10.96.0.0/12 to 20.96.0.0/12 in PR: #191
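
The overlap can be confirmed with Python's standard `ipaddress` module. This is a minimal sketch: the route and the host address are the ones that appear in the `ip route` output and the DNAT rule earlier in this thread, and the `describe_overlap` helper is hypothetical.

```python
import ipaddress

SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")  # default service CIDR
HOST_ROUTE = ipaddress.ip_network("10.0.0.0/8")      # static route seen on the node
HOST_ADDR = ipaddress.ip_address("10.109.100.204")   # host address used in the DNAT rule


def describe_overlap(service, route, host):
    """Summarize how the service CIDR relates to the host-side addressing."""
    return {
        "service_overlaps_route": service.overlaps(route),
        "host_inside_service_cidr": host in service,
        "service_range": (str(service[0]), str(service[-1])),
    }


if __name__ == "__main__":
    print(describe_overlap(SERVICE_CIDR, HOST_ROUTE, HOST_ADDR))
```

The /12 spans 10.96.0.0 through 10.111.255.255, so a host at 10.109.100.x sits inside the service range itself, not merely in the same 10.0.0.0/8 block.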

I have verified this locally: after changing the default service CIDR, all calico pods start normally:

root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=./kubeconfig get node
NAME                    STATUS   ROLES    AGE     VERSION
ibmcloud-master-ckwg8   Ready    master   4m35s   v1.14.0
ibmcloud-node-sz9dc     Ready    <none>   10s     v1.14.0
root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=./kubeconfig get pod --all-namespaces
NAMESPACE                  NAME                                            READY   STATUS    RESTARTS   AGE
ibmcloud-provider-system   clusterapi-controller-0                         1/1     Running   0          4m28s
kube-system                calico-node-q4rgp                               1/1     Running   0          27s
kube-system                calico-node-x89bw                               1/1     Running   0          4m28s
kube-system                coredns-fb8b8dccf-29h8c                         1/1     Running   0          4m28s
kube-system                coredns-fb8b8dccf-4bbtj                         1/1     Running   0          4m28s
kube-system                etcd-ibmcloud-master-ckwg8                      1/1     Running   0          3m41s
kube-system                kube-apiserver-ibmcloud-master-ckwg8            1/1     Running   0          3m47s
kube-system                kube-controller-manager-ibmcloud-master-ckwg8   1/1     Running   0          3m52s
kube-system                kube-proxy-2fv6h                                1/1     Running   0          27s
kube-system                kube-proxy-sgc8g                                1/1     Running   0          4m28s
kube-system                kube-scheduler-ibmcloud-master-ckwg8            1/1     Running   0          3m41s
system                     controller-manager-0                            1/1     Running   0          4m28s

@jichenjc Maybe you can double check if this works if you have time.


jichenjc avatar jichenjc commented on June 17, 2024

OK, let me try this solution first, then I will post some comments based on the PR you submitted and the info here. The IBM Cloud 10.x.x.x/8 range might be the key issue, as you suggested @morvencao.


morvencao avatar morvencao commented on June 17, 2024

@jichenjc Thanks, let's keep default service CIDR unchanged and try to make vm IP configurable.


jichenjc avatar jichenjc commented on June 17, 2024

I tried both 20.96.0.0/12 and 172.31.0.0/16, configured as:

spec:
    clusterNetwork:
        services:
            cidrBlocks: ["172.31.0.0/16"]

Both work for me, so I suggest we use 172.31.0.0/16: it is enough for us to proceed and is still a private IP range. We should document this limitation, which stems from the IBM Cloud private IP issue, and proceed...
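
For comparison, a minimal Python sketch that evaluates the three service CIDRs discussed in this thread against the 10.0.0.0/8 host route. The `evaluate` helper is hypothetical, and "private" here means what the standard `ipaddress` module reports for the range.

```python
import ipaddress

HOST_ROUTE = ipaddress.ip_network("10.0.0.0/8")  # static route seen on the node


def evaluate(cidr, host_route=HOST_ROUTE):
    """Report overlap with the host route, privacy, and size of a candidate CIDR."""
    net = ipaddress.ip_network(cidr)
    return {
        "cidr": str(net),
        "overlaps_host_route": net.overlaps(host_route),
        "is_private": net.is_private,
        "addresses": net.num_addresses,
    }


if __name__ == "__main__":
    for candidate in ("10.96.0.0/12", "20.96.0.0/12", "172.31.0.0/16"):
        print(evaluate(candidate))
```

Of the three, only 172.31.0.0/16 both avoids the host route and stays in private address space, at the cost of a smaller range (65536 addresses instead of 1048576).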

@gyliu513 @xunpan thoughts?


gyliu513 avatar gyliu513 commented on June 17, 2024

@jichenjc can we enhance clouds.yaml as follows so that end users can configure the CIDR for the VMs?

clouds:
  ibmcloud:
    auth:
      apiUserName: "Your API Username"
      authenticationKey: "Your API Authentication Key"
    machines:
      cidrBlocks: ["20.96.0.0/12"]


morvencao avatar morvencao commented on June 17, 2024

@gyliu513 If we want to use a customized CIDR for VM IPs other than the default 10.x.x.x/8, then we need to create a subnetwork and then create the VM in that subnetwork, similar to OpenStack.


jichenjc avatar jichenjc commented on June 17, 2024

@morvencao you mean create a subnet in IBM Cloud so that our instances start using that network? I haven't tried this before, but I think it's something we can suggest.

So to summarize, we have to keep '10.96.0.0/12' as the cluster IP range and change the IBM Cloud side?


morvencao avatar morvencao commented on June 17, 2024

@jichenjc I think subnetworks or a configurable VM CIDR should be supported eventually.

For now, I'm not sure whether SoftLayer supports creating customized subnetworks. I couldn't find a subnetwork creation interface in either the API or the UI portal: https://control.softlayer.com/network/subnets


jichenjc avatar jichenjc commented on June 17, 2024

I have no idea how IBM Cloud manages its IP settings.

It looks like we used this subnet (maybe because we used the Washington 1 zone). I have no idea why it has a 10.x.x.x/8 route by default or how the hosts are connected to each other. As far as I can tell, IBM Cloud currently doesn't support VPC (see the AWS VPC concept)... So again, for the short term, modifying the cluster service IP range (it's virtual and not accessible from outside) seems a good way forward; then we have time to ask about and test IBM Cloud's 10.x IP usage.

10.109.100.192/26 Private Primary wdc01.bcr05a.1165 Washington 1


morvencao avatar morvencao commented on June 17, 2024

@jichenjc Agreed. A complete fix needs changes in the API, at least support for a configurable CIDR in clouds.yaml or machines.yaml, I think. Changing the service CIDR should be a short-term fix.

Comments? @gyliu513

