Comments (20)
@hemanthavs this may be a kubeadm bug.
@morvencao I recall you also encountered this issue before; any comments on this? If this is a kubeadm bug, we can open an issue there.
from cluster-api-provider-ibmcloud.
@gyliu513 Yes, I have encountered that, let me try to reproduce this and find the root cause.
projectcalico/calico#2040
seems to point to a kube-dns issue...
/assign @morvencao
I hit this again today...
root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
ibmcloud-provider-system ibmcloud-machine-controller-0 1/1 Running 1 12m
kube-system calico-node-9qmtb 1/1 Running 0 12m
kube-system calico-node-w96kf 0/1 CrashLoopBackOff 5 7m42s
kube-system coredns-fb8b8dccf-n2zkv 1/1 Running 0 12m
kube-system coredns-fb8b8dccf-sz85p 1/1 Running 0 12m
kube-system etcd-jichen-master-g2x5d 1/1 Running 0 11m
kube-system kube-apiserver-jichen-master-g2x5d 1/1 Running 0 11m
kube-system kube-controller-manager-jichen-master-g2x5d 1/1 Running 0 11m
kube-system kube-proxy-j9ll5 1/1 Running 0 7m42s
kube-system kube-proxy-zzlfn 1/1 Running 0 12m
kube-system kube-scheduler-jichen-master-g2x5d 1/1 Running 0 11m
kube-system kubernetes-dashboard-5f7b999d65-ntrxb 1/1 Running 0 12m
system controller-manager-0 1/1 Running 0 12m
root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig log calico-node-w96kf -n kube-system
Threshold time for bird readiness check: 30s
2019-05-23 09:53:34.707 [INFO][10] startup.go 256: Early log level set to info
2019-05-23 09:53:34.707 [INFO][10] startup.go 272: Using NODENAME environment for node name
2019-05-23 09:53:34.707 [INFO][10] startup.go 284: Determined node name: jichen-node-s9s2b
2019-05-23 09:53:34.709 [INFO][10] startup.go 316: Checking datastore connection
2019-05-23 09:54:04.709 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2019-05-23 09:54:35.710 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
Maybe DNS has an issue: perhaps our create step did not inject the user-data correctly, so cloud-init did not set it up properly?
@hemanth-avs @xunpan @gyliu513 we need to give this high priority...
root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig get nodes
NAME STATUS ROLES AGE VERSION
jichen-master-g2x5d Ready master 22h v1.14.0
jichen-node-12345 Ready <none> 3h42m v1.14.0
jichen-node-67890 Ready <none> 44m v1.14.0
jichen-node-s9s2b Ready <none> 22h v1.14.0
name1 Ready <none> 5h48m v1.14.0
I have the 1+4 cluster above (1 master, 4 nodes), and
root@jichen1:/home/cloudusr/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=kubeconfig get pods -n kube-system
NAME READY STATUS RESTARTS AGE
calico-node-7zhdp 0/1 CrashLoopBackOff 63 3h42m
calico-node-9qmtb 1/1 Running 0 22h
calico-node-ggspb 0/1 CrashLoopBackOff 97 5h48m
calico-node-w96kf 0/1 Running 368 22h
calico-node-whb9c 0/1 CrashLoopBackOff 15 45m
Apparently all the worker nodes have issues in their calico setup.
I am not an expert in networking, especially on the calico side, but it looks like the new nodes have trouble connecting to the service IP 10.96.0.1, which by default is the kube-apiserver.
I logged on to one node and found this route:
root@jichen-node-67890:~# ip route
10.0.0.0/8 via 10.109.100.193 dev eth0 proto static
So all 10.x.x.x IPs use 10.109.100.193 as the gateway, and I don't know whether that causes a problem.
Need to check the following potential points further:
- whether 10.x.x.x is reserved by IBM cloud as private IP space, so that traffic is routed somewhere else
- whether a security group blocks the internal IP (less likely, because kubeadm join ... works)
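The first point can be illustrated with a minimal Python sketch of a longest-prefix route lookup. It assumes the node's table holds only the static 10.0.0.0/8 route shown above plus an on-link /26 route that contains the gateway; that second entry and the helper name `pick_route` are hypothetical, not output captured from the node. The sketch models a plain routing lookup only and ignores any NAT rules kube-proxy may install.

```python
import ipaddress

# Routes as (destination network, next hop). The first entry is the static
# route from `ip route` above; the second is an ASSUMED on-link subnet that
# contains the gateway (hypothetical, for illustration only).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "10.109.100.193"),
    (ipaddress.ip_network("10.109.100.192/26"), None),  # None = on-link
]


def pick_route(dest, routes=ROUTES):
    """Return the most specific (longest-prefix) route matching dest, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [route for route in routes if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen, default=None)


if __name__ == "__main__":
    for dest in ("10.96.0.1", "10.109.100.204", "172.31.0.1"):
        print(dest, "->", pick_route(dest))
```

Under these assumptions, a packet addressed to 10.96.0.1 that is not rewritten first would be handed to the 10.109.100.193 gateway, while an address outside 10.0.0.0/8 matches neither route.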
What do the logs of those CrashLoopBackOff pods show? Is this a kubeadm bug? Does OpenStack work fine?
The log is pasted above. I don't think it's a kubeadm bug, because openstack seems to work fine based on my previous test results... @morvencao have you recently used openstack to create a cluster and hit this issue?
I ran this on the node:
iptables -t nat -A PREROUTING -d 10.96.0.1 -j DNAT --to-destination 10.109.100.204
calico-node-9qmtb 1/1 Running 2 3d18h
....
calico-node-whb9c 1/1 Running 1112 2d20h
This fixed the problem pod (whb9c), so maybe kube-proxy has a problem? Need to check the configuration further...
@jichenjc Sorry for the late response, I didn't have time to try this on provider openstack.
If you check the logs of the failed pod, you can see that the pod can't connect to the api-server:
root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kc -n kube-system logs -f calico-node-smfx8
Threshold time for bird readiness check: 30s
2019-05-27 11:14:48.306 [INFO][10] startup.go 256: Early log level set to info
2019-05-27 11:14:48.306 [INFO][10] startup.go 272: Using NODENAME environment for node name
2019-05-27 11:14:48.306 [INFO][10] startup.go 284: Determined node name: ibmcloud-node-jkb9p
2019-05-27 11:14:48.307 [INFO][10] startup.go 316: Checking datastore connection
2019-05-27 11:15:18.308 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2019-05-27 11:15:49.309 [INFO][10] startup.go 331: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
It may be caused by Calico selecting the wrong interface, since each node has multiple interfaces.
The calico node crashes are probably caused by an overlap between the service CIDR and the host CIDR: the default service CIDR is 10.96.0.0/12, while the host IPs are 10.xx.xxx.xxx.
So I suggest making sure they do not overlap by changing our default service CIDR from 10.96.0.0/12 to 20.96.0.0/12 in PR: #191
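The overlap described above can be checked with Python's standard `ipaddress` module. This is a minimal sketch: `HOST_RANGE` stands in for the 10.xx.xxx.xxx host addresses and is an assumption (the whole 10.0.0.0/8 block), and `cidrs_overlap` is an illustrative helper name.

```python
import ipaddress

# ASSUMPTION: the host addresses (10.xx.xxx.xxx) are modelled as 10.0.0.0/8.
HOST_RANGE = "10.0.0.0/8"


def cidrs_overlap(a, b):
    """Return True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


if __name__ == "__main__":
    for service_cidr in ("10.96.0.0/12", "20.96.0.0/12"):
        print(service_cidr, "overlaps host range:",
              cidrs_overlap(service_cidr, HOST_RANGE))
```

With that assumed host range, the default 10.96.0.0/12 overlaps and 20.96.0.0/12 does not.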
I have verified this locally: after changing the default service CIDR, all calico pods start normally:
root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=./kubeconfig get node
NAME STATUS ROLES AGE VERSION
ibmcloud-master-ckwg8 Ready master 4m35s v1.14.0
ibmcloud-node-sz9dc Ready <none> 10s v1.14.0
root@mc-dev:~/go/src/sigs.k8s.io/cluster-api-provider-ibmcloud/cmd/clusterctl# kubectl --kubeconfig=./kubeconfig get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
ibmcloud-provider-system clusterapi-controller-0 1/1 Running 0 4m28s
kube-system calico-node-q4rgp 1/1 Running 0 27s
kube-system calico-node-x89bw 1/1 Running 0 4m28s
kube-system coredns-fb8b8dccf-29h8c 1/1 Running 0 4m28s
kube-system coredns-fb8b8dccf-4bbtj 1/1 Running 0 4m28s
kube-system etcd-ibmcloud-master-ckwg8 1/1 Running 0 3m41s
kube-system kube-apiserver-ibmcloud-master-ckwg8 1/1 Running 0 3m47s
kube-system kube-controller-manager-ibmcloud-master-ckwg8 1/1 Running 0 3m52s
kube-system kube-proxy-2fv6h 1/1 Running 0 27s
kube-system kube-proxy-sgc8g 1/1 Running 0 4m28s
kube-system kube-scheduler-ibmcloud-master-ckwg8 1/1 Running 0 3m41s
system controller-manager-0 1/1 Running 0 4m28s
@jichenjc Maybe you can double check if this works if you have time.
OK, let me try this solution first; then I will post some comments based on the PR you submitted and the info here. The IBM cloud 10.x.x.x/8 range might be the key issue, as you suggested @morvencao
@jichenjc Thanks, let's keep the default service CIDR unchanged and try to make the VM IP configurable.
I tried both 20.96.x.x/12 and 172.31.0.0/16, as in
spec:
  clusterNetwork:
    services:
      cidrBlocks: ["172.31.0.0/16"]
Both work for me, so I suggest we use 172.31.0.0/16: it is large enough for us to proceed and is still a private IP range. We should document this limitation, which comes from the IBM cloud private IP issue, and proceed...
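As a rough check of the two candidates, the sketch below reports whether each block is private, whether it overlaps the host range, and how many addresses it holds. `HOST_RANGE` (10.0.0.0/8) is an assumption standing in for the IBM cloud host addresses, 20.96.0.0/12 is used as the concrete form of the 20.96.x.x/12 candidate, and `describe` is a hypothetical helper.

```python
import ipaddress

# ASSUMPTION: IBM cloud host addresses are modelled as the 10.0.0.0/8 block.
HOST_RANGE = ipaddress.ip_network("10.0.0.0/8")


def describe(cidr):
    """Summarize a candidate service CIDR."""
    net = ipaddress.ip_network(cidr)
    return {
        "cidr": str(net),
        # is_private checks IANA special-purpose ranges, which include
        # the RFC 1918 blocks 10/8, 172.16/12 and 192.168/16.
        "private": net.is_private,
        "overlaps_host_range": net.overlaps(HOST_RANGE),
        "addresses": net.num_addresses,
    }


if __name__ == "__main__":
    for candidate in ("10.96.0.0/12", "20.96.0.0/12", "172.31.0.0/16"):
        print(describe(candidate))
```

Of the three, only 172.31.0.0/16 is both private and outside the assumed host range, at the cost of a smaller block (65536 addresses).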
@gyliu513 @xunpan thoughts?
@jichenjc can we enhance clouds.yaml as follows so that end users can configure the CIDR for the VMs?
clouds:
  ibmcloud:
    auth:
      apiUserName: "Your API Username"
      authenticationKey: "Your API Authentication Key"
    machines:
      cidrBlocks: ["20.96.0.0/12"]
@gyliu513 If we want to use a customized CIDR for VM IPs other than the default 10.x.x.x/8, then we need to create a subnetwork and then create the VM in that subnetwork, something like openstack.
@morvencao you mean create a subnet in IBM cloud and then our instances will start to use that network? I haven't tried this before, but I think it's something we can suggest.
So to summarize, we have to keep '10.96.0.0/12' as the cluster IP range and change the IBM cloud side?
@jichenjc I think subnetwork or configurable VM CIDR should be supported eventually.
For now, I'm not sure whether softlayer supports creating customized subnetworks; I didn't find a subnetwork creation interface in either the API or the UI portal: https://control.softlayer.com/network/subnets
I have no idea how IBM cloud runs their IP settings.
It looks like we used this subnet (maybe because we used the Washington 1 zone):
10.109.100.192/26 | Private | Primary | wdc01.bcr05a.1165 | Washington 1
I have no idea why it has a 10.x.x.x/8 route by default or how the subnets connect to each other. As far as I can tell, IBM cloud currently doesn't support VPC (see the AWS VPC concept)... So again, for the short term, modifying the cluster service IP range (it's virtual and not accessible from outside) seems a good way forward; then we have time for questions and testing on IBM cloud's 10.x IP usage.
@jichenjc Agreed. A complete fix needs changes in the API, at least supporting a configurable CIDR in Clouds.yaml or machines.yaml, I think. Changing the service CIDR should be a short-term fix.
Comments? @gyliu513