
Comments (6)

AuroreM commented on August 11, 2024

We managed to understand the issue: calico-node starts too quickly, before kube-proxy has fully programmed its iptables rules, so the kubernetes service IP is not yet reachable.


caseydavenport commented on August 11, 2024

Yeah, calico/node relies on kube-proxy to program the kubernetes service IP in order to access the API, unless it is running in eBPF mode, in which case an explicit IP can be given.

Might just need to wait until kube-proxy is ready before Calico is installed.

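For context, in eBPF mode Calico bypasses kube-proxy entirely, so it has to be told where the API server is directly. A minimal sketch of how that explicit address is typically supplied in operator-based installs, via a `kubernetes-services-endpoint` ConfigMap (the host and port values here are placeholders, not real addresses):

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kubernetes-services-endpoint
  namespace: tigera-operator
data:
  # Placeholder values: use a stable address for the API server that does not
  # depend on the cluster service IP, e.g. a load balancer in front of the
  # control plane nodes.
  KUBERNETES_SERVICE_HOST: "192.0.2.10"
  KUBERNETES_SERVICE_PORT: "6443"
```

With this in place, calico/node can reach the API server without depending on kube-proxy having programmed the `kubernetes` service cluster IP first.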

tmjd commented on August 11, 2024

@caseydavenport I don't think this is something that we or anyone else can control. They're both daemonsets, and there isn't anything that kube-proxy creates or sets that could delay calico-node startup, is there? I'm thinking of the case of adding new nodes to a cluster.
Should we consider a startup probe to cover this case? (I just learned about them and see they became available in v1.20.)
Here are a few links where I was reading about them:

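A startup probe defers the liveness probe until it has succeeded once, which would give calico-node extra time on a fresh node without loosening the steady-state liveness check. A minimal sketch of what that might look like on the calico-node container, assuming the startup probe reuses calico-node's existing liveness exec command; the thresholds are illustrative assumptions, not tested values:

```yaml
# Hypothetical addition to the calico-node container spec.
startupProbe:
  exec:
    command: ["/bin/calico-node", "-felix-live"]
  periodSeconds: 10
  failureThreshold: 30   # up to ~300s for kube-proxy to program the service IP on a new node
livenessProbe:
  exec:
    command: ["/bin/calico-node", "-felix-live"]
  periodSeconds: 10
  failureThreshold: 3    # unchanged: once started, liveness stays strict
```

Note that the kubelet only honors `startupProbe` on clusters where the feature is available (it went GA in Kubernetes v1.20), so older clusters would not benefit from this field.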

wiikip commented on August 11, 2024

After further investigation, we found that the failed request to the kube API server takes 30s to time out. During this time the liveness probe has time to fail three times, and the pod is then restarted. I think a quick-win solution would be to decrease this timeout to 5s: even if the first call to the API fails, 5s later the API will be up for sure and the probe will not kill the pod.


caseydavenport commented on August 11, 2024

@tmjd yeah, I think it's a hard one to do safely. Most likely we could get something that works for 90% of cases, but not all.

@wiikip yeah, I think we could decrease the connection timeout, but we need to be careful. I don't think we should drop it below 10s; too short and we risk destabilizing things unnecessarily. Maybe we can drop the connection timeout to 10s and increase the liveness probe failure threshold to allow 40s? That should be plenty.


wiikip commented on August 11, 2024

@caseydavenport Yeah, I agree with you. Decreasing the connection timeout to 10s and allowing the liveness probe to fail a 4th time (40s before it kills the pod) will do the job!

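For reference, a sketch of what the agreed tuning might look like on the probe side. The connection timeout itself lives inside calico-node's health reporting rather than in the probe spec; the numbers below only encode the 4-failures-in-40s budget discussed above, and the probe command is the one calico-node manifests conventionally use:

```yaml
# Hypothetical calico-node liveness probe reflecting the numbers agreed above.
livenessProbe:
  exec:
    command: ["/bin/calico-node", "-felix-live", "-bird-live"]
  periodSeconds: 10     # probe every 10s
  timeoutSeconds: 10    # each probe run gets 10s, matching the lowered connection timeout
  failureThreshold: 4   # 4 consecutive failures x 10s period = ~40s before the kubelet restarts the pod
```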
