Comments (13)
I've fixed the problem by replacing both init containers: cni and pod2daemon (the flexvolume driver). I needed to rebuild all the cni-plugin and flexvolume-driver binaries with static link flags.
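Since these components are written in Go, a static rebuild generally means disabling cgo and forcing static linking. A minimal sketch of the idea, assuming a Go toolchain is available; the package path and output name are placeholders, not the actual Calico build targets:

```shell
# Disable cgo so the Go toolchain produces a binary with no libc
# dependency; a minimal userland like k3OS's is where dynamically
# linked binaries tend to break.
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -ldflags '-extldflags "-static"' \
  -o bin/flexvoldriver ./cmd/flexvoldriver   # placeholder path

# Confirm the result is static: ldd should report it is not a
# dynamic executable.
ldd bin/flexvoldriver || true
```

The same build with GOARCH=arm64 would cover the Pi nodes.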
My setup is a k3s cluster running k3OS images (2 Pis on arm64 + 1 Proxmox VM on amd64) with Calico Operator v3.22. One thing I noticed is that both arches need a rebuilt pod2daemon, but only amd64 also needs a rebuilt cni.
I've pushed these two images (cni and pod2daemon-flexvol) for both arm64 and amd64 in case you want to test them on your end.
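Multi-arch images like these can be built and pushed as one manifest with docker buildx; the image name below is a placeholder, not the actual pushed image:

```shell
# Requires a buildx builder with amd64 + arm64 support (QEMU binfmt
# handlers cover cross-building on a single host).
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t docker.io/example/pod2daemon-flexvol:test \
  --push .
```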
Cheers
from operator.
Hi @Glen-Tigera, I got an email from another k3d user saying that they're seeing the same issue on k3OS with Tigera Operator v3.21, while it works with v3.15.
@xinbinhuang Thanks for looking into this too! 🙏 Appreciate you linking a related issue; it looks like a resolution is in progress there.
@tmjd That worked. Once I changed the manifest to use v3.20.0 cni and pod2daemon init containers, the pod network was functional once the daemonset re-created the calico-nodes. It is still functional with v3.22.0 of the calico-node image and v3.20.0 (cni + pod2daemon), so the issue is just in the init containers.
Casey has a PR to fix pod2daemon; that may be the source of this issue.
projectcalico/calico#5515
Hey @tmjd, just took a look. Yes, you're right: the v3.21 installation has nonPrivileged: Disabled as the default, while this is not in v3.20. There is also controlPlaneReplicas: 2 in the v3.21 installation.
I believe the k3d cluster create command creates 1 server and 1 agent node by default, so that is why there's only 1 calico-node deployed at the time. The number of servers and agents on the cluster can be tuned, though, either via flags or a config file:
https://k3d.io/v5.2.2/usage/commands/k3d_cluster_create/#synopsis
https://k3d.io/v5.2.1/usage/configfile/
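For example, a sketch using the k3d v5 flag syntax to create one server and two agents, with the bundled Flannel disabled so Calico can be installed (the cluster name is arbitrary):

```shell
# 1 server + 2 agents instead of the 1+1 default; the --k3s-arg
# node-filter syntax passes --flannel-backend=none to the server
# so Calico can take over pod networking.
k3d cluster create calico-test \
  --servers 1 \
  --agents 2 \
  --k3s-arg '--flannel-backend=none@server:*'
```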
Sorry, I got confused about the nodes; I'm not sure why I thought there should be more. I don't think my previous comment was very useful except to confirm that nonPrivileged: Disabled is set, which means the installation isn't using the new nonPrivileged option, as I would have expected.
You're suggesting the difference between the versions is in the operator, but it could very well be in calico-node. Could I suggest trying a v3.21 install, putting the annotation unsupported.operator.tigera.io/ignore: "true" on the calico-node daemonset, switching the calico-node image to the v3.20 version, and seeing if the problem still exists? You could also try installing v3.20 and then switching the calico-node image to v3.21, but I'm less confident in version compatibility with that combo.
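That experiment can be sketched as two kubectl commands; the container name calico-node is assumed from the operator's default daemonset:

```shell
# Stop the operator from reconciling the daemonset...
kubectl annotate daemonset calico-node -n calico-system \
  unsupported.operator.tigera.io/ignore=true

# ...then swap only the calico-node container image down to v3.20.0.
kubectl set image daemonset/calico-node -n calico-system \
  calico-node=docker.io/calico/node:v3.20.0
```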
Thanks for looking into this! I'm seeing the same issue on K3os (provisioned as proxmox VM). And I think projectcalico/calico#5356 can be relevant here.
@Glen-Tigera could you try updating the cni plugin and flexvol container to v3.20.0 also?
K3D + Calico operator install summary:
3.15 ✔️
3.16 ✔️
3.17 ✔️
3.18 ✔️
3.19 ✔️
3.20 ✔️
3.21 ❌
k3d-calico-operator-install-findings.txt
@Glen-Tigera did you compare the Installation resources of v3.20 and v3.21? At a minimum, the v3.21 Installation should have a NonPrivileged field set to Disabled; that field was not available in v3.20 because the only option was privileged.
Also, did you look at the calico-node Daemonset? Your install-findings file shows that one calico-node was deployed and even Ready, but why weren't more calico-node pods at least being attempted? That suggests a scheduling problem, which I'd expect to be reported in the Daemonset.
Hey @tmjd, sorry, I've been busy with test plans the past few weeks so I couldn't address this until now. I provisioned a v3.21 calico/node first and then applied the annotation above. Then I tried changing the image field to v3.20.0 for calico-node. It looks like the problem still exists, unless there's a better way to downgrade the node.
kubectl annotate daemonsets calico-node -n calico-system unsupported.operator.tigera.io/ignore="true"
daemonset.apps/calico-node annotated
After annotation, the network was the same:
NAMESPACE NAME READY STATUS RESTARTS AGE
tigera-operator tigera-operator-c4b9549c7-w2527 1/1 Running 0 5m46s
calico-system calico-typha-8686dd5c79-798gg 1/1 Running 0 5m23s
calico-system calico-typha-8686dd5c79-q7xx5 1/1 Running 0 5m32s
calico-system calico-kube-controllers-7cd6f7b9f9-rpjkj 0/1 ContainerCreating 0 5m32s
kube-system local-path-provisioner-5ff76fc89d-chc5f 0/1 ContainerCreating 0 7m
kube-system coredns-7448499f4d-9sllb 0/1 ContainerCreating 0 7m
kube-system metrics-server-86cbb8457f-fcbrd 0/1 ContainerCreating 0 7m
calico-system calico-node-fhkqf 1/1 Running 0 5m32s
calico-system calico-node-khmrj 1/1 Running 0 5m32s
calico-system calico-node-j2qqm 1/1 Running 0 5m32s
calico-system calico-node-5hv2d 1/1 Running 0 5m32s
Then I edited the daemonset spec, changing the container image field:
.spec.template.spec.containers[].image: docker.io/calico/node:v3.21.4
to
.spec.template.spec.containers[].image: docker.io/calico/node:v3.20.0
After that, the daemonset terminated the v3.21.4 calico-node containers and created new ones that pulled v3.20.0. I waited for a minute and still couldn't see the remaining containers become healthy, so you might be right that it could be an issue in calico-node instead of the operator.
NAMESPACE NAME READY STATUS RESTARTS AGE
tigera-operator tigera-operator-c4b9549c7-w2527 1/1 Running 0 22m
calico-system calico-typha-8686dd5c79-798gg 1/1 Running 0 22m
calico-system calico-typha-8686dd5c79-q7xx5 1/1 Running 0 22m
kube-system local-path-provisioner-5ff76fc89d-chc5f 0/1 ContainerCreating 0 23m
kube-system coredns-7448499f4d-9sllb 0/1 ContainerCreating 0 23m
kube-system metrics-server-86cbb8457f-fcbrd 0/1 ContainerCreating 0 23m
calico-system calico-kube-controllers-7cd6f7b9f9-9vhzz 0/1 ContainerCreating 0 5m41s
calico-system calico-node-qvqm4 1/1 Running 0 3m56s
calico-system calico-node-wfs29 1/1 Running 0 3m44s
calico-system calico-node-xbdb9 1/1 Running 0 3m22s
calico-system calico-node-5ptfb 1/1 Running 0 2m53s
@tmjd while waiting for the upstream image to be fixed, is it possible to override the init containers image during operator installation?
One way to temporarily override the init containers is to use the "unsupported" annotation on the calico-node daemonset; note that with that annotation added, the daemonset will no longer be updated by the operator. You can see how here: https://github.com/tigera/operator/blob/master/README.md#making-temporary-changes-to-components-the-operator-manages.
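As a sketch, overriding the two init container images might look like the following. The init container indices (and whether flexvol or cni comes first) are assumptions here, so check the daemonset spec before patching:

```shell
# Stop the operator from reconciling the daemonset.
kubectl annotate daemonset calico-node -n calico-system \
  unsupported.operator.tigera.io/ignore=true

# Indices 0 and 1 are assumed; verify the order of initContainers
# with: kubectl get ds calico-node -n calico-system -o yaml
kubectl patch daemonset calico-node -n calico-system --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/0/image",
   "value": "docker.io/calico/pod2daemon-flexvol:v3.20.0"},
  {"op": "replace",
   "path": "/spec/template/spec/initContainers/1/image",
   "value": "docker.io/calico/cni:v3.20.0"}
]'
```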
This isn't an operator issue - the cause of this was us switching to dynamically linked builds of some host binaries (CNI and pod2daemon flexvol).
These both have fixes that will be available in the next Calico release (v3.23).