Comments (13)

xinbinhuang commented on August 11, 2024

I've fixed the problem by replacing both init containers, cni and pod2daemon (the flexvolume driver). I had to rebuild all of the CNI plugin and flexvolume driver binaries with static linking flags.

My setup is a k3s cluster with every node running k3OS (two Raspberry Pis on arm64 plus one Proxmox VM on amd64) with Calico Operator v3.22. One thing I noticed: both architectures needed a rebuilt pod2daemon, but only amd64 also needed a rebuilt cni.

I've pushed these two images (cni and pod2daemon-flexvol) for both arm64 and amd64 in case you want to test them on your end.

Cheers

from operator.

iwilltry42 commented on August 11, 2024

Hi @Glen-Tigera, I got an e-mail from another k3d user saying that he's seeing the same issue on k3OS with Tigera Operator v3.21, while it works with v3.15.

Glen-Tigera commented on August 11, 2024

@xinbinhuang Thanks for looking into this too! 🙏 I appreciate you linking a related issue; it looks like a fix is in progress.

@tmjd That worked. After I changed the manifest to use the v3.20.0 cni and pod2daemon init containers and the DaemonSet re-created the calico-node pods, the pod network was functional. It is still functional with v3.22.0 of the calico-node image and v3.20.0 of cni and pod2daemon, so the issue is only in the init containers.

[Screenshot from 2022-01-31 15-36-53]

Casey has a PR to fix pod2daemon; that may be the source of this issue.
projectcalico/calico#5515

Glen-Tigera commented on August 11, 2024

Hey @tmjd, I just took a look. Yes, you're right: the v3.21 Installation has nonPrivileged: Disabled as the default, while that field isn't present in v3.20. The v3.21 Installation also has controlPlaneReplicas: 2.

[Screenshot from 2022-01-12 18-32-11]
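To compare the two Installation resources as text rather than screenshots, one option is to export each with `kubectl get installation default -o json` and diff their `spec` objects. The Python sketch below is a generic recursive dict diff, not part of any operator tooling; the helper name `diff_specs` is hypothetical and the sample fragments contain only the fields mentioned in this thread, not complete Installation specs:

```python
from typing import Any


def diff_specs(old: Any, new: Any, path: str = "") -> list[str]:
    """Describe differences between two JSON-like values, recursing into dicts.

    Lists and scalars are compared as whole values, not element by element.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes: list[str] = []
        for key in sorted(set(old) | set(new)):
            sub = f"{path}.{key}" if path else key
            if key not in old:
                changes.append(f"added {sub} = {new[key]!r}")
            elif key not in new:
                changes.append(f"removed {sub} (was {old[key]!r})")
            else:
                changes.extend(diff_specs(old[key], new[key], sub))
        return changes
    if old != new:
        return [f"changed {path}: {old!r} -> {new!r}"]
    return []


if __name__ == "__main__":
    # Illustrative fragments only, built from the fields named in this thread;
    # real input would be the "spec" of each Installation loaded from JSON.
    v320 = {"variant": "Calico"}
    v321 = {"variant": "Calico", "nonPrivileged": "Disabled", "controlPlaneReplicas": 2}
    for line in diff_specs(v320, v321):
        print(line)
```

For the sample fragments this prints `added controlPlaneReplicas = 2` followed by `added nonPrivileged = 'Disabled'`; keys are sorted so the output is stable between runs.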

I believe k3d cluster create creates a single server node and no agent nodes by default, which is why only one calico-node was deployed at the time. The number of servers and agents in the cluster can be tuned with the --servers and --agents flags or in a k3d config file:
https://k3d.io/v5.2.2/usage/commands/k3d_cluster_create/#synopsis
https://k3d.io/v5.2.1/usage/configfile/

tmjd commented on August 11, 2024

Sorry, I got confused about the nodes; I'm not sure why I thought there should be more. I don't think my previous comment was very useful, except for confirming that nonPrivileged: Disabled is set, which means the installation isn't using the new nonPrivileged option. That is what I would have expected.

You're suggesting that the difference between the versions is in the operator, but it could very well be in calico-node. Could I suggest trying a v3.21 install, then putting the annotation unsupported.operator.tigera.io/ignore: "true" on the calico-node DaemonSet, switching the calico-node image to the v3.20 version, and seeing whether the problem still exists? You could also try installing v3.20 and then switching the calico-node image to v3.21, but I'm less confident in version compatibility with that combination.
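The pin-then-swap experiment can be applied in one step as a JSON Patch. The Python sketch below builds that patch from the live DaemonSet object, so container indices are looked up rather than hard-coded. Assumptions: the container names `calico-node`, `install-cni` and `flexvol-driver` are what the operator renders for this release, and `build_patch`/`escape_pointer` are hypothetical helpers, not part of any Calico tooling. The `~1` in the annotation path is JSON Pointer's escape for the `/` inside the annotation key.

```python
import json
from typing import Any

IGNORE_ANNOTATION = "unsupported.operator.tigera.io/ignore"


def escape_pointer(token: str) -> str:
    """Escape one JSON Pointer token (RFC 6901): "~" -> "~0", then "/" -> "~1"."""
    return token.replace("~", "~0").replace("/", "~1")


def build_patch(daemonset: dict[str, Any], images: dict[str, str]) -> list[dict[str, Any]]:
    """Build JSON Patch ops that pin the DaemonSet and swap container images.

    `images` maps container name -> new image. Both init containers and regular
    containers are searched; indices come from the DaemonSet object itself.
    """
    ops: list[dict[str, Any]] = []
    if daemonset.get("metadata", {}).get("annotations") is None:
        # "add" on a nested key fails if the parent map is missing, so create it first.
        ops.append({"op": "add", "path": "/metadata/annotations", "value": {}})
    ops.append({
        "op": "add",
        "path": f"/metadata/annotations/{escape_pointer(IGNORE_ANNOTATION)}",
        "value": "true",
    })
    pod_spec = daemonset["spec"]["template"]["spec"]
    for field in ("initContainers", "containers"):
        for index, container in enumerate(pod_spec.get(field, [])):
            new_image = images.get(container["name"])
            if new_image is not None:
                ops.append({
                    "op": "replace",
                    "path": f"/spec/template/spec/{field}/{index}/image",
                    "value": new_image,
                })
    return ops


if __name__ == "__main__":
    # Hypothetical, heavily trimmed DaemonSet; in practice load the real one from
    # `kubectl get daemonset calico-node -n calico-system -o json`.
    ds = {
        "metadata": {"name": "calico-node", "annotations": {}},
        "spec": {"template": {"spec": {
            "initContainers": [{"name": "flexvol-driver"}, {"name": "install-cni"}],
            "containers": [{"name": "calico-node"}],
        }}},
    }
    print(json.dumps(build_patch(ds, {"calico-node": "docker.io/calico/node:v3.20.0"}), indent=2))
```

The printed JSON can be passed to `kubectl patch daemonset calico-node -n calico-system --type=json -p '<patch>'`. Mapping `install-cni` and `flexvol-driver` to `docker.io/calico/cni:v3.20.0` and `docker.io/calico/pod2daemon-flexvol:v3.20.0` in the same call would cover the init-container downgrade that turned out to fix the pod network in this thread.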

xinbinhuang commented on August 11, 2024

Thanks for looking into this! I'm seeing the same issue on k3OS (provisioned as a Proxmox VM), and I think projectcalico/calico#5356 may be relevant here.

tmjd commented on August 11, 2024

@Glen-Tigera, could you also try updating the cni plugin and flexvol containers to v3.20.0?

Glen-Tigera commented on August 11, 2024

k3d + Calico operator install summary:
3.15 ✔️
3.16 ✔️
3.17 ✔️
3.18 ✔️
3.19 ✔️
3.20 ✔️
3.21 ❌

Attachment: k3d-calico-operator-install-findings.txt

tmjd commented on August 11, 2024

@Glen-Tigera, did you compare the Installation resources of v3.20 and v3.21? At a minimum, the v3.21 one should have a nonPrivileged field set to Disabled; that field did not exist in v3.20 because privileged was the only option.
Also, did you look at the calico-node DaemonSet? Your install-findings file shows that one calico-node pod was deployed and even Ready, so why weren't more calico-node pods at least being attempted? That suggests a scheduling problem, which I think should be reported in the DaemonSet's status.
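The DaemonSet's status counters answer the "why only one pod" question directly: desiredNumberScheduled is the number of nodes the controller considers eligible, so a value of 1 matches a single-node cluster, while currentNumberScheduled or numberReady falling short of it points at a scheduling or readiness problem. A minimal Python sketch that turns those standard DaemonSet status fields into warnings; the helper name is hypothetical, and it assumes the object was fetched with `kubectl get daemonset calico-node -n calico-system -o json` and parsed with `json.load`:

```python
from typing import Any


def daemonset_health(daemonset: dict[str, Any]) -> list[str]:
    """Return warnings derived from a DaemonSet's status counters (empty if healthy)."""
    status = daemonset.get("status", {})
    desired = status.get("desiredNumberScheduled", 0)    # eligible nodes
    current = status.get("currentNumberScheduled", 0)    # nodes running a pod
    ready = status.get("numberReady", 0)                 # pods passing readiness
    misscheduled = status.get("numberMisscheduled", 0)   # pods on ineligible nodes
    warnings: list[str] = []
    if current < desired:
        warnings.append(f"only {current}/{desired} pods scheduled; check taints and events")
    if ready < desired:
        warnings.append(f"only {ready}/{desired} pods ready")
    if misscheduled:
        warnings.append(f"{misscheduled} pods running on nodes they should not be on")
    return warnings
```

An empty result with desiredNumberScheduled equal to 1 would mean the DaemonSet is healthy and simply sees one node, rather than failing to schedule.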

Glen-Tigera commented on August 11, 2024

Hey @tmjd, sorry, I've been busy with test plans for the past few weeks, so I couldn't get to this until now. I provisioned calico/node v3.21 first and then applied the annotation you suggested. Then I tried changing the calico-node image field to v3.20.0. It looks like the problem still exists, unless there's a better way to downgrade the node.

kubectl annotate daemonsets calico-node -n calico-system unsupported.operator.tigera.io/ignore="true"
daemonset.apps/calico-node annotated

After annotating, the pod network was in the same state:

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          5m46s
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          5m23s
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          5m32s
calico-system     calico-kube-controllers-7cd6f7b9f9-rpjkj   0/1     ContainerCreating   0          5m32s
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          7m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          7m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          7m
calico-system     calico-node-fhkqf                          1/1     Running             0          5m32s
calico-system     calico-node-khmrj                          1/1     Running             0          5m32s
calico-system     calico-node-j2qqm                          1/1     Running             0          5m32s
calico-system     calico-node-5hv2d                          1/1     Running             0          5m32s

Then I edited the DaemonSet spec, changing the calico-node container image (.spec.template.spec.containers[].image) from:
docker.io/calico/node:v3.21.4
to:
docker.io/calico/node:v3.20.0

After that, the DaemonSet terminated the v3.21.4 calico-node pods and created new ones that pulled v3.20.0. I waited for a minute and the remaining pods still didn't become healthy, so you might be right that this is an issue in calico-node rather than in the operator.

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          22m
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          23m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          23m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          23m
calico-system     calico-kube-controllers-7cd6f7b9f9-9vhzz   0/1     ContainerCreating   0          5m41s
calico-system     calico-node-qvqm4                          1/1     Running             0          3m56s
calico-system     calico-node-wfs29                          1/1     Running             0          3m44s
calico-system     calico-node-xbdb9                          1/1     Running             0          3m22s
calico-system     calico-node-5ptfb                          1/1     Running             0          2m53s
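In both listings every calico-node pod is 1/1 Running while the pods that need a pod network (coredns, metrics-server, local-path-provisioner, calico-kube-controllers) stay in ContainerCreating, which is consistent with a failure in the CNI setup on the node rather than in calico-node itself. A small hypothetical helper that pulls the not-ready pods out of plain `kubectl get pods -A` output; it assumes the default column layout shown above, and `-o json` would be the more robust input for real scripting:

```python
def not_ready_pods(table: str) -> list[tuple[str, str, str]]:
    """Return (namespace, name, status) for pods whose READY count is incomplete.

    Assumes the default `kubectl get pods -A` columns:
    NAMESPACE NAME READY STATUS RESTARTS AGE
    """
    result = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or "/" not in fields[2]:
            continue  # not a pod row in the expected layout
        namespace, name, ready, status = fields[:4]
        done, total = ready.split("/")
        if done != total:
            result.append((namespace, name, status))
    return result
```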

xinbinhuang commented on August 11, 2024

@tmjd, while waiting for the upstream images to be fixed, is it possible to override the init container images during the operator installation?

tmjd commented on August 11, 2024

One way to temporarily override the init containers is to use the "unsupported" annotation on the calico-node DaemonSet, but once that annotation is added, the operator will no longer update the DaemonSet.
You can see how here: https://github.com/tigera/operator/blob/master/README.md#making-temporary-changes-to-components-the-operator-manages
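Because the operator stops updating a pinned DaemonSet, the annotation has to be removed again once a release with the fix is installed. A minimal Python sketch with hypothetical helper names that detects the pin and builds the JSON Patch to remove it; it assumes presence of the annotation is what matters (the operator's own check may also look at the value), and it skips the op when the annotation is absent because a JSON Patch "remove" on a missing path is an error:

```python
from typing import Any

IGNORE_ANNOTATION = "unsupported.operator.tigera.io/ignore"
# JSON Pointer (RFC 6901) escapes "~" as "~0" and "/" as "~1" inside a key.
IGNORE_POINTER = "/metadata/annotations/" + IGNORE_ANNOTATION.replace("~", "~0").replace("/", "~1")


def is_pinned(daemonset: dict[str, Any]) -> bool:
    """True if the ignore annotation is present on the DaemonSet (any value)."""
    annotations = daemonset.get("metadata", {}).get("annotations") or {}
    return IGNORE_ANNOTATION in annotations


def unpin_patch(daemonset: dict[str, Any]) -> list[dict[str, str]]:
    """JSON Patch that drops the ignore annotation, or [] if it is not set."""
    if not is_pinned(daemonset):
        return []
    return [{"op": "remove", "path": IGNORE_POINTER}]
```

The same removal is available directly as `kubectl annotate daemonset calico-node -n calico-system unsupported.operator.tigera.io/ignore-`, where the trailing `-` tells kubectl to delete that annotation.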

caseydavenport commented on August 11, 2024

This isn't an operator issue; it was caused by our switch to dynamically linked builds of some host binaries (CNI and the pod2daemon flexvol driver).

Both have fixes that will be available in the next Calico release (v3.23).
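Whether a host binary is dynamically linked can be checked on the node itself: a dynamically linked ELF executable carries a PT_INTERP program header naming its dynamic loader, while a fully static build (such as the statically linked rebuilds described at the top of this thread) normally has none. If the named loader does not exist on the host, the kernel cannot start the binary; that is one plausible way such a build breaks on a minimal OS like k3OS, though the thread does not say which loader was missing. The sketch below reads the program headers using only the Python standard library; it is a simplified check rather than a full ELF parser, and the example path `/opt/cni/bin/calico` is an assumption about where install-cni places the plugin on your nodes.

```python
import struct
import sys
from typing import Optional

PT_INTERP = 3  # program header type holding the dynamic loader path


def elf_interpreter(data: bytes) -> Optional[str]:
    """Return the PT_INTERP path of an ELF image, or None if it has none."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64 = data[4] == 2                      # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"     # EI_DATA: 1 = little, 2 = big
    if is_64:
        phoff, = struct.unpack_from(endian + "Q", data, 0x20)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        phoff, = struct.unpack_from(endian + "I", data, 0x1C)
        phentsize, phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    for i in range(phnum):
        base = phoff + i * phentsize
        p_type, = struct.unpack_from(endian + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if is_64:
            offset, = struct.unpack_from(endian + "Q", data, base + 0x08)
            size, = struct.unpack_from(endian + "Q", data, base + 0x20)
        else:
            offset, = struct.unpack_from(endian + "I", data, base + 0x04)
            size, = struct.unpack_from(endian + "I", data, base + 0x10)
        return data[offset:offset + size].rstrip(b"\x00").decode()
    return None


if __name__ == "__main__":
    # Usage: python3 check_interp.py /opt/cni/bin/calico  (path is an assumption)
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            interp = elf_interpreter(f.read())
        verdict = f"dynamically linked, needs {interp}" if interp else "no PT_INTERP (likely static)"
        print(f"{path}: {verdict}")
```

Running it against the CNI plugin and the flexvol driver before and after an upgrade would show whether the binaries installed on the host still depend on a loader that the host may not provide.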
