Comments (20)

clivez avatar clivez commented on July 18, 2024 1

I quite agree with Levo: the problem seems to come from the configuration file for Calico. In my opinion, at least "etcd_endpoints", "etcd_key_file", "etcd_cert_file" and "etcd_ca_cert_file" are needed.

from danm.

hymgg avatar hymgg commented on July 18, 2024

Updated dn so metadata.name matches spec.NetworkID -- didn't know they have to match

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: calico-mgmt
  namespace: example-sriov
spec:
  NetworkID: calico-mgmt
  NetworkType: calico

Pod still failed to start but with new error:

Warning FailedCreatePodSandBox 7s kubelet, mtx-huawei2-bld02 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "834f6e9a1d195a6d410a3e39d1ddb8333d71874801414ce84ba2c04b492086bf" network for pod "sriov-pod": NetworkPlugin cni failed to set up pod "sriov-pod_example-sriov" network: CNI network could not be set up: CNI operation for network:calico-mgmt failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:calico because:OS exec call failed:no etcd endpoints specified

Levovar avatar Levovar commented on July 18, 2024

glad you really decided to try DANM :)
yes, generally speaking Calico should work, I think we had multiple users successfully using it in the past
in your case it's simply a typo:
name: cali-mgmt

danm.k8s.io/interfaces: |
  [
    {"network":"calico-mgmt", "ip":"dynamic"}
  ]

Levovar avatar Levovar commented on July 18, 2024

ah sorry, only saw the update now. I'm still on my morning coffee :)
NetworkID and name: they don't need to match, but you need to provide the name of the network in the connection definition section of your Pod manifest. However you can name your networks anything!
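
The naming rule above can be illustrated with a small check. This is only a sketch, not DANM code: the helper name `unknown_networks` is hypothetical, and it assumes the "network" field of each entry in the `danm.k8s.io/interfaces` annotation must equal the `metadata.name` of a DanmNet in the Pod's namespace.

```python
import json


def unknown_networks(interfaces_annotation, danmnet_names):
    """Return networks referenced in the annotation that match no DanmNet name."""
    requested = [entry["network"] for entry in json.loads(interfaces_annotation)]
    return [name for name in requested if name not in danmnet_names]


# Assumed state from the thread: the DanmNet was named "cali-mgmt",
# while the Pod annotation asked for "calico-mgmt".
danmnets = {"cali-mgmt"}
annotation = '[{"network": "calico-mgmt", "ip": "dynamic"}]'

print(unknown_networks(annotation, danmnets))
```

An empty result means every requested network resolves to an existing DanmNet; spec.NetworkID plays no part in the lookup.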

and for the error: it is thrown by the Calico code, after DANM has delegated the operation. I guess Calico expects some configuration to be present in its backend which is missing. But I confess I'm not that big of a Calico expert, so not sure exactly what's missing.
But for sure the error is not coming from DANM.
summoning @rospring and @clivez , AFAIK they have some Calico experience: guys, any idea what could be the issue here?

Levovar avatar Levovar commented on July 18, 2024

after some doc reading:
https://docs.projectcalico.org/v3.5/usage/calicoctl/configure/etcd
I guess you are missing the ETCD_ENDPOINTS environment variable, or config file option so the Calico CNI cannot find its own backend
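
A rough way to reason about this guess is sketched below. It is an illustration under stated assumptions, not Calico's actual logic: the function `etcd_endpoints_missing` is hypothetical, the config keys "datastore_type" and "etcd_endpoints" are taken from the configs and comments in this thread, and the fallback to etcd when "datastore_type" is absent is an assumption.

```python
import os


def etcd_endpoints_missing(cni_conf, env=None):
    """True if the config appears to select etcd but names no endpoints.

    Assumption: a config without "datastore_type" is treated as etcd-backed.
    """
    env = os.environ if env is None else env
    if cni_conf.get("datastore_type", "etcdv3") == "kubernetes":
        return False  # Kubernetes API datastore: no etcd endpoints required
    return not (cni_conf.get("etcd_endpoints") or env.get("ETCD_ENDPOINTS"))


# A config that carries no datastore settings at its top level.
print(etcd_endpoints_missing({"name": "k8s-pod-network", "type": "calico"}, env={}))
```

If this prints True for the config the delegate actually receives, either the datastore type or the endpoints are not reaching the plugin.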

hymgg avatar hymgg commented on July 18, 2024

Thank you guys. Without DANM, Calico has been working on the k8s cluster as the overlay network, I just reused / renamed its config file to calico-mgmt.conf for danm, so wasn't sure why / where to add the additional config info when it's used as a delegate?

(btw, I've used calico with multus, reusing the same config file, didn't have this kind of issue...)

Levovar avatar Levovar commented on July 18, 2024

Hmm, interesting. We need to go deeper then :)
Two things come to my mind:

  • can you share with us how the etcd store is configured for Calico in your cluster? Is it through environment variables, or via config file / ConfigMap?
  • can you try it with a CNI config file which purely contains Calico's config? the current one has plugin chaining which we don't really do, as we have a 1:1 mapping of interfaces and CNI delegation operations.
    That might be the root cause

hymgg avatar hymgg commented on July 18, 2024

I followed kubeadm doc to apply calico on k8s,
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/

Back then it used 2 files,
https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml

In calico daemonset it specified k8s for datastore, not etcd.

        # Use Kubernetes API as the backing datastore.
        - name: DATASTORE_TYPE
          value: "kubernetes"

When running multus with calico, I used the same option, "datastore_type": "kubernetes",

cat /etc/cni/net.d/05-multus.conf

{
  "name": "multus-cni-network",
  "type": "multus",
  "delegates": [
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.0",
      "plugins": [
        {
          "type": "calico",
          "log_level": "info",
          "datastore_type": "kubernetes",
          "nodename": "mtx-huawei2-bld08",
          "mtu": 1440,
          "ipam": {
            "type": "host-local",
            "subnet": "usePodCidr"
          },
          "policy": {
            "type": "k8s"
          },
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
          }
        },
        {
          "type": "portmap",
          "snat": true,
          "capabilities": {"portMappings": true}
        }
      ]
    }
  ],
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
}

So DANM's way of delegating with calico is more restricted?

Thanks. -Jessica

Levovar avatar Levovar commented on July 18, 2024

well, when handling static delegates you can say it like that. we don't support chaining plugins together, because chaining is usually simply not needed.
so the question arises: what is the "portmap" CNI even used for? :)
Until now we never had a customer who needed the "standard" plugin chaining CNI feature to get something done, simply because we can configure all the features required by a user through our user-friendly management API. So, because we have something better, we don't do the less flexible approach of customizing interface provisioning.
If you tell me what the portmapping CNI is required for, I might give you an alternative which you only need to configure in the dynamic network management API, and not in static files.
Alternatively we can also support chaining if required.

When it comes to dynamic delegates everything is configured through the same dynamic, centralized REST API. Therefore I would say these delegates are actually way less restrictive than sticking to the component specific static CNI files.

So, trying to come up with some takeaways, and next steps:

  • do you really need chaining, or is this just the default provisioning and "portmapping" is not really required?
  • if it is required, maybe we already have a dynamically configurable feature substituting it in a friendlier way
  • if not, maybe we can develop one :)
  • or support chaining, if absolutely required for your use-case!
  • but please try it out first with a CNI config which is not a chained one (i.e. without "plugins", only containing the Calico CNI config), because it is still just a hunch that the chaining is the root cause of your issue
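
The last suggestion, turning a chained config into one that only contains the Calico CNI config, can be sketched as follows. This is a hypothetical helper (`unchain` is not part of DANM or the CNI tooling); it assumes that copying the network-level "name" and "cniVersion" next to the chosen plugin's own fields is all that is needed, and the sample input is abbreviated.

```python
import json


def unchain(conflist, plugin_type):
    """Build a single-plugin CNI config from a chained one (with "plugins")."""
    if "plugins" not in conflist:
        return dict(conflist)  # already a plain, single-plugin config
    for plugin in conflist["plugins"]:
        if plugin.get("type") == plugin_type:
            # Carry over the network-level fields, then the plugin's own settings.
            flat = {k: conflist[k] for k in ("name", "cniVersion") if k in conflist}
            flat.update(plugin)
            return flat
    raise ValueError(f"no plugin of type {plugin_type!r} in the chain")


chained = {
    "name": "k8s-pod-network",
    "cniVersion": "0.3.0",
    "plugins": [
        {"type": "calico", "datastore_type": "kubernetes"},
        {"type": "portmap", "snat": True, "capabilities": {"portMappings": True}},
    ],
}
print(json.dumps(unchain(chained, "calico"), indent=2))
```

Anything the dropped plugins provided (here, portmap's hostPort support) is lost, so this only fits when those features are not in use.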

hymgg avatar hymgg commented on July 18, 2024

The portmap cni was there by default in calico-config, not sure why, k8s doc just says it's required to support hostPort. our apps don't use that. Gonna remove and see.

Thanks. -Jessica

hymgg avatar hymgg commented on July 18, 2024

Thank you sir, worked w/o cni chaining,

cat calico-mgmt.conf

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",
  "type": "calico",
  "log_level": "info",
  "datastore_type": "kubernetes",
  "nodename": "mtx-huawei2-bld08",
  "mtu": 1440,
  "ipam": {
    "type": "host-local",
    "subnet": "usePodCidr"
  },
  "policy": {
    "type": "k8s"
  },
  "kubernetes": {
    "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
  }
}

Gonna move on to add sriov network.

Thanks. -Jessica

ps. will soon be away for 2 weeks

Levovar avatar Levovar commented on July 18, 2024

Cool!
We are not running anywhere, no worries :) Feel free to open follow up issues if you encounter anything out of ordinary during your SRIOV trial!

hymgg avatar hymgg commented on July 18, 2024

Please let me know if I should put this in a new issue.

continue to follow example/device_plugin_demo

$ cat sriov_net.yaml
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: calico-mgmt
  namespace: example-sriov
spec:
  NetworkID: calico-mgmt
  NetworkType: calico
---
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: sriov-a
  namespace: example-sriov
spec:
  NetworkID: sriov-a
  NetworkType: sriov
  Options:
    device_pool: "intel.com/sriov_net_A"
    container_prefix: data_net
    rt_tables: 250
    vlan: 300
    cidr: 10.100.20.0/24
    allocation_pool:
      start: 10.100.20.10
      end: 10.100.20.100

$ cat sriov_pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-pod
  namespace: example-sriov
  labels:
    env: test
  annotations:
    danm.k8s.io/interfaces: |
      [
        {"network":"calico-mgmt", "ip":"dynamic"},
        {"network":"sriov-a", "ip":"none"}
      ]
spec:
  containers:
  - name: sriov-pod
    image: busybox:latest
    args:
    - sleep
    - "1000"
    resources:
      requests:
        intel.com/sriov_net_A: '1'
      limits:
        intel.com/sriov_net_A: '1'
  nodeSelector:
    sriov: enabled

Events:
Type Reason Age From Message


Normal Scheduled 4s default-scheduler Successfully assigned example-sriov/sriov-pod to mtx-huawei2-bld03
Warning FailedCreatePodSandBox 1s kubelet, mtx-huawei2-bld03 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "b91e13fddfdcfeb7a421efbb1b592f24fe2ec5ebdf2862a25ddcff6a78c139af" network for pod "sriov-pod": NetworkPlugin cni failed to set up pod "sriov-pod_example-sriov" network: CNI network could not be set up: CNI operation for network:sriov-a failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:sriov because:OS exec call failed:failed to set up IPAM plugin type "fakeipam" from the device "eno31": No IP was passed to fake IPAM
Normal SandboxChanged 1s kubelet, mtx-huawei2-bld03 Pod sandbox changed, it will be killed and re-created.

$ kubectl get node mtx-huawei2-bld03 -o json | jq '.status.allocatable'
{
  "cpu": "64",
  "ephemeral-storage": "48294789041",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "intel.com/sriov_net_A": "16",
  "intel.com/sriov_net_B": "0",
  "memory": "196389160Ki",
  "pods": "110"
}

Tried "dynamic" instead of "none" in {"network":"sriov-a", "ip":"none"}.
The error was "IPv4 address cannot be dynamically allocated for an L2 network!"

Should it be static? how could the example have worked?

Thanks. -Jessica

Levovar avatar Levovar commented on July 18, 2024

not picky when it comes to number of issues, no KPIs for it :) so we can continue it in this thread if you want!

so, two issues.
the first one is a regression we have introduced recently: "none" type IP allocation does not currently work with SR-IOV. See related issue: #107
It is scheduled to be corrected in DANM 4.1

The second is a config issue in the manifest, but it is actually the desired result: CIDR is not defined in the network manifest, meaning that the network represents a L2 network. So, if you want L3 VFs (with IP), add the "cidr" attribute to the manifest to define the subnet from which IPs can be allocated to a Pod
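
The L2 versus L3 distinction described here can be sketched in a few lines. This is an illustration only, not DANM's validation code: both helper names are hypothetical, and the rule "no cidr means L2, so no dynamic IPv4 allocation" is taken from the explanation above. The sample values come from the sriov-a manifest earlier in the thread.

```python
import ipaddress


def dynamic_ip_allowed(options):
    """A network without "cidr" is treated as L2: nothing to allocate from."""
    return bool(options.get("cidr"))


def pool_inside_cidr(options):
    """Check that allocation_pool start and end both fall inside the cidr."""
    net = ipaddress.ip_network(options["cidr"])
    pool = options["allocation_pool"]
    return all(ipaddress.ip_address(pool[key]) in net for key in ("start", "end"))


sriov_a = {
    "cidr": "10.100.20.0/24",
    "allocation_pool": {"start": "10.100.20.10", "end": "10.100.20.100"},
}
print(dynamic_ip_allowed(sriov_a), pool_inside_cidr(sriov_a))
print(dynamic_ip_allowed({"vlan": 300}))  # no cidr: an L2-only network
```

If the first check passes for a manifest and dynamic allocation is still refused, the stored network object may differ from the manifest that was applied.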

hymgg avatar hymgg commented on July 18, 2024

Is this line not enough in above dn?
cidr: 10.100.20.0/24

Could you find a complete example of sriov with dynamic? Even better if it also has routing across nodes...

Thanks. -Jessica

Levovar avatar Levovar commented on July 18, 2024

ah my bad, did not notice yours already has a CIDR! yes it should be enough.
are you running 3.3, or 4.0? In 3.3 the networks were only validated after their creation, so it can happen that the validation failed. In 4.0 we validate them already at the time of their creation with the "webhook" component.
However, if you run 4.0, the "webhook" is a mandatory component. If you run 4.0 without the webhook, that would explain this behaviour.

if you are running 3.3: can you send me the exact output of "kubectl describe danmnet sriov-a -n example-sriov", and the output of kubectl logs of any netwatcher Pod?
then I can tell you more

regarding routing: well, with SR-IOV you are basically building a good, old-fashioned L2 domain. so assuming you have configured the VLAN tag in the DanmNet for all of your PFs of all of your computes in your switch, connectivity between nodes is achieved by the simple in-subnet switching
if you want to connect to other IPs belonging to other subnets, you can provision IP routes via the "routes" parameter in the DanmNet, or policy-based IP routes via the "proutes" parameter in the connection annotation

Levovar avatar Levovar commented on July 18, 2024

Meanwhile: if you do use 4.0 I corrected the "none" type issue in #110
If you change the CNI binary on your cluster to the new one you could give it a go

Levovar avatar Levovar commented on July 18, 2024

Let's consider this thread closed from the perspective of the original issue, but if you still have any questions related to SR-IOV feel free to open a new one!

tcnieh avatar tcnieh commented on July 18, 2024

Quite agree with Levo, seems the problem come from the configuration file for calico, in my opinion, at least "etcd_endpoints" "etcd_key_file" "etcd_cert_file" and "etcd_ca_cert_file" are needed.

@clivez Hello there, I am now using DANM to create Calico networks, but I am facing the same error: "CNI operation for network:calico-1 failed with:CNI delegation failed due to error:Error delegating ADD to CNI plugin:calico because:OS exec call failed:no etcd endpoints specified".
You mentioned above that "etcd_endpoints", "etcd_key_file", "etcd_cert_file" and "etcd_ca_cert_file" are the minimum needed. In which config file should I set these arguments, /etc/cni/net.d/calico-1.conf or /etc/cni/net.d/calico-kubeconfig?

In the meantime, I tried setting the etcd_endpoints IP, taken from etcd_pod_kube_system, in both /etc/cni/net.d/calico-1.conf and /etc/cni/net.d/calico-kubeconfig, but it does not seem to work.

Sorry if I should not reply to a closed issue; I can open a new one or ask on Slack.

Levovar avatar Levovar commented on July 18, 2024

I think the problem here was similar to what you have experienced with your Flannel config, i.e. the Calico config in this case was also in "chained" format
have you verified it yet?
