
Comments (10)

TothFerenc avatar TothFerenc commented on June 2, 2024

It should work without resource requests (i.e. without asking the SR-IOV Device Plugin to select VFs). In that case the sriov CNI tries to assign a random VF to the Pod, provided there are enough free VFs of the PF on that host.
Can you please provide the exact error message?
My guess is that the Pod was scheduled to a wrong host where SR-IOV VFs were not prepared in advance.

from danm.

nokia-t1zhou avatar nokia-t1zhou commented on June 2, 2024

I have tested it with 2 YAML files.
The first one (Pod is created successfully):

apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: fakefhup
  namespace: cran1
spec:
  selector:
    matchLabels:
      name: fakefhup
  template:
    metadata:
      annotations:
        danm.k8s.io/interfaces: |
        danm.k8s.io/interfaces: |
          [
            { "network":"internal" },
            { "network":"bip" },
            { "network":"intmsg", "ip":"dynamic" },
            { "network":"fronthaulcu", "ip":"dynamic" }
          ]
      labels:
        name: fakefhup
    spec:
      dnsPolicy: ClusterFirst
      #nodeSelector:
      #  nodetype: caas_master
      containers:
        - name: fakefhup
          image: registry.kube-system.svc.nokia.net:5555/rcp/centos:7
          command: ['sh', '-c', 'while true; do echo Hello Kubernetes! && sleep 100;done']
          resources:
            requests:
              nokia.k8s.io/sriov_ens1f1: '1' #one P1 ens1f1 physical NIC based SR-IOV VF requested for bip
              nokia.k8s.io/sriov_ens11f1: '1' #one P1 ens11f1 physical NIC based SR-IOV VF requested for fronthaulcu
            limits:
              nokia.k8s.io/sriov_ens1f1: '1' # keep the same value as request
              nokia.k8s.io/sriov_ens11f1: '1' # keep the same value as request

The other one (Pod creation fails):

apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: fakefhup
  namespace: cran1
spec:
  selector:
    matchLabels:
      name: fakefhup
  template:
    metadata:
      annotations:
        danm.k8s.io/interfaces: |
        danm.k8s.io/interfaces: |
          [
            { "network":"internal" },
            { "network":"bip" },
            { "network":"intmsg", "ip":"dynamic" },
            { "network":"fronthaulcu", "ip":"dynamic" }
          ]
      labels:
        name: fakefhup
    spec:
      dnsPolicy: ClusterFirst
      #nodeSelector:
      #  nodetype: caas_master
      containers:
        - name: fakefhup
          image: registry.kube-system.svc.nokia.net:5555/rcp/centos:7
          command: ['sh', '-c', 'while true; do echo Hello Kubernetes! && sleep 100;done']
          #resources:
          #  requests:
          #    nokia.k8s.io/sriov_ens1f1: '1' #one P1 ens1f1 physical NIC based SR-IOV VF requested for bip
          #    nokia.k8s.io/sriov_ens11f1: '1' #one P1 ens11f1 physical NIC based SR-IOV VF requested for fronthaulcu
          #  limits:
          #    nokia.k8s.io/sriov_ens1f1: '1' # keep the same value as request
          #    nokia.k8s.io/sriov_ens11f1: '1' # keep the same value as request

The error log from kubelet is:
Warning FailedCreatePodSandBox 32s kubelet, 192.168.87.22 Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "c199bde6c16e3e7ba9e029af70f8ed4a4bf47b6370bb5c170998f447bea8f205" network for pod "fakefhup-92s7b": NetworkPlugin cni failed to set up pod "fakefhup-92s7b_cran1" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input

We have prepared enough VFs on all hosts in advance.


TothFerenc avatar TothFerenc commented on June 2, 2024

It seems like a bug.
Can you please share the DanmNet definition for the failing case? Thanks.

Anyway, why don't you want to use resource requests? (as that is the preferred method)


changyi2409 avatar changyi2409 commented on June 2, 2024

The DanmNet definitions are as follows:

apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: internal
  namespace: cran1
spec:
  NetworkID: internal
  NetworkType: flannel
  Options:
    allocation_pool:
      end: ""
      start: ""
    container_prefix: ""
    host_device: ""
    rt_tables: 254


apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: intmsg
  namespace: cran1
spec:
  NetworkID: intmsg
  NetworkType: ipvlan
  Options:
    host_device: "eno1"
    cidr: "192.168.2.0/24"
    allocation_pool:
      start: "192.168.2.5"
      end: "192.168.2.100"
    container_prefix: "intmsg"
    rt_tables: 0
    vlan: 2


apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  name: bip
  namespace: cran1
spec:
  NetworkID: bip
  NetworkType: sriov
  Options:
    allocation_pool:
      start: ""
      end: ""
    host_device: "ens1f1"
    container_prefix: "bip"
    vlan: 717
    device_pool: "xxx.k8s.io/sriov_ens1f1" # I hid the device pool name with xxx


Levovar avatar Levovar commented on June 2, 2024
  annotations:
    danm.k8s.io/interfaces: |
    danm.k8s.io/interfaces: |
      [

is this a typo in your comment, or an issue in your manifest?


nokia-t1zhou avatar nokia-t1zhou commented on June 2, 2024

I have done the test again, and the issue is the same as before.
The 2 networks below are used: one is flannel, the other is SR-IOV.

[root@controller-3 YAML]# kubectl get danmnet internal -n=cran2 -o yaml
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  creationTimestamp: "2019-05-15T14:24:55Z"
  generation: 2
  name: internal
  namespace: cran2
  resourceVersion: "39274"
  selfLink: /apis/danm.k8s.io/v1/namespaces/cran2/danmnets/internal
  uid: 362a366f-771d-11e9-9606-d8c497cf132e
spec:
  NetworkID: internal
  NetworkType: flannel
  Options:
    allocation_pool:
      end: ""
      start: ""
    container_prefix: ""
    host_device: ""
    rt_tables: 254
  Validation: "True"
[root@controller-3 YAML]# kubectl get danmnet fronthaulmanagement -n=cran2 -o yaml
apiVersion: danm.k8s.io/v1
kind: DanmNet
metadata:
  creationTimestamp: "2019-05-16T07:13:18Z"
  generation: 26
  name: fronthaulmanagement
  namespace: cran2
  resourceVersion: "406248"
  selfLink: /apis/danm.k8s.io/v1/namespaces/cran2/danmnets/fronthaulmanagement
  uid: 1481c4af-77aa-11e9-a2be-d8c497cf1308
spec:
  NetworkID: fronthaulmanagement
  NetworkType: sriov
  Options:
    alloc: gQ==
    allocation_pool:
      end: 10.70.31.38
      start: 10.70.31.34
    cidr: 10.70.31.32/29
    container_prefix: fhm
    device_pool: nokia.k8s.io/sriov_ens11f1
    host_device: ens11f1
    rt_tables: 0
    vlan: 705
  Validation: "True"

First, I use the YAML below to create the Pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zhoutong-pod
  namespace: cran2
spec:
  selector:
    matchLabels:
      app: pod-test
  replicas: 1
  template:
    metadata:
      labels:
        app: pod-test
      annotations:
        danm.k8s.io/interfaces: |
          [
            {
              "network":"internal"
            },
            {
              "network":"fronthaulmanagement"
            }
          ]
    spec:
      hostNetwork: false
      nodeSelector:
        nodename: caas_master1
      containers:
      - name: zhoutong-container
        image: registry.kube-system.svc.nokia.net:5555/rcp/centos:7
        command: ['/bin/bash']
        imagePullPolicy: IfNotPresent
        stdin: true
        tty: true
      restartPolicy: Always

The result is that the Pod can't start up, with the kubelet error logs below:

E0520 02:32:02.002353  101682 cni.go:331] Error adding cran2_zhoutong-pod-5fb64758d-w5wn6/93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445 to network danm/meta_cni: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0520 02:32:02.446638  101682 remote_runtime.go:109] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445" network for pod "zhoutong-pod-5fb64758d-w5wn6": NetworkPlugin cni failed to set up pod "zhoutong-pod-5fb64758d-w5wn6_cran2" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0520 02:32:02.446720  101682 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "zhoutong-pod-5fb64758d-w5wn6_cran2(4f456f2b-7aa7-11e9-a2be-d8c497cf1308)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445" network for pod "zhoutong-pod-5fb64758d-w5wn6": NetworkPlugin cni failed to set up pod "zhoutong-pod-5fb64758d-w5wn6_cran2" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0520 02:32:02.446768  101682 kuberuntime_manager.go:693] createPodSandbox for pod "zhoutong-pod-5fb64758d-w5wn6_cran2(4f456f2b-7aa7-11e9-a2be-d8c497cf1308)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445" network for pod "zhoutong-pod-5fb64758d-w5wn6": NetworkPlugin cni failed to set up pod "zhoutong-pod-5fb64758d-w5wn6_cran2" network: netplugin failed but error parsing its diagnostic message "": unexpected end of JSON input
E0520 02:32:02.446870  101682 pod_workers.go:190] Error syncing pod 4f456f2b-7aa7-11e9-a2be-d8c497cf1308 ("zhoutong-pod-5fb64758d-w5wn6_cran2(4f456f2b-7aa7-11e9-a2be-d8c497cf1308)"), skipping: failed to "CreatePodSandbox" for "zhoutong-pod-5fb64758d-w5wn6_cran2(4f456f2b-7aa7-11e9-a2be-d8c497cf1308)" with CreatePodSandboxError: "CreatePodSandbox for pod \"zhoutong-pod-5fb64758d-w5wn6_cran2(4f456f2b-7aa7-11e9-a2be-d8c497cf1308)\" failed: rpc error: code = Unknown desc = failed to set up sandbox container \"93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445\" network for pod \"zhoutong-pod-5fb64758d-w5wn6\": NetworkPlugin cni failed to set up pod \"zhoutong-pod-5fb64758d-w5wn6_cran2\" network: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
W0520 02:32:03.294604  101682 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "zhoutong-pod-5fb64758d-w5wn6_cran2": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445"
W0520 02:32:03.337761  101682 pod_container_deletor.go:75] Container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445" not found in pod's containers
W0520 02:32:03.341558  101682 cni.go:309] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "93d2cd0cc9fd8bd8615caf6204a8b2bc6df78efa08e8c2804579def220d8b445"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xf5fdaa]

goroutine 1 [running]:
main.getAllocatedDevices(0xc000318750, 0x130b540, 0xc00011d860, 0xc00011eec0, 0x1a, 0xc000476b00, 0x4, 0xc00031c000)
        /build/src/github.com/nokia/danm/pkg/danm/danm.go:194 +0x6a
main.setupNetworking(0xc000318750, 0x0, 0x0, 0x6)
        /build/src/github.com/nokia/danm/pkg/danm/danm.go:233 +0x897
main.createInterfaces(0xc00032acb0, 0x1184102, 0x5)
        /build/src/github.com/nokia/danm/pkg/danm/danm.go:82 +0x4ff
github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).checkVersionAndCall(0xc000121ef8, 0xc00032acb0, 0x13191e0, 0xc000100540, 0x1201ab8, 0x0, 0x130d580)
        /build/src/github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:162 +0x259
github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel.(*dispatcher).pluginMain(0xc000121ef8, 0x1201ab8, 0x1201ac0, 0x13191e0, 0xc000100540, 0x42d231)
        /build/src/github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:173 +0x32e
github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel.PluginMainWithError(...)
        /build/src/github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:210
github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel.PluginMain(0x1201ab8, 0x1201ac0, 0x13191e0, 0xc000100540)
        /build/src/github.com/nokia/danm/pkg/vendor/github.com/containernetworking/cni/pkg/skel/skel.go:222 +0xf3
main.main()
        /build/src/github.com/nokia/danm/pkg/danm/danm.go:477 +0x8c
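The stack trace points at a nil dereference in `getAllocatedDevices` when the Pod carries no resource requests. A hedged sketch of the kind of guard that would avoid the crash (the function name comes from the trace, but the signature and types below are invented for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

// checkpoint stands in for the device-allocation data the trace
// dereferences; it can legitimately be absent when the Pod made
// no resource request for the pool.
type checkpoint struct {
	DeviceIDs []string
}

// getAllocatedDevices returns an explicit error instead of letting a
// nil dereference crash the whole CNI ADD operation.
func getAllocatedDevices(cp *checkpoint, pool string) ([]string, error) {
	if cp == nil {
		return nil, errors.New("no devices allocated from pool " + pool +
			": Pods connecting to Device Plugin managed networks must declare resource requests")
	}
	return cp.DeviceIDs, nil
}

func main() {
	_, err := getAllocatedDevices(nil, "nokia.k8s.io/sriov_ens11f1")
	fmt.Println(err)
}
```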

Then I use the YAML below to create the Pod; it works and the Pod is running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zhoutong-pod
  namespace: cran2
spec:
  selector:
    matchLabels:
      app: pod-test
  replicas: 1
  template:
    metadata:
      labels:
        app: pod-test
      annotations:
        danm.k8s.io/interfaces: |
          [
            {
              "network":"internal"
            },
            {
              "network":"fronthaulmanagement"
            }
          ]
    spec:
      hostNetwork: false
      nodeSelector:
        nodename: caas_master1
      containers:
      - name: zhoutong-container
        image: registry.kube-system.svc.nokia.net:5555/rcp/centos:7
        command: ['/bin/bash']
        imagePullPolicy: IfNotPresent
        stdin: true
        tty: true
        resources:
          requests:
            nokia.k8s.io/sriov_ens11f1: '1'
          limits:
            nokia.k8s.io/sriov_ens11f1: '1'
      restartPolicy: Always


Levovar avatar Levovar commented on June 2, 2024

Yep, there is definitely an error in the DANM code. I mean it cores, so :)

But besides improving the error handling for this scenario, the problem is that I really don't think DANM and the SR-IOV CNI should try to allocate anything on their own in a Device Plugin managed setup.
Let's say we expose 8 VFs from a PF. Those 8 VFs are managed by the Device Manager inside Kubelet from that point onward, and NOT by the DP.
If we automatically allocated a VF to a Pod without it specifying resource requests, the Device Manager would never know about this allocation happening behind its back.
As a result, it would continue to advertise 8 VFs' worth of capacity, when in reality we only have 7 left.

So, yes, it is actually mandatory. We just need to handle this scenario gracefully within the DANM code, and return an explicit error, rather than core.
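The capacity-accounting argument above can be sketched as a toy model (all names here are invented for illustration; this is not DANM or kubelet code):

```go
package main

import "fmt"

// deviceManager is a toy stand-in for kubelet's Device Manager: it only
// knows about allocations that were requested through it.
type deviceManager struct {
	capacity  int
	allocated int
}

func (m *deviceManager) Allocate() { m.allocated++ }
func (m *deviceManager) Free() int { return m.capacity - m.allocated }

func main() {
	mgr := &deviceManager{capacity: 8} // the Device Plugin exposed 8 VFs
	actualInUse := 0

	// A CNI hands out a VF directly, bypassing the Device Manager,
	// so the allocation is never recorded:
	actualInUse++

	fmt.Println(mgr.Free())                 // 8 - still advertised as free
	fmt.Println(mgr.capacity - actualInUse) // 7 - actually free
}
```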


nokia-t1zhou avatar nokia-t1zhou commented on June 2, 2024

> Yep, there is definitely an error in the DANM code. I mean it cores, so :)
>
> But besides improving the error handling for this scenario, the problem is that I really don't think DANM and the SR-IOV CNI should try to allocate anything on their own in a Device Plugin managed setup.
> Let's say we expose 8 VFs from a PF. Those 8 VFs are managed by the Device Manager inside Kubelet from that point onward, and NOT by the DP.
> If we automatically allocated a VF to a Pod without it specifying resource requests, the Device Manager would never know about this allocation happening behind its back.
> As a result, it would continue to advertise 8 VFs' worth of capacity, when in reality we only have 7 left.
>
> So, yes, it is actually mandatory. We just need to handle this scenario gracefully within the DANM code, and return an explicit error, rather than core.

Yes, I agree with you. Please update the README to indicate that the resource requests in the Pod YAML are mandatory.


Levovar avatar Levovar commented on June 2, 2024

> > Yep, there is definitely an error in the DANM code. I mean it cores, so :)
> > So, yes, it is actually mandatory. We just need to handle this scenario gracefully within the DANM code, and return an explicit error, rather than core.
>
> Yes, I agree with you. Please update the README to indicate that the resource requests in the Pod YAML are mandatory.

Definitely. I will keep this issue open to track both the update of the documentation and the improvement of the error handling code.


Levovar avatar Levovar commented on June 2, 2024

I'm kind of sure that the underlying code issue is fixed by #119.

Documentation was also updated: 9cd7db7

So I think this issue can be considered done.

