intel-device-plugins-for-kubernetes's Issues

FPGA: init container for plugin daemon set

The idea is to create an init container for the FPGA device plugin DaemonSet that does the following:

  • host mount some directory, e.g. /opt/intel/fpga and CRI-O hook configuration directory
  • rsync --delete all needed libraries and binaries for fpga programming into this directory (including fpga_crihook)
  • force-update fpga_crihook configuration file for CRI-O

check for formatting issues in TravisCI

It looks like TravisCI is supposed to detect code that isn't formatted in the canonical way, because there is a make format invocation in .travis.yml. But all that does is reformat the code without failing the CI build, because go fmt returns a zero exit code regardless of whether changes were necessary.

Here's what I've been using in a different project:

test_fmt:
	@ files=$$(find pkg cmd test -name '*.go'); \
	if [ $$(gofmt -d $$files | wc -l) -ne 0 ]; then \
		echo "formatting errors:"; \
		gofmt -d $$files; \
		false; \
	fi
fmt:
	gofmt -l -w $$(find pkg cmd test -name '*.go')
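
The same idea can be sketched in Go with the standard go/format package, which applies gofmt-style formatting to a source buffer. The helper name isFormatted is hypothetical and not part of the repository; the point is only that the check must return a non-zero exit code when a file differs from its formatted form.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"os"
)

// isFormatted reports whether src is already in canonical form.
// Hypothetical helper for illustration; not part of the repository.
func isFormatted(src []byte) (bool, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return false, err // src does not parse as Go
	}
	return bytes.Equal(src, formatted), nil
}

func main() {
	failed := false
	// Check every file named on the command line.
	for _, path := range os.Args[1:] {
		src, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			failed = true
			continue
		}
		ok, err := isFormatted(src)
		if err != nil || !ok {
			fmt.Fprintf(os.Stderr, "formatting error: %s\n", path)
			failed = true
		}
	}
	if failed {
		os.Exit(1) // a non-zero exit code is what makes the CI job fail
	}
}
```

Unlike make format, this only reads the files, so a CI run cannot silently rewrite them.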

Can’t run the demo on k8s node with two FPGAs

Hi.
I'm trying the FPGA plugin and I have 2 questions related to a multi-FPGA k8s node.
If anyone here can give me a hint, I would appreciate it very much.

I have a DELL R740 server with two Intel PACs (FPGA boards), and I use this server as a k8s node.
BTW, I also have a DELL R640 with a single Intel PAC, and I have no problem with that one.

Question1

I want to create two pods, each with one PAC attached.
But inside a pod, I can see both PACs with the lspci command.
Is this behavior correct?

$ lspci  | grep accelerators
86:00.0 Processing accelerators: Intel Corporation Device 09c4
d8:00.0 Processing accelerators: Intel Corporation Device 09c4

Question2

When I run nlb3 from ubuntu-demo-opae with the FPGA card,
the demo fails with Error: device enumeration failed. in af mode, although the AFU id is correct,
and with Error: couldn't open device. in both af and region mode.

The only way I found to run the demo is to change af mode to region mode and also specify the bus number, like ./nlb3 -B 0x86.

Is it required to specify the bus number on a multi-FPGA node?
If so, how do I find out the bus number of the allocated FPGA from inside the pod? I cannot tell which one it is with the lspci command.

pod creation

I modified the YAML and created my pod as follows:

  • test-fpga-region2.yml (copy of test-fpga-region.yml)
-name: test-fpga-region
+name: test-fpga-region2
-command: ["sh", "/usr/bin/test_fpga.sh"]
+command: ["sleep", "3600"]
-fpga.intel.com/arria10-nlb3: 1 
+fpga.intel.com/arria10: 1 # To use region mode

nlb3 execution result

I ran nlb3 in my pod as follows:

## execute nlb3  (region mode)

$ kubectl exec -ti test-fpga-region2 /usr/bin/nlb3
[22][WARN][accelerator::open] Errors encountered while opening accelerator resource
Error: couldn't open device.
command terminated with exit code 102

## execute nlb3 with Bus number option (region mode)

$ kubectl exec -ti test-fpga-region2 /usr/bin/nlb3 -- -B 0x86


Cachelines Read_Count Write_Count Cache_Rd_Hit Cache_Wr_Hit Cache_Rd_Miss Cache_Wr_Miss   Eviction 'Clocks(@200 MHz)'   Rd_Bandwidth   Wr_Bandwidth
         1          1           0            0            0             0             0          0              155     0.083 GB/s     0.000 GB/s

VH0_Rd_Count VH0_Wr_Count VH1_Rd_Count VH1_Wr_Count VL0_Rd_Count VL0_Wr_Count
           1            1            0            0            0            0

## execute nlb3  (af mode)
$ kubectl exec -ti test-fpga-region /usr/bin/nlb3
[74][WARN][accelerator::open] Errors encountered while opening accelerator resource
Error: couldn't open device.
command terminated with exit code 102

## execute nlb3 with Bus number option (af mode)
$ kubectl exec -ti test-fpga-region /usr/bin/nlb3 -- -B 0x86
Error: device enumeration failed.
Please make sure that the driver is loaded and that a bitstream for
AFU id: F7DF405C-BD7A-CF72-22F1-44B0B93ACD18 is programmed.
command terminated with exit code 102

k8s status

$ kubectl get no
NAME     STATUS   ROLES    AGE   VERSION
r740     Ready    <none>   31d   v1.12.3
master   Ready    master   40d   v1.12.3

$ kubectl get po
NAME                                            READY   STATUS    RESTARTS   AGE
intel-fpga-webhook-deployment-98d745549-nf8v5   1/1     Running   0          17h

$ kubectl describe node r740 | grep fpga.intel.com
fpga.intel.com/device-plugin-mode: region
fpga.intel.com/region-9926ab6d6c925a68aabca7d84c545738:  2
fpga.intel.com/region-9926ab6d6c925a68aabca7d84c545738:  2

$ kubectl describe node r740 | grep hugepage
 hugepages-1Gi:                                       0
 hugepages-2Mi:                                       2Gi
 hugepages-1Gi:                                       0
 hugepages-2Mi:                                       2Gi
 
$ cat /sys/class/fpga/intel-fpga-dev.0/intel-fpga-port.0/afu_id
f7df405cbd7acf7222f144b0b93acd18

$ cat /sys/class/fpga/intel-fpga-dev.1/intel-fpga-port.1/afu_id
f7df405cbd7acf7222f144b0b93acd18
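
The AFU id in the nlb3 error message and the value in the sysfs afu_id files are the same identifier in two notations (UUID with dashes in upper case versus plain lower-case hex). A minimal Go sketch of that comparison follows; the function name normalizeAFUID is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeAFUID converts an AFU id in UUID notation (as printed by nlb3)
// into the dash-free lower-case form found in the sysfs afu_id files.
// Hypothetical helper for illustration only.
func normalizeAFUID(id string) string {
	return strings.ToLower(strings.ReplaceAll(strings.TrimSpace(id), "-", ""))
}

func main() {
	fromNlb3 := "F7DF405C-BD7A-CF72-22F1-44B0B93ACD18" // from the error message
	fromSysfs := "f7df405cbd7acf7222f144b0b93acd18\n"  // from the afu_id file
	fmt.Println(normalizeAFUID(fromNlb3) == normalizeAFUID(fromSysfs))
}
```

With the values from this report the two forms match, which is consistent with the statement that the programmed AFU id is correct.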

Wrong FPGA device plugin af/region ID

I followed the doc cmd/fpga_plugin.

Run FPGA device plugin as administrator in af mode

# kubectl describe node node08 node09 node10 | grep fpga.intel.com
                    fpga.intel.com/device-plugin-mode: af
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
  fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18      0           0
  fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee  0           0
                    fpga.intel.com/device-plugin-mode: af
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
  fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18      0          0
  fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee  0          0
                    fpga.intel.com/device-plugin-mode: af
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
 fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18:      1
 fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee:  0
  fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18      0          0
  fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee  0          0

The af ID is af-f7df405cbd7acf7222f144b0b93acd18, but when I ran the demo with fpga.intel.com/arria10-nlb3 replaced by fpga.intel.com/arria10-nlb0, I got:

[root@node08 demo]# kubectl apply -f test-fpga-region-ypc.yml
pod/test-fpga-region-ypc2 created
[root@node08 demo]#
[root@node08 demo]# kubectl describe pod test-fpga-region-ypc2
Name:               test-fpga-region-ypc2
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"test-fpga-region-ypc2","namespace":"default"},"spec":{"containers":[{...
Status:             Pending
IP:
Containers:
  test-container:
    Image:      ubuntu-demo-opae:devel
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      /usr/bin/test_fpga.sh
    Limits:
      cpu:                                                 1
      fpga.intel.com/af-d8424dc4a4a3c413f89e433683f9040b:  1
      hugepages-2Mi:                                       20Mi
    Requests:
      cpu:                                                 1
      fpga.intel.com/af-d8424dc4a4a3c413f89e433683f9040b:  1
      hugepages-2Mi:                                       20Mi
    Environment:                                           <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9mrqb (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-9mrqb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9mrqb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  3s    default-scheduler  0/3 nodes are available: 3 Insufficient fpga.intel.com/af-d8424dc4a4a3c413f89e433683f9040b, 3 Insufficient hugepages-2Mi.

The same situation appears when I use region mode. My region ID is fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee, but when I run the demo, it requests region-9926ab6d6c925a68aabca7d84c545738.

So how do I set the right configuration in the YAML?

spec:
  containers:
  - name: test-container
    ...
    resources:
      limits:
        fpga.intel.com/arria10-nlb3: 1  # here
        cpu: 1
        hugepages-2Mi: 20Mi

Thank you!
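
The FailedScheduling event above boils down to a name comparison: the pod ends up requesting a resource name that no node advertises. The Go sketch below reproduces that comparison with the values from this report. The function missingResources is hypothetical and only illustrates the check; it is not how the scheduler is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// missingResources returns the requested resource names for which the node
// does not advertise enough allocatable capacity.
// Hypothetical helper for illustration only.
func missingResources(allocatable, requested map[string]int) []string {
	var missing []string
	for name, want := range requested {
		if allocatable[name] < want {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Values copied from the kubectl describe node output in this report.
	allocatable := map[string]int{
		"fpga.intel.com/af-f7df405cbd7acf7222f144b0b93acd18":     1,
		"fpga.intel.com/region-a9f2d0f3b39857b0b34fd226bf364fee": 0,
	}
	// Name shown under Limits in the pod description after arria10-nlb0 was requested.
	requested := map[string]int{
		"fpga.intel.com/af-d8424dc4a4a3c413f89e433683f9040b": 1,
	}
	fmt.Println(missingResources(allocatable, requested))
}
```

The node offers af-f7df405c..., while the pod asks for af-d8424dc4..., so the request cannot be satisfied on any node.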

Document updating of /etc/crio/crio.conf

Some installations (e.g. the Ubuntu package) have this in the default config:

hooks_dir = [
]

This means that no hooks will be discovered or enabled.
The documentation needs to be updated to make sure that /etc/containers/oci/hooks.d is listed there.

intel-fpga-plugin dial tcp i/o timeout

The intel-fpga-plugin pod reports an error:

$ kubectl logs intel-fpga-plugin-5dppj -n kube-system
ERROR: Get https://110.1.0.1:443/api/v1/nodes/node10: dial tcp 110.1.0.1:443: i/o timeout

I created the Kubernetes cluster as follows. There are two worker nodes in the cluster: node09 (110.1.1.109) and node10 (110.1.1.110).

kubeadm init --kubernetes-version=1.14.2 \
  --apiserver-advertise-address=110.1.1.108  \
  --image-repository registry.aliyuncs.com/google_containers  \
  --service-cidr=110.1.0.0/16  --pod-network-cidr=110.244.0.0/16

I created the intel-fpga-plugin DaemonSet as described in intel-device-plugins-for-kubernetes/cmd/fpga_plugin, and I completed all the guides: FPGA device plugin, FPGA admission controller webhook, FPGA prestart CRI-O hook.

However, the logs of the intel-fpga-plugin pods and the intel-fpga-webhook pod show errors. These pods try to access https://110.1.0.1:443, but there is no pod with this IP.

Could you help me figure out how to fix it? Did I create the cluster with a wrong service-cidr?
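
For context: 110.1.0.1 is the first address of the service range 110.1.0.0/16, which Kubernetes assigns to the built-in kubernetes Service, so it is a virtual service IP rather than a pod IP. The node addresses given above lie inside that same range. Whether this overlap causes the timeout is an assumption, but it is easy to check, for example with this small Go sketch (the helper name inCIDR is hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

// inCIDR reports whether ip falls inside the given CIDR block.
// Hypothetical helper for illustration only.
func inCIDR(cidr, ip string) (bool, error) {
	_, block, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	addr := net.ParseIP(ip)
	if addr == nil {
		return false, fmt.Errorf("invalid IP address %q", ip)
	}
	return block.Contains(addr), nil
}

func main() {
	serviceCIDR := "110.1.0.0/16" // value passed to --service-cidr
	// API server and worker node addresses from this report.
	for _, ip := range []string{"110.1.1.108", "110.1.1.109", "110.1.1.110"} {
		overlaps, err := inCIDR(serviceCIDR, ip)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s inside %s: %v\n", ip, serviceCIDR, overlaps)
	}
}
```

All three host addresses fall inside the service range, so picking a service CIDR that does not contain the node network would be a reasonable first thing to try.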

Restructure README.md file(s)

We need to split the current README.md file into several files:

  1. The main README.md file in the top-level directory should have a high-level description of the repo, the supported k8s versions, and a list of pointers to the README files for each plugin in the repo.
  2. CONTRIBUTING.md - a description of how to contribute to the repository.
  3. A common file, building.md or something like that: instructions on how to get the sources and build the binaries.
  4. Each plugin (GPU, FPGA, ...) needs a dedicated README file containing a pointer to building.md, followed by a description of the functionality and usage of that particular plugin.

Screencast video is not up to date

Can we update the screencast video to match the current screencast.sh and the updated clearlinux-demo-opae demo?
The current video shows using intel-fpga-demo-compress:devel which doesn't exist in the repo.

qat_plugin: update loop exhausts VF driver devices

The update loop doing scan() exhausts devices from the VF driver pool

for _, driver := range append(dp.kernelVfDrivers, dp.dpdkDriver) {

After a while, all devices bound to kernelVfDrivers have been unbound and the plugin only scans dpdkDriver directory.

FPGA: Failure on creating container if host OS does not have some python modules

Container creation fails if the host OS lacks some Python modules:
9926ab6d6c925a68aabca7d84c545738/f7df405cbd7acf7222f144b0b93acd18: can't get bitstream info

/opt/intel/fpga-sw/opae/bin/packager 
jsonschema module has no validatiors() or exceptions()
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/opt/intel/fpga-sw/opae/bin/packager/__main__.py", line 2, in <module>
  File "/opt/intel/fpga-sw/opae/bin/packager/packager.py", line 35, in <module>
  File "/opt/intel/fpga-sw/opae/bin/packager/afu.py", line 52, in <module>
ImportError: No module named jsonschema

FPGA CRI hook: implement support for OpenCL bitstreams

Current implementation supports only OPAE bitstreams.
It would be great to have OpenCL bitstreams also supported.

OpenVINO project uses OpenCL bitstreams, so supporting OpenCL bitstreams would also help to support OpenVINO workloads.

demo/ubuntu-demo-opencl looks like can't work

Hi,
I tried to build demo/ubuntu-demo-opencl and use it to test gpu_plugin, but it looks like this demo doesn't work correctly.

  1. If I just run this image with docker on the host OS, the app crashes.

$ docker run -it --rm -v /dev:/dev -v /etc:/etc -v /opt:/opt --privileged ubuntu-demo-opencl:devel

$ ./run-opencl-example.sh /root/6-1/fft

Segmentation fault (core dumped)

At the same time, it looks like clinfo cannot get the GPU information.

$ clinfo

Number of platforms 0

  2. If I copy /root/6-1/fft/* to the host environment and execute it there, the app works correctly.

I'm not sure whether something is wrong in my steps, or whether some extra setup is needed.

kubernetes application

I allocated just one FPGA region for my pod. However, inside my pod I can see all the FPGAs on this host. The attachment named cern.yml contains the details of my pod; 1.png is the output of kubectl get pod -o yaml, 2.png is the output of lspci | grep 9c[45] run in my pod, 3.png is the output of fpgainfo fme in my pod, and 4.png is the output of fpgainfo port in my pod.

[Attachments: 1.png, 2.png, 3.png, 4.png, cern.yml]

VCA1(VCA1283LVV) support?

I installed the plug-in on the VCA card, but when I started gpu_plugin in debug mode, I saw the following error:
[image: screenshot of the error]

Is VCA1 supported?

FPGA CRI hook: make hook output to be visible in the logs

Currently output of the hook is not visible anywhere, which makes it hard to investigate what's going on when hook doesn't work as expected.

It would be great to have the output visible in the CRI-O systemd logs: journalctl -u crio.

Hopefully it's just a matter of configuring systemd to put the logs into its binary log.

can't build fpga_plugin

What's the golang version I should use? I can't pass the fpga_plugin build:
with golang 1.6 I get the following error:

../../internal/deviceplugin/server.go:18:2: cannot find the package "context" in any of:
$GOROOT
$GOPATH

With Golang 1.7 and 1.8 I get:

*server does not implement v1beta1.DevicePluginServer (wrong type for Allocate method)
		have Allocate("context".Context, *v1beta1.AllocateRequest) (*v1beta1.AllocateResponse, error)
		want Allocate("github.com/intel/intel-device-plugins-for-kubernetes/vendor/golang.org/x/net/context".Context, *v1beta1.AllocateRequest) (*v1beta1.AllocateResponse, error)

And with the latest version I get:

import cycle not allowed
package github.com/intel/intel-device-plugins-for-kubernetes/cmd/fpga_plugin
	imports flag
	imports errors
	imports runtime
	imports internal/bytealg
	imports internal/cpu
	imports runtime
Makefile:35: recipe for target 'fpga_plugin' failed
make: *** [fpga_plugin] Error 1

FPGA: support multiple FPGA devices

FPGA plugin, Admission Webhook and CRI hook currently support only one FPGA device. They should be extended to support multiple devices. This will help to support more complex FPGA workflows that use more than one accelerated function.

Health monitoring for FPGA

We have seen situations where an FPGA device becomes unresponsive after using some bitstreams.
We need proper health reporting from the device plugin.

Upgrade to OPAE 1.3.0

Evaluate and potentially upgrade the tools in init container to OPAE SDK 1.3.0 (or later)

QAT: crypto-perf image fails to build

Here is what I got when I tried to build the crypto-perf image:

$ cd demo
$ ./build-image.sh crypto-perf
++ dirname ./build-image.sh
+ CWD=.
+ IMG=crypto-perf
+ '[' -z crypto-perf ']'
+ '[' '!' -d crypto-perf ']'
+ PROXY_VARS='http_proxy https_proxy'
+ BUILD_ARGS=
+ for proxy in '$PROXY_VARS'
+ '[' -v http_proxy ']'
++ tr -d ' '
++ echo http://proxy-chain.intel.com:911
+ val=http://proxy-chain.intel.com:911
+ BUILD_ARGS=' --build-arg http_proxy=http://proxy-chain.intel.com:911'
+ RUN_ARGS=' -e http_proxy=http://proxy-chain.intel.com:911'
+ for proxy in '$PROXY_VARS'
+ '[' -v https_proxy ']'
++ tr -d ' '
++ echo http://proxy-chain.intel.com:911
+ val=http://proxy-chain.intel.com:911
+ BUILD_ARGS=' --build-arg http_proxy=http://proxy-chain.intel.com:911 --build-arg https_proxy=http://proxy-chain.intel.com:911'
+ RUN_ARGS=' -e http_proxy=http://proxy-chain.intel.com:911 -e https_proxy=http://proxy-chain.intel.com:911'
+ docker build -t crypto-perf --build-arg http_proxy=http://proxy-chain.intel.com:911 --build-arg https_proxy=http://proxy-chain.intel.com:911 ./crypto-perf/
Sending build context to Docker daemon  2.048kB
Step 1/8 : FROM centos:7
 ---> 5182e96772bf
Step 2/8 : ENV SHELL /bin/bash
 ---> Using cache
 ---> 19e95b4e9030
Step 3/8 : COPY ./resolv.conf /etc/resolv.conf
COPY failed: stat /var/lib/docker/tmp/docker-builder406254565/resolv.conf: no such file or directory

qat_plugin: additional review comments

I've tested the QAT plug-in on my host and read the code a little. Here are my observations and improvement ideas:

  • while re-running the plug-in multiple times, I eventually run "out of resources" because the devices remain bound to the dpdk user IO driver. The driver should probably implement:
    for i in `ls -d /sys/bus/pci/drivers/vfio-pci/0000:*`; do echo `basename $i` > /sys/bus/pci/drivers/vfio-pci/unbind; done (for the given dpdk-driver) in its init (Edit: this might be trickier than just unbind)
  • The maxNumdevice check looks wrong: 1) the loop takes up to maxNumdevice devices from each kernel driver, resulting in up to N_drivers * maxNumdevice devices; 2) the loop continues over all files even if maxNumdevice has already been reached.
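
The second observation can be illustrated with a short Go sketch of a scan loop that enforces one global limit across all drivers and stops as soon as it is reached. Everything in it (the function collectDevices, its parameters, and the PCI addresses) is hypothetical and not taken from the plug-in's code.

```go
package main

import "fmt"

// collectDevices gathers device names from several per-driver lists while
// enforcing a single global limit. Hypothetical sketch, not the plug-in's code.
func collectDevices(perDriver [][]string, maxNumDevices int) []string {
	var devices []string
	for _, driverDevices := range perDriver {
		for _, dev := range driverDevices {
			if len(devices) >= maxNumDevices {
				return devices // stop scanning once the limit is reached
			}
			devices = append(devices, dev)
		}
	}
	return devices
}

func main() {
	// Made-up PCI addresses, three devices per driver.
	perDriver := [][]string{
		{"0000:3d:01.0", "0000:3d:01.1", "0000:3d:01.2"},
		{"0000:3f:01.0", "0000:3f:01.1", "0000:3f:01.2"},
	}
	fmt.Println(collectDevices(perDriver, 4))
}
```

Because the counter is shared across drivers, the result never exceeds the limit no matter how many drivers are scanned.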

OpenCL does not work on servers with two PACs

Hi. I'm trying to run an OpenVINO container with an FPGA using this plug-in.

When I run OpenVINO on my DELL R640 with one PAC,
I can execute inference on the FPGA without any problem.

But on my DELL R740, which has two PACs,
I cannot execute OpenVINO inference or the OpenCL memory diagnostics.

I have attached logs from some experiments below.

log summary

Detailed logs of these 6 patterns follow below.

  • DELL R640 (it has one PAC)

      0. on the pod
      • "It works"
  • DELL R740 (it has two PACs)

      1. on the host
      2. on the pod (each pod is attached 1 PAC)
      3. on the pod (single pod is attached 2 PAC)
      4. on docker container without k8s (container is attached 2 PAC)
      • "It works, but 2 PAC are necessary..."
      5. on docker container without k8s (container is attached 1 PAC)

environments

DELL R640 opae version 1.0.2-1
DELL R740 opae version 1.0.2-1

R640

0. on the pod

It works

[kaz@r640-2 openvino]$ k apply -f openvino-pod.yaml
pod/openvino-pod created


[kaz@r640-2 openvino]$ k get po
NAME                                            READY   STATUS    RESTARTS   AGE
intel-fpga-webhook-deployment-98d745549-27cnx   1/1     Running   2          33d
openvino-pod-nfwrc                              1/1     Running   0          11s


[kaz@r640-2 openvino]$ k exec -ti openvino-job-nfwrc bash
root@openvino-pod-nfwrc:/#

root@openvino-pod-nfwrc:/# aocl diagnose
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_f300000     Passed            PAC Arria 10 Platform (pac_a10_f300000)
                                      PCIe 216:00.0
                                      FPGA temperature = 52 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices


root@openvino-pod-nfwrc:/# aocl diagnose all
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_f300000)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Write top speed = 6192.44 MB/s
Read top speed = 5507.40 MB/s
Throughput = 5849.92 MB/s

DIAGNOSTIC_PASSED


[ INFO ] InferenceEngine:
	API version ............ 1.4
	Build .................. 17328
[ INFO ] Parsing input parameters
[ INFO ] Files were added: 1
[ INFO ]     /opt/intel/computer_vision_sdk/deployment_tools/demo/car.png
[ INFO ] Loading plugin

	API version ............ 1.4
	Build .................. heteroPlugin
	Description ....... heteroPlugin
[ INFO ] Loading network files:
	/root/openvino_models/ir/squeezenet1.1/FP32/squeezenet1.1.xml
	/root/openvino_models/ir/squeezenet1.1/FP32/squeezenet1.1.bin
[ INFO ] Preparing input blobs
[ WARNING ] Image is resized from (787, 259) to (227, 227)
[ INFO ] Batch size is 1
[ INFO ] Preparing output blobs
[ INFO ] Loading model to the plugin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Top 10 results:

Image /opt/intel/computer_vision_sdk/deployment_tools/demo/car.png

817 0.8741471 label sports car, sport car
511 0.0435212 label convertible
479 0.0435212 label car wheel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ INFO ] Execution successful

R740

1. on the host

[kaz@r740 openvino]$ aocl diagnose
--------------------------------------------------------------------
Device Name:
acl0

Package Pat:
/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00001     Passed            PAC Arria 10 Platform (pac_a10_ee00001)
                                      PCIe 216:00.0
                                      FPGA temperature = 43 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
--------------------------------------------------------------------
Device Name:
acl1

Package Pat:
/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00000     Passed            PAC Arria 10 Platform (pac_a10_ee00000)
                                      PCIe 134:00.0
                                      FPGA temperature = 42 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices


[kaz@r740 openvino]$ aocl diagnose all
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00001)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6824.68 MB/s
Read top speed = 6797.44 MB/s
Throughput = 6811.06 MB/s

DIAGNOSTIC_PASSED

Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00000)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6848.64 MB/s
Read top speed = 6766.13 MB/s
Throughput = 6807.38 MB/s

DIAGNOSTIC_PASSED

2. on the pod (each pod is attached 1 PAC)

[kaz@r740 openvino]$ k apply -f openvino-pod.yaml
pod/openvino-pod created

[kaz@r740 openvino]$ k apply -f openvino-pod2.yaml
pod/openvino-pod2 created

[kaz@r740 openvino]$ k get po
NAME                                            READY   STATUS    RESTARTS   AGE
intel-fpga-webhook-deployment-98d745549-27cnx   1/1     Running   2          33d
openvino-pod                                    1/1     Running   0          4s
openvino-pod2                                   1/1     Running   0          2s

[kaz@r740 openvino]$ k exec -ti openvino-pod bash
root@openvino-pod:/# aocl diagnose
Error opening AFC: no driver available
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00001     Passed            PAC Arria 10 Platform (pac_a10_ee00001)
                                      PCIe 216:00.0
                                      FPGA temperature = 42 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
root@openvino-pod:/# aocl diagnose all
Error opening AFC: no driver available
Error opening AFC: no driver available

root@openvino-pod:/# exit
exit

[kaz@r740 openvino]$ k exec -ti openvino-pod2 bash
root@openvino-pod2:/# aocl diagnose
Error opening AFC: no driver available
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00000     Passed            PAC Arria 10 Platform (pac_a10_ee00000)
                                      PCIe 134:00.0
                                      FPGA temperature = 41 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
root@openvino-pod2:/# aocl diagnose all
Error opening AFC: no driver available
Error opening AFC: no driver available
root@openvino-pod2:/#

3. on the pod (single pod is attached 2 PAC)

[kaz@r740 openvino]$ clear
[kaz@r740 openvino]$ k apply -f openvino-pod12.yaml
pod/openvino-pod12 created
[kaz@r740 openvino]$ k exec -ti openvino-pod12 bash
root@openvino-pod12:/# aocl diagnose
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00001     Passed            PAC Arria 10 Platform (pac_a10_ee00001)
                                      PCIe 216:00.0
                                      FPGA temperature = 43 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
--------------------------------------------------------------------
Device Name:
acl1

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00000     Passed            PAC Arria 10 Platform (pac_a10_ee00000)
                                      PCIe 134:00.0
                                      FPGA temperature = 41 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices

root@openvino-pod12:/# aocl diagnose all
Error initializing DMA: 1
Error initializing mmd dma
Error initializing bsp
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00001)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6853.11 MB/s
Read top speed = 6812.40 MB/s
Throughput = 6832.75 MB/s

DIAGNOSTIC_PASSED

Error initializing DMA: 1
Error initializing mmd dma
Error initializing bsp
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Can't open device #1
root@openvino-pod12:/#

root@openvino-pod12:/# aocl diagnose acl0
Error initializing DMA: 1
Error initializing mmd dma
Error initializing bsp
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00001)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6845.63 MB/s
Read top speed = 6823.58 MB/s
Throughput = 6834.60 MB/s

DIAGNOSTIC_PASSED

root@openvino-pod12:/# aocl diagnose acl1
Error initializing DMA: 1
Error initializing mmd dma
Error initializing bsp
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Can't open device #1

4. on docker container without k8s (container is attached 2 PAC)

It works, but 2 PAC are necessary...

root@bf083a3cd0a1:/# aocl diagnose
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00001     Passed            PAC Arria 10 Platform (pac_a10_ee00001)
                                      PCIe 216:00.0
                                      FPGA temperature = 43 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------
--------------------------------------------------------------------
Device Name:
acl1

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00000     Passed            PAC Arria 10 Platform (pac_a10_ee00000)
                                      PCIe 134:00.0
                                      FPGA temperature = 42 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
root@bf083a3cd0a1:/# aocl diagnose all
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00001)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6800.08 MB/s
Read top speed = 6746.52 MB/s
Throughput = 6773.30 MB/s

DIAGNOSTIC_PASSED
Using platform: Intel(R) FPGA SDK for OpenCL(TM)
Using Device with name: pac_a10 : PAC Arria 10 Platform (pac_a10_ee00000)
~~~~~~~~~~~~~
~~~~~~~~~~~~~
Write top speed = 6852.69 MB/s
Read top speed = 6690.27 MB/s
Throughput = 6771.48 MB/s

DIAGNOSTIC_PASSED

5. on docker container without k8s (container is attached 1 PAC)

[kaz@r740 openvino]$ docker run --rm -it \
    --mount type=bind,source=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/,destination=/opt/a10_gx_pac_ias_1_1_pv/ \
    --mount type=bind,source=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib/,destination=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib/ \
    --mount type=bind,source=/opt/intel/computer_vision_sdk/bitstreams/a10_dcp_bitstreams/,destination=/opt/intel/computer_vision_sdk/bitstreams/a10_dcp_bitstreams/ \
    --mount type=bind,source=/opt/intel/fpga-sw/opae/lib/,destination=/opt/intel/fpga-sw/opae/lib/ \
    --mount type=bind,source=/opt/intel/fpga-sw/opencl/opencl_bsp/linux64/lib/,destination=/opt/intel/fpga-sw/opencl/opencl_bsp/linux64/lib/ \
    --mount type=bind,source=/home/kaz/inteldevstack/intelFPGA_pro,destination=/opt/intel/intelFPGA_pro \
    --mount type=bind,source=/opt/altera,destination=/opt/altera \
    --mount type=bind,source=/etc/OpenCL/vendors,destination=/etc/OpenCL/vendors \
    --mount type=bind,source=/opt/Intel/OpenCL/Boards,destination=/opt/Intel/OpenCL/Boards \
    --device /dev/intel-fpga-fme.0:/dev/intel-fpga-fme.0 \
    --device /dev/intel-fpga-port.0:/dev/intel-fpga-port.0 \
    --cap-add=IPC_LOCK openvino_fpga:1.0

root@b9d0856a4801:/# aocl diagnose
Error opening AFC: no driver available
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00000     Passed            PAC Arria 10 Platform (pac_a10_ee00000)
                                      PCIe 134:00.0
                                      FPGA temperature = 42 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
root@b9d0856a4801:/# aocl diagnose all
Error opening AFC: no driver available
Error opening AFC: no driver available
root@b9d0856a4801:/# exit
exit

Switch to the other PAC:


[kaz@r740 openvino]$ docker run --rm -it --mount type=bind,source=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/,destination=/opt/a10_gx_pac_ias_1_1_pv/ --mount type=bind,source=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib/,destination=/home/kaz/inteldevstack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib/ --mount type=bind,source=/opt/intel/computer_vision_sdk/bitstreams/a10_dcp_bitstreams/,destination=/opt/intel/computer_vision_sdk/bitstreams/a10_dcp_bitstreams/ --mount type=bind,source=/opt/intel/fpga-sw/opae/lib/,destination=/opt/intel/fpga-sw/opae/lib/ --mount type=bind,source=/opt/intel/fpga-sw/opencl/opencl_bsp/linux64/lib/,destination=/opt/intel/fpga-sw/opencl/opencl_bsp/linux64/lib/ --mount type=bind,source=/home/kaz/inteldevstack/intelFPGA_pro,destination=/opt/intel/intelFPGA_pro --mount type=bind,source=/opt/altera,destination=/opt/altera --mount type=bind,source=/etc/OpenCL/vendors,destination=/etc/OpenCL/vendors --mount type=bind,source=/opt/Intel/OpenCL/Boards,destination=/opt/Intel/OpenCL/Boards  --device /dev/intel-fpga-fme.1:/dev/intel-fpga-fme.1 --device /dev/intel-fpga-port.1:/dev/intel-fpga-port.1 --cap-add=IPC_LOCK openvino_fpga:1.0
root@3d83eeca62e8:/# export AOCL_BOARD_PACKAGE_ROOT=/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp
root@3d83eeca62e8:/# export LD_LIBRARY_PATH=/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp/linux64/lib:$LD_LIBRARY_PATH:
root@3d83eeca62e8:/# export LD_LIBRARY_PATH=/opt/intel/fpga-sw/opae/lib/:$LD_LIBRARY_PATH:
root@3d83eeca62e8:/#
root@3d83eeca62e8:/# aocl diagnose
Error opening AFC: no driver available
--------------------------------------------------------------------
Device Name:
acl0

BSP Install Location:
/opt/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp

Vendor: Intel Corp

Physical Dev Name   Status            Information

pac_a10_ee00001     Passed            PAC Arria 10 Platform (pac_a10_ee00001)
                                      PCIe 216:00.0
                                      FPGA temperature = 43 degrees C.

DIAGNOSTIC_PASSED
--------------------------------------------------------------------

Call "aocl diagnose <device-names>" to run diagnose for specified devices
Call "aocl diagnose all" to run diagnose for all devices
root@3d83eeca62e8:/# aocl diagnose all
Error opening AFC: no driver available
Error opening AFC: no driver available

Any chance of an Intel RealSense device plugin?

We are currently building a solution on top of Kubernetes and would like to plug RealSense cameras all the way into containers. It would be really good to see support for USB devices like RealSense.

Automate e2e testing for fpga_admissionwebhook

The proposed design is that the Makefile target e2e-test, which depends on intel-fpga-admissionwebhook,

  1. creates a cluster with kind,
  2. copies the image intel-fpga-admissionwebhook:devel to the cluster's container,
  3. imports the image inside the cluster,
  4. deploys the webhook,
  5. creates a test pod like
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
  annotations:
      key1: value1
spec:
  restartPolicy: Never
  containers:
    -
      name: test-pod-1-container-1
      image: ubuntu:bionic
      imagePullPolicy: IfNotPresent
      command: [ "ls", "-l", "/" ]
      resources:
        limits:
          fpga.intel.com/arria10-nlb0: 1

  6. waits until the pod gets successfully mutated to
# kubectl get pods -o jsonpath="{..env}"
[map[name:FPGA_AFU_1 value:d8424dc4a4a3c413f89e433683f9040b] map[name:FPGA_REGION_1 value:9926ab6d6c925a68aabca7d84c545738]]
  7. otherwise times out,
  8. and finally tears down the cluster.
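
The "wait until mutated, otherwise time out" steps could be sketched roughly as below. This is only an illustration in Python, not the project's test code: fetch_env is a hypothetical callable that returns the container's env entries (for example parsed from kubectl JSON output), and the timeout and interval defaults are assumptions. Only the two variable names come from the expected output above.

```python
import time

# Variable names taken from the expected kubectl output in this issue.
EXPECTED_ENV = ("FPGA_AFU_1", "FPGA_REGION_1")


def is_mutated(env_entries, expected=EXPECTED_ENV):
    """Return True if every expected variable name appears in the env list."""
    names = {entry.get("name") for entry in env_entries}
    return all(name in names for name in expected)


def wait_for_mutation(fetch_env, timeout=60.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_env() until the pod is mutated; return False on timeout.

    fetch_env is a hypothetical callable returning the container's env as a
    list of {"name": ..., "value": ...} dicts. clock and sleep are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_mutated(fetch_env()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test target would call wait_for_mutation with a real fetcher and fail the run when it returns False, then tear down the cluster in either case.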

Dockerfile for openssl-qat-engine fails to build with debian:sid as builder

I am not sure exactly what is going on, but the ca-certificates package refuses to install. Output 1 below shows the failure when using the Dockerfile as is. Output 2 below shows the failure after I modified the Dockerfile to install ca-certificates explicitly right after apt-get update. The build failed on 3 different Clear Linux machines with builds from December and from March. These machines are all behind a proxy. However, another Clear Linux machine behind a transparent proxy (no need to set proxy variables) builds fine. If I change the Dockerfile so that the builder is based on ubuntu:latest instead of debian:sid, the build gets past the ca-certificates error and nearly finishes, apart from some sed errors. This worked for me a couple of days ago with debian:sid, so I have no idea what might have changed, and I can't find any recent reports from others about installing this package on Debian.

I have tried the following without success:
a. RUN apt-get install --reinstall ca-certificates before installing ca-certificates
b. RUN dpkg --purge --force-depends ca-certificates before installing ca-certificates
c. dpkg --configure -a

Note: When I explicitly install ca-certificates I get a lot of errors about TERM not being set, but if I set ENV DEBIAN_FRONTEND="noninteractive" those errors don't come up. The image still fails to build from debian, though.

1

Setting up libalgorithm-merge-perl (0.08-3) ...
Processing triggers for libc-bin (2.28-7) ...
Errors were encountered while processing:
 ca-certificates
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get update &&     apt-get install -y git build-essential wget libssl-dev openssl libudev-dev pkg-config autoconf autogen libtool &&     git clone https://github.com/intel/QAT_Engine &&     git clone -b OpenSSL_1_1_1-stable https://github.com/openssl/openssl.git &&     wget https://01.org/sites/default/files/downloads/intelr-quickassist-technology/$QAT_DRIVER_RELEASE.tar.gz &&     tar zxf $QAT_DRIVER_RELEASE.tar.gz' returned a non-zero code: 100

2

Unpacking libssl1.1:amd64 (1.1.1b-1) ...
Selecting previously unselected package openssl.
Preparing to unpack .../openssl_1.1.1b-1_amd64.deb ...
Unpacking openssl (1.1.1b-1) ...
Selecting previously unselected package ca-certificates.
Preparing to unpack .../ca-certificates_20190110_all.deb ...
Unpacking ca-certificates (20190110) ...
Setting up libssl1.1:amd64 (1.1.1b-1) ...
Setting up openssl (1.1.1b-1) ...
Setting up ca-certificates (20190110) ...
Updating certificates in /etc/ssl/certs...
mv: cannot move '/tmp/ca-certificates.crt.tmp.sL7u4e' to a subdirectory of itself, 'ca-certificates.crt'
dpkg: error processing package ca-certificates (--configure):
 installed ca-certificates package post-installation script subprocess returned error exit status 1
Processing triggers for libc-bin (2.28-7) ...
Errors were encountered while processing:
 ca-certificates
E: Sub-process /usr/bin/dpkg returned an error code (1)
The command '/bin/sh -c apt-get install -y ca-certificates' returned a non-zero code: 100

Support for GVT-g vGPUs

I'm interested in using GVT-g to virtualize the GPUs in my cluster nodes and assign them to pods (essentially for use with qemu in kubevirt).

I came up with a very naïve implementation, and wanted to check with you whether there's any interest in making this part of the "official" device plugin?

"make build" against vendored Kubernetes v1.13 fails

The code generated with the code generator from client-go v8.0 isn't compatible with the latest K8s release. As a result, make build produces this output:

$ make all
cd cmd/fpga_admissionwebhook; go build
# github.com/intel/intel-device-plugins-for-kubernetes/vendor/k8s.io/client-go/pkg/apis/clientauthentication/v1beta1
../../vendor/k8s.io/client-go/pkg/apis/clientauthentication/v1beta1/zz_generated.conversion.go:39:15: scheme.AddGeneratedConversionFuncs undefined (type *runtime.Scheme has no field or method AddGeneratedConversionFuncs)
# github.com/intel/intel-device-plugins-for-kubernetes/vendor/k8s.io/client-go/pkg/apis/clientauthentication/v1alpha1
../../vendor/k8s.io/client-go/pkg/apis/clientauthentication/v1alpha1/zz_generated.conversion.go:39:15: scheme.AddGeneratedConversionFuncs undefined (type *runtime.Scheme has no field or method AddGeneratedConversionFuncs)
make: *** [Makefile:35: fpga_admissionwebhook] Error 2

Upgrade vendored K8s packages to v1.13 and fix the build issues.

Use FPGA in container

I can use the FPGA on a physical machine with OpenCL, but I cannot use it the same way in a container. In particular, I cannot run aocl program on an .aocx file in a container. Can someone tell me how to use an FPGA in a container, or whether it doesn't make sense to use it in containers? If it is possible, can someone give me an artificial intelligence demo that uses an FPGA in a container?

FPGA: create Ansible playbook

Due to the complexity of the plugin setup and the many components involved, it would make sense to automate its deployment by creating an Ansible playbook.

gpu plugin device can only be used by one pod

Hi, currently the gpu plugin exposes only one device instance, e.g. one 'card0' with device nodes /dev/dri/card0 and /dev/dri/renderD128, so only one pod can use it. But a drm device node can be accessed by any number of clients, so limiting it to one pod is wasteful: the gpu device could be utilized by many more pods.

I'm not sure what the best way to handle this is. One option is to pass the maximum number of pods allowed to access the gpu when the plugin starts; the plugin would then report that number of devices to the kubelet, which could serve more pods with gpu access. Ideas?
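
As a rough illustration of that option (a Python sketch, not the plugin's code): each physical card is advertised several times under distinct IDs, all pointing at the same device nodes. The function name, the shared_num parameter and the "<card>-<index>" ID scheme are assumptions made up for this example.

```python
def expand_shared_devices(cards, shared_num):
    """Advertise each physical card shared_num times under distinct IDs.

    cards maps a card name (e.g. "card0") to the device nodes it needs.
    The "<card>-<index>" ID scheme is an assumption made for this sketch.
    """
    if shared_num < 1:
        raise ValueError("shared_num must be at least 1")
    devices = {}
    for card, nodes in sorted(cards.items()):
        for index in range(shared_num):
            # Every advertised instance maps to the same underlying nodes.
            devices[f"{card}-{index}"] = list(nodes)
    return devices
```

With shared_num set to 3 and a single card0, the kubelet would see three allocatable devices, so up to three pods could be scheduled onto the same GPU. Nothing in this sketch isolates those pods from each other.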

qat_plugin: README: add notes about kernel mode

We should add a brief intro about -mode kernel to the qat_plugin README. It should also note that this mode currently requires all UIO devices to be mounted in all containers.

Multiple afu modes

There are 2 FPGA AFUs on one of our hosts, with UUIDs 18b79ffa-2ee5-4aa0-96ef-4230dafacb5f and d8424dc4-a4a3-c413-f89e-433683f9040b, as shown in the attachment named afu.png.
However, if we run fpga_plugin in af mode, only one AFU (18b79ffa-2ee5-4aa0-96ef-4230dafacb5f) is registered in k8s, as shown in the attachment named registed_afu.png.
[image: afu]
[image: registed_afu]

non-root workloads

Implement (with documentation) cluster/node setup scripts and deployments that allow running non-root workloads.

Need to explicitly check for QAT devices when DPDK drivers are used as other devices (eg NICs) bound to DPDK driver get wrongly counted as QAT devices

When the device plugin runs it scans the /sys/bus/pci/drivers/ directory for the following drivers to detect the VFs associated with them:
dh895xccvf,c6xxvf,c3xxxvf,d15xxvf,igb_uio

The first 4 drivers have no issue as they are the kernel drivers for QAT VFs.
However, when the device plugin scans the igb_uio driver for attached devices, it detects not only QAT VFs but also any NIC VFs that are bound to the igb_uio driver.
This results in the QAT device plugin managing and allocating NIC VFs.
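
One possible shape of such an explicit check, sketched in Python rather than in the plugin's own code: for every device bound to a driver, read the vendor and device attributes from sysfs and keep only those matching known QAT VF IDs. The device ID set below is an assumption and should be verified against the kernel's QAT driver sources; the function names are hypothetical.

```python
import os

INTEL_VENDOR_ID = "0x8086"
# Assumed QAT VF device IDs; verify against the kernel's QAT driver headers.
QAT_VF_DEVICE_IDS = {"0x0443", "0x37c9", "0x19e3", "0x6f55"}


def read_sysfs_attr(path):
    """Read a one-line sysfs attribute and normalize it."""
    with open(path) as f:
        return f.read().strip().lower()


def qat_vfs_bound_to(driver_dir):
    """Return PCI addresses under driver_dir whose IDs match QAT VFs."""
    found = []
    for entry in sorted(os.listdir(driver_dir)):
        dev_dir = os.path.join(driver_dir, entry)
        vendor_file = os.path.join(dev_dir, "vendor")
        device_file = os.path.join(dev_dir, "device")
        # Entries such as bind, unbind or new_id have no ID attributes.
        if not (os.path.isfile(vendor_file) and os.path.isfile(device_file)):
            continue
        if (read_sysfs_attr(vendor_file) == INTEL_VENDOR_ID
                and read_sysfs_attr(device_file) in QAT_VF_DEVICE_IDS):
            found.append(entry)
    return found
```

Calling qat_vfs_bound_to("/sys/bus/pci/drivers/igb_uio") would then skip NIC VFs bound to igb_uio, provided the ID set is correct for the hardware in use.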
