
fpga_as_a_service's Introduction

FPGA_as_a_Service

This repository will host FPGA_as_a_Service related projects.

Contents

Name: Description
k8s-device-plugin: A DaemonSet deployed on Kubernetes that discovers the FPGAs installed in each node and runs FPGA-accessible containers in the k8s cluster.
Xilinx Base Runtime: Maintains unified Docker images with XRT (Xilinx Runtime) preinstalled and provides scripts to set up and flash Alveo cards.
Containerization: Provides scripts to build Docker application images for multiple cloud vendors: Nimbix, AWS, and Azure.
Xilinx Container Runtime: An extension of runC, modified to add Xilinx devices before containers are run.
XRM: The Xilinx FPGA Resource Manager, the software that manages all FPGA hardware on the system.

fpga_as_a_service's People

Contributors

dcasnowdon, durgabhavaniv, egallen, hasheddan, imrickysu, kenhill, luciferlee, mattsnow-amd, michalkonieczny91, songc-xil, xiaoqun2011, xuhz, yuzhang66, zhangyuu


fpga_as_a_service's Issues

"Run hello world in the pod" produces no output

All of the earlier deployment steps went fine, but at the final step, "Run hello world in the pod", the output is as follows:
root@mypod:~# source /opt/xilinx/xrt/setup.sh
XILINX_XRT : /opt/xilinx/xrt
PATH : /opt/xilinx/xrt/bin:/opt/xilinx/xrt/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH : /opt/xilinx/xrt/lib:/opt/xilinx/xrt/lib:
root@mypod:~# xbutil scan
INFO: Found total 1 card(s), 1 are usable
[0]mgmt:[02:00.0]:0x5000:0x000e:[xclmgmt:2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471:512]
[0]user:[02:00.1]:0x5001:0x000e:[xocl:2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471:129]
root@mypod:~# cd /tmp/alveo-u200/xilinx_u200_xdma_201830_1/test/
root@mypod:/tmp/alveo-u200/xilinx_u200_xdma_201830_1/test# ls -al
total 88120
drwxr-xr-x 1 root root 4096 Dec 25 05:48 .
drwxr-xr-x 1 root root 4096 Apr 16 2019 ..
-r--r--r-- 1 root root 50632826 Apr 16 2019 bandwidth.xclbin
-r--r--r-- 1 root root 29256 Apr 16 2019 kernel_bw.exe
-rwxrwxrwx 1 root root 22744 Apr 16 2019 validate.exe
-rwxrwxrwx 1 root root 39523869 Apr 16 2019 verify.xclbin
root@mypod:/tmp/alveo-u200/xilinx_u200_xdma_201830_1/test# ./validate.exe ./verify.xclbin

Platform information
Platform name: Xilinx
Platform version: OpenCL 1.0
Platform profile: EMBEDDED_PROFILE
Platform extensions: cl_khr_icd

Found 1 compute devices!:
loading ./verify.xclbin

ERROR: Failed to load xclbin.
Error: Failed to create compute program!
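The /tmp/alveo-u200/xilinx_u200_xdma_201830_1/test directory does not contain a verify.exe file, even though the image I pulled from Docker Hub is the latest one, so I ran ./validate.exe ./verify.xclbin instead. The output does not include the hello world part. Could you help take a look? Thanks.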

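Tried on AWS F1, but the result does not look as expected

Hello, I am trying this service on an AWS F1 machine. The full log is below.
Note: kubectl version is v.16.3, so I had to modify the YAML file.
Using kubectl create with either the generic YAML or the YAML in the aws folder, both services come up without errors, but judging by the logs, neither looks like it is actually working.
Could you confirm whether this state is OK?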

ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-67c766df46-k8kk7 1/1 Running 1 46h
coredns-67c766df46-kdlw5 1/1 Running 1 46h
etcd-minikube 1/1 Running 1 46h
kube-addon-manager-minikube 1/1 Running 1 46h
kube-apiserver-minikube 1/1 Running 1 46h
kube-controller-manager-minikube 1/1 Running 1 46h
kube-proxy-xlng4 1/1 Running 1 47h
kube-scheduler-minikube 1/1 Running 1 46h
storage-provisioner 1/1 Running 2 46h
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk$ kubectl create -f fpga-device-plugin.yml
daemonset.apps/fpga-device-plugin-daemonset created
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-67c766df46-k8kk7 1/1 Running 1 47h
coredns-67c766df46-kdlw5 1/1 Running 1 47h
etcd-minikube 1/1 Running 1 46h
fpga-device-plugin-daemonset-nvzgp 1/1 Running 0 22s
kube-addon-manager-minikube 1/1 Running 1 46h
kube-apiserver-minikube 1/1 Running 1 46h
kube-controller-manager-minikube 1/1 Running 1 46h
kube-proxy-xlng4 1/1 Running 1 47h
kube-scheduler-minikube 1/1 Running 1 46h
storage-provisioner 1/1 Running 2 47h
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk$ kubectl logs fpga-device-plugin-daemonset-nvzgp -n kube-system
time="2019-12-15T06:25:14Z" level=info msg="Starting FS watcher."
time="2019-12-15T06:25:14Z" level=info msg="Starting OS watcher."
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk$ cd aws
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$ kubectl create -f aws-fpga-device-plugin.yaml
Error from server (AlreadyExists): error when creating "aws-fpga-device-plugin.yaml": daemonsets.apps "fpga-device-plugin-daemonset" already exists
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$ vi aws-fpga-device-plugin.yaml
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$ kubectl create -f aws-fpga-device-plugin.yaml
daemonset.apps/aws-fpga-device-plugin-daemonset created
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
aws-fpga-device-plugin-daemonset-d9tgj 1/1 Running 0 20s
coredns-67c766df46-k8kk7 1/1 Running 1 47h
coredns-67c766df46-kdlw5 1/1 Running 1 47h
etcd-minikube 1/1 Running 1 47h
fpga-device-plugin-daemonset-nvzgp 1/1 Running 0 3m21s
kube-addon-manager-minikube 1/1 Running 1 47h
kube-apiserver-minikube 1/1 Running 1 47h
kube-controller-manager-minikube 1/1 Running 1 47h
kube-proxy-xlng4 1/1 Running 1 47h
kube-scheduler-minikube 1/1 Running 1 47h
storage-provisioner 1/1 Running 2 47h
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$ kubectl logs aws-fpga-device-plugin-daemonset-d9tgj -n kube-system
time="2019-12-15T06:28:09Z" level=info msg="Starting FS watcher."
time="2019-12-15T06:28:09Z" level=info msg="Starting OS watcher."
ubuntu@ip-:~ /faas/FPGA_as_a_Service/k8s-fpga-device-plugin/trunk/aws$
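
One way to tell whether a plugin that logs only the two watcher lines has actually found a device is to look at what the node advertises. The sketch below assumes the plugin registers its resources under the `xilinx.com/` prefix (as the `xilinx.com/fpga-...` resource name in another report on this page suggests) and filters the `status.allocatable` map of a node object as returned by `kubectl get node <name> -o json`. The function name `fpga_resources` and the sample data are hypothetical.

```python
import json
from typing import Dict


def fpga_resources(node: dict, prefix: str = "xilinx.com/") -> Dict[str, str]:
    """Return the allocatable resources of a node whose names start with prefix."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {name: qty for name, qty in allocatable.items() if name.startswith(prefix)}


if __name__ == "__main__":
    # Hypothetical sample; real input would come from
    # `kubectl get node <name> -o json`.
    sample = json.loads(
        '{"status": {"allocatable": {"cpu": "8", "xilinx.com/fpga-example": "1"}}}'
    )
    print(fpga_resources(sample))
```

An empty result would suggest that the plugin did not discover or register any device on that node, even though its pod is in the Running state.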

Following documentation is not working properly.

I have a Kubernetes cluster running with two nodes, using the Calico CNI. One node has two U55C cards and the other has one; all cards are flashed and XRT is installed on every node. I am following the instructions in this document: https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Installing-K8s-Device-Plugin-on-Kubernetes

When I start the DaemonSet with the command

kubectl apply -f ./k8s-device-plugin.yml

the pods get stuck in a crash loop. The pod logs show the following:

time="2023-06-27T22:14:23Z" level=info msg="Plugin Version: 1.2.0"
time="2023-06-27T22:14:23Z" level=info msg="Set U30NameConvention: CommonName"
time="2023-06-27T22:14:23Z" level=info msg="Set U30AllocUnit: Card"
time="2023-06-27T22:14:23Z" level=info msg="Set DeviceNameCustomize: False"
time="2023-06-27T22:14:23Z" level=info msg="Virtual Device Mode: OFF"
time="2023-06-27T22:14:23Z" level=warning msg="Invalid input for VirtualNum, will set VirtualNum as 1"
time="2023-06-27T22:14:23Z" level=info msg="VirtualNum: 1"
time="2023-06-27T22:14:23Z" level=info msg="Starting FS watcher."
time="2023-06-27T22:14:23Z" level=info msg="Starting OS watcher."
panic: runtime error: index out of range [1] with length 1

goroutine 5 [running]:
main.GetDevices()
/root/yuzhang/upgrade/k8s-fpga-device-plugin-1/fpga.go:219 +0x18c5
main.NewFPGADevicePlugin.func1()
/root/yuzhang/upgrade/k8s-fpga-device-plugin-1/server.go:149 +0x99
created by main.NewFPGADevicePlugin
/root/yuzhang/upgrade/k8s-fpga-device-plugin-1/server.go:147 +0x438

Could you provide any insight into why this is not working?
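
The panic message itself narrows things down: `index out of range [1] with length 1` means the code at fpga.go:219 read element 1 of a sequence that held only one element. A common way to get there is splitting a string on a separator and assuming that at least two fields come back; whether `GetDevices()` does exactly that is an assumption, since the source line is not shown here. The sketch below illustrates the failure mode and a guarded alternative in Python (the plugin itself is written in Go). The function `second_field` and the sample strings are hypothetical.

```python
from typing import Optional


def second_field(raw: str, sep: str = "_") -> Optional[str]:
    """Return the second sep-separated field of raw, or None if it is absent."""
    parts = raw.split(sep)
    # Unguarded code would simply return parts[1], which raises IndexError
    # in Python (the Go equivalent panics with "index out of range [1] with
    # length 1") whenever the separator is missing from the input.
    if len(parts) < 2:
        return None
    return parts[1]


if __name__ == "__main__":
    print(second_field("alpha_beta"))  # input with two fields
    print(second_field("alphabeta"))   # input with one field, handled without crashing
```

If the real code follows this pattern, the input that lacks the expected separator (for example a device name or sysfs entry in an unexpected format) would be the thing to look for on the affected nodes.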

FPGA

When running more than one job inside a pod, we cannot submit jobs reliably. If more than one job is submitted in succession, we get an input/output error. The problem can be mitigated by running xbutil reset from the host before the pod is spun up, but this is not a desirable workaround.

Any feedback would be appreciated.

user@mlcluster-interactive-example-jfdz2:~/FPGA_test$ ./host vadd_hw.xclbin 512 0 1 64

 Total Data of 512.000 Mbytes to be written to global memory from host

 Kernel is invoked 1 time and repeats itself 1 times

Found Platform
Platform Name: Xilinx
DEVICE xilinx_u55c_gen3x16_xdma_base_3
INFO: Reading vadd_hw.xclbin
Loading: 'vadd_hw.xclbin'
- host loop iteration #0 of 1 total iterations
kernel_time_in_sec = 0.0421578
Duration using events profiling: 42050286 ns
 match_count = 134217728 mismatch_count = 0 total_data_size = 134217728
Throughput Achieved = 12.7674 GB/s
TEST PASSED
user@mlcluster-interactive-example-jfdz2:~/FPGA_test$ ./host vadd_hw.xclbin 512 0 1 64

 Total Data of 512.000 Mbytes to be written to global memory from host

 Kernel is invoked 1 time and repeats itself 1 times

Found Platform
Platform Name: Xilinx
DEVICE xilinx_u55c_gen3x16_xdma_base_3
INFO: Reading vadd_hw.xclbin
Loading: 'vadd_hw.xclbin'
- host loop iteration #0 of 1 total iterations
XRT build version: 2.14.384
Build hash: 090bb050d570d2b668477c3bd0f979dc3a34b9db
Build date: 2022-12-09 00:55:08
Git branch: 2022.2
PID: 99
UID: 1006
[Mon Apr  8 15:10:45 2024 GMT]
HOST: mlcluster-interactive-example-jfdz2
EXE: /home/gregj/FPGA_test/host
[XRT] ERROR: unable to sync BO: Input/output error
terminate called after throwing an instance of 'xrt_xocl::error'
  what():  event 0 never submitted
Aborted (core dumped)

How can I use k8s-device-plugin with the Kria KV260?

I am a beginner with FPGAs. I want to use a KV260 with k8s, and I have some questions.

  1. Is it enough to change the fpga.go file and modify /dev/dri/renderDXXX?
  2. I saw #6 (comment). Someone says that embedded platforms do not have a shell. What does "shell" mean here? Is it the same as the thin shell described in "Zynq-7000 and ZYNQ Ultrascale+ MPSoC Based Embedded Platforms" ( https://xilinx.github.io/XRT/2020.2/html/platforms.html )? If I use a KV260, can I load a bitstream dynamically?

Thanks.

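After modifying fpga.go, are my plugin logs normal?

Hello, following your earlier suggestion, I modified fpga.go.
After that, I re-ran the build script on the zcu102 and used the Dockerfile to rebuild the arm64 plugin image.
Finally, I ran the command to deploy the plugin. The logs are as follows:
root@zcu102:~# kubectl logs -n kube-system fpga-device-plugin-daemonset-pcvhq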
time="2019-12-17T08:05:18Z" level=info msg="Starting FS watcher."
time="2019-12-17T08:05:18Z" level=info msg="Starting OS watcher."
time="2019-12-17T08:05:19Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/drm_minor-20191217-fpga.sock"
2019/12/17 08:05:19 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write unix /var/lib/kubelet/device-plugins/drm_minor-20191217-fpga.sock->@: write: broken pipe"
time="2019-12-17T08:05:19Z" level=info msg="Registered device plugin with Kubelet xilinx.com/fpga-drm_minor-20191217"
time="2019-12-17T08:05:19Z" level=info msg="Sending 1 device(s) [&Device{ID:a0000000.zyxclmm_drm,Health:Healthy,}] to kubelet"
root@zcu102:~#
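This looks consistent with the output in the README, but I noticed that one line is missing at the end:
msg="Receiving request 1"
The Allocate method in server.go should print this message, which indicates the device information that kubelet returns to the plugin.
As things stand, though, nothing seems to have been returned.
In addition, describe node shows the following in both the Capacity and Allocatable fields:
xilinx.com/fpga-drm_minor-20191217:1
What should I do next to make sure the plugin is deployed correctly?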
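
In the Kubernetes device plugin API, the kubelet calls `Allocate` when a container that requests the resource is being created, so the absence of `Receiving request` before any pod asks for the device is expected rather than a sign of failure. A minimal sketch of a pod that requests the resource shown above follows; it builds the manifest as JSON, which kubectl accepts alongside YAML. The pod name, image, and command are placeholders, not values from this repository.

```python
import json


def fpga_pod_manifest(resource: str, name: str = "fpga-test-pod",
                      image: str = "example/image:tag") -> dict:
    """Build a Pod manifest that requests one unit of an extended resource."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,  # placeholder: use an image built for the board
                "command": ["sleep", "infinity"],  # placeholder workload
                # Extended resources are requested through limits.
                "resources": {"limits": {resource: "1"}},
            }],
        },
    }


if __name__ == "__main__":
    manifest = fpga_pod_manifest("xilinx.com/fpga-drm_minor-20191217")
    print(json.dumps(manifest, indent=2))
```

Saving the printed output to a file and applying it with `kubectl apply -f` should, if the plugin is healthy, cause the kubelet to call `Allocate` and the plugin to log the request.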

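XRT cannot discover the FPGA device

I downloaded the deb package for the U50 on Ubuntu 22.04 from the official website. After installing it, the mpd service fails to start. On inspection, there is no xfpga device under /dev/. How is this device created? Running lspci | grep Xilinx does show the FPGA information.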

The FPGA DaemonSet plugin does not come up as expected

I don't know what happened: the FPGA DaemonSet plugin worked previously, but now it is completely "out of work".
Here "out of work" means that I cannot get the DaemonSet pod, and if I check the ds status specifically, I get the result below:

$ kubectl get daemonset -n kube-system
NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
fpga-device-plugin-daemonset   0         0         0       0            0           <none>                   115s
kube-flannel-ds                1         1         1       1            1           <none>                   96m
kube-proxy                     1         1         1       1            1           kubernetes.io/os=linux   98m
------
$ kubectl describe ds fpga-device-plugin-daemonset -n kube-system
Name:           fpga-device-plugin-daemonset
Selector:       name=xilinx-fpga-device-plugin
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=xilinx-fpga-device-plugin
  Annotations:  scheduler.alpha.kubernetes.io/critical-pod: 
  Containers:
   xilinx-fpga-device-plugin:
    Image:        xilinx_k8s_fpga_plugin_lma:0.1
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
Events:            <none>

The other components of the Node work perfectly:

$ kubectl get pod -n kube-system
NAME                                     READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-6n4pj                  1/1     Running   0          74m
coredns-f9fd979d6-9w5wh                  1/1     Running   0          74m
etcd-xeniro-fpga-pc                      1/1     Running   0          74m
kube-apiserver-xeniro-fpga-pc            1/1     Running   0          74m
kube-controller-manager-xeniro-fpga-pc   1/1     Running   0          74m
kube-flannel-ds-gc9mp                    1/1     Running   0          22m
kube-proxy-jnssx                         1/1     Running   0          74m
kube-scheduler-xeniro-fpga-pc            1/1     Running   0          74m

All of this started after a failure in my own container (an FPGA test that ended with a "container evicted" error). I have since rebooted the computer and restarted all the K8s components from scratch.
For the FPGA plugin image, I have been using a Docker image built directly from the current repo (previously I used the older xilinxatg/xilinx_k8s_fpga_plugin, whose tag is about one year old). Both have the same issue.
Do you have any suggestions?
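
A DaemonSet whose DESIRED count is 0 has found no node it is allowed to run on. One possible cause, offered here as a guess rather than a diagnosis, is a node taint that the plugin's pod template does not tolerate; the flannel DaemonSet in the same output is scheduled, which would be consistent with it carrying broader tolerations. The sketch below is a simplified model of Kubernetes taint and toleration matching, not the scheduler's actual code. The sample taint is hypothetical and would normally be read from `kubectl describe node`.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    # An empty effect in the toleration matches every taint effect.
    if tol.get("effect") and tol["effect"] != taint.get("effect"):
        return False
    if tol.get("operator", "Equal") == "Exists":
        # An empty key combined with Exists matches every taint key.
        return not tol.get("key") or tol["key"] == taint.get("key")
    return (tol.get("key") == taint.get("key")
            and tol.get("value", "") == taint.get("value", ""))


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the NoSchedule/NoExecute taints that no toleration matches."""
    return [t for t in taints
            if t.get("effect") in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, t) for tol in tolerations)]


if __name__ == "__main__":
    # Hypothetical taint, as often found on a single-node control plane.
    node_taints = [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}]
    print(blocking_taints(node_taints, []))
    print(blocking_taints(node_taints, [{"operator": "Exists"}]))
```

If the first call returns a taint for the real node and pod template, adding a matching toleration to the DaemonSet (or removing the taint) would be the thing to try.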

Socket communication between pods cannot be established

Following the guide, I can set up all the pods, services, and deployments, but when I finally run the exec command to communicate, the error below is reported. It looks like the socket connection is not established. Can you please help check?

$ kubectl exec test-client python /opt/xilinx/k8s/client/client.py
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
Traceback (most recent call last):
  File "/opt/xilinx/k8s/client/client.py", line 33, in <module>
    main()
  File "/opt/xilinx/k8s/client/client.py", line 32, in main
    client_send()
  File "/opt/xilinx/k8s/client/client.py", line 15, in client_send
    client.connect((bind_ip, bind_port))
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 111] Connection refused
command terminated with exit code 1
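
On Linux, errno 111 (ECONNREFUSED) generally means the target host answered but nothing was accepting connections on that address and port. Typical causes are a server that is not running, a wrong IP or port on the client side, or a server bound only to 127.0.0.1 inside its own pod. A minimal probe along the following lines, run from the client pod, can separate a reachability problem from a bug in client.py; `can_connect` is a hypothetical helper, and the demo uses a throwaway local listener rather than the real service.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111) as well as timeouts.
        return False


if __name__ == "__main__":
    # Demo against a temporary local listener. In the cluster, host and port
    # would be the values client.py uses for bind_ip and bind_port.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    print(can_connect("127.0.0.1", port))  # listener is up
    server.close()
    print(can_connect("127.0.0.1", port))  # listener is gone
```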

Rename master branch to something less offensive

Many software projects that are managed and distributed using tools like git use master/dev terminology. The term master has extremely offensive connotations, and the team at GitHub agrees (https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/). We should encourage all software projects to adopt less offensive terminology.

The primary software project I lead has switched to using the term release, instead of master. Since the master branch is mostly used for offering software releases, this seemed to make the most sense. Another term that could replace master is main.

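Can the device plugin detect the FPGA resources on a ZCU102 board?

Hello!
I previously booted an Ubuntu desktop system on the zcu102 board and then successfully set up a Kubernetes cluster
(besides the zcu102, the other node is an x86 Ubuntu server).
Can I deploy the FPGA device plugin on the zcu102 board so that the FPGA resources can be used?
Looking forward to your reply!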

device-plugin on K3S

Is it compatible with K3S?

This is my output

marco@master-node:~/fpga_service/k8s-device-plugin$ kubectl get daemonset -n kube-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
svclb-traefik-1c41e552 2 2 2 2 2 25d
svclb-nginx-service-c37f202d 2 2 2 2 2 22d
device-plugin-daemonset 1 1 1 1 1 role=worker 25m

marco@master-node:~/fpga_service/k8s-device-plugin$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
helm-install-traefik-crd-g9m76 0/1 Completed 0 25d
helm-install-traefik-4s8jg 0/1 Completed 1 25d
svclb-nginx-service-c37f202d-wv2gj 1/1 Running 3 (3h29m ago) 22d
svclb-traefik-1c41e552-9qt6h 2/2 Running 10 (3h29m ago) 25d
traefik-f4564c4f4-7tm9b 1/1 Running 5 (3h29m ago) 25d
coredns-6799fbcd5-jffjm 1/1 Running 5 (3h29m ago) 25d
local-path-provisioner-6c86858495-7g8c2 1/1 Running 10 (3h28m ago) 25d
metrics-server-54fd9b65b-fvpbf 1/1 Running 10 (3h28m ago) 25d
svclb-traefik-1c41e552-p6b94 2/2 Running 0 18m
svclb-nginx-service-c37f202d-gcq9v 1/1 Running 0 18m
device-plugin-daemonset-6f2j4 1/1 Running 0 18m

marco@master-node:~/fpga_service/k8s-device-plugin$ kubectl logs device-plugin-daemonset-6f2j4 -n kube-system -c device-plugin
time="2024-05-24T10:19:43Z" level=info msg="Plugin Version: 1.3.0"
time="2024-05-24T10:19:43Z" level=info msg="Set U30NameConvention: CommonName"
time="2024-05-24T10:19:43Z" level=info msg="Set U30AllocUnit: Card"
time="2024-05-24T10:19:43Z" level=info msg="Set DeviceNameCustomize: False"
time="2024-05-24T10:19:43Z" level=info msg="Virtual Device Mode: OFF"
time="2024-05-24T10:19:43Z" level=warning msg="Invalid input for VirtualNum, will set VirtualNum as 1"
time="2024-05-24T10:19:43Z" level=info msg="VirtualNum: 1"
time="2024-05-24T10:19:43Z" level=info msg="Starting FS watcher."
time="2024-05-24T10:19:43Z" level=info msg="Starting OS watcher."
