I'm using a Skull Canyon NUC box with Iris Pro Graphics 580. Most of the examples run

fig_15_3 example hangs on Iris Pro Graphics 580

apress commented on August 26, 2024

fig_15_3 example hangs on Iris Pro Graphics 580

from data-parallel-cpp.

Comments (8)

jnorwood commented on August 26, 2024

On the same GPU, the fig_15_5 example runs at about 0.11 sec per iteration and fig_15_7 at about 0.035 sec per iteration, so the 7 sec per iteration of the fig_15_3 single task example seems extremely slow.


bashbaug commented on August 26, 2024

Interesting, the "single task" version is not going to run very well on most GPUs, but the time you are seeing is excessive.
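For context on why a "single task" kernel is a poor fit for a GPU, it helps to look at the shape of the work. The sketch below is plain host C++, not the book's fig_15_3 source: it assumes square, row-major matrices and runs the whole triple loop as one serial task, which is roughly what a single work-item has to do when nothing is parallelized. The name matmul_single_task and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the book samples): computes C = A * B for
// square n x n row-major matrices in a single serial loop nest.
// All 2 * n^3 floating-point operations are done by one thread of execution.
std::vector<float> matmul_single_task(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      std::size_t n) {
  std::vector<float> c(n * n, 0.0f);
  for (std::size_t row = 0; row < n; ++row) {
    for (std::size_t col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < n; ++k) {
        sum += a[row * n + k] * b[k * n + col];
      }
      c[row * n + col] = sum;
    }
  }
  return c;
}
```

Calling matmul_single_task(a, b, 128) does about 4.2 million floating-point operations one after another; a data-parallel version would instead spread the (row, col) pairs across many work-items.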

Could you please include:

What version of the dpcpp compiler you are using, from dpcpp --version?
What driver versions you have installed, from sycl-ls or sycl-ls --verbose?

As a data point, you may also want to try using the OpenCL GPU backend instead of the Level Zero GPU backend. You can do this with the SYCL_BE or SYCL_DEVICE_FILTER environment variables - see here. I don't think this will make a difference (it doesn't on my similar Intel(R) HD Graphics 620 system), but it's worth a try.

Thanks!


jnorwood commented on August 26, 2024

I'm using the most recent docker released version

root@33541cf26757:/workspaces/data-parallel-CPP-main/build# dpcpp --version
Intel(R) oneAPI DPC++ Compiler 2021.2.0 (2021.2.0.20210317)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.2.0/linux/bin

root@33541cf26757:/workspaces/data-parallel-CPP-main/build# sycl-ls
ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.11.3.0.17_160000]
CPU : Intel(R) OpenCL 2.1 [2021.11.3.0.17_160000]
GPU : Intel(R) OpenCL HD Graphics 3.0 [21.11.19310]
GPU : Intel(R) Level-Zero 1.0 [1.0.19310]
HOST: SYCL host platform 1.2 [1.2]

I retried using
export SYCL_DEVICE_FILTER=opencl:gpu:2
based on the documentation on GitHub.
It still hangs for the original matrixSize=128, but it made it through four iterations for matrixSize=100 at about 6.3 sec/iteration.
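The filter value used above follows a backend:device_type:device_num pattern, so opencl:gpu:2 asks for the OpenCL backend, a GPU device, and device number 2. The following is a minimal sketch of how such a string splits into its parts; it is illustrative only and is not how the SYCL runtime parses the variable. The DeviceFilter struct and parse_device_filter function are hypothetical names, and -1 is an assumed marker for "no device number given".

```cpp
#include <sstream>
#include <string>

// Hypothetical holder for one "backend:device_type:device_num" triple.
struct DeviceFilter {
  std::string backend;      // e.g. "opencl" or "level_zero"
  std::string device_type;  // e.g. "gpu" or "cpu"
  int device_num = -1;      // -1 means the number was omitted (assumption)
};

// Illustrative parser: splits the text on ':' into at most three fields.
DeviceFilter parse_device_filter(const std::string& text) {
  DeviceFilter filter;
  std::istringstream in(text);
  std::string num;
  std::getline(in, filter.backend, ':');
  std::getline(in, filter.device_type, ':');
  if (std::getline(in, num, ':') && !num.empty()) {
    filter.device_num = std::stoi(num);
  }
  return filter;
}
```

For example, parse_device_filter("opencl:gpu:2") yields backend "opencl", device type "gpu", and device number 2, while "opencl:gpu" leaves the device number unset.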


bashbaug commented on August 26, 2024

I got the most recent docker version working on my system also. Note that there appears to be a slightly newer version than the one you are using. I'm not able to reproduce this issue on my end:

root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.3.0/linux/bin
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# sycl-ls
0. ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.12.6.0.19_160000]
1. CPU : Intel(R) OpenCL 2.1 [2021.12.6.0.19_160000]
2. GPU : Intel(R) OpenCL HD Graphics 3.0 [21.23.20043]
3. GPU : Intel(R) Level-Zero 1.1 [1.1.20043]
4. HOST: SYCL host platform 1.2 [1.2]
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# samples/Ch15_gpus/fig_15_3_single_task_matrix_multiplication 
Running on device: Intel(R) HD Graphics 620 [0x5916]
Success!
GFlops: 0.0249472
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build# SYCL_DEVICE_FILTER=opencl:gpu samples/Ch15_gpus/fig_15_3_single_task_matrix_multiplication 
Running on device: Intel(R) HD Graphics 620 [0x5916]
Success!
GFlops: 0.0249608
root@55f9fbe3ec3b:/workspaces/sycl-book-samples/build#

A couple of possibilities:

1. Perhaps there was an issue in the older docker image that has been fixed? This would be the best-case scenario. Can you please try grabbing the latest docker image and give it a try?
2. Maybe there is an issue with your Iris Pro Graphics 580 that does not appear on my HD Graphics 620? I think this is unlikely - if anything your GPU should be faster! - but I suppose it is possible.
3. Could there be anything else odd happening with your system? Is everything else running OK?

Since (1) is the easiest to check, let's start there first.


jnorwood commented on August 26, 2024

OK, thanks. Yes, I had pulled the latest docker images, but neglected to rebuild my docker environment in vscode and update its compiler paths.

After doing that, I deleted my build directory and re-created the makefiles. Here are my cmake configure options as well: no optimizations, with debug enabled. Maybe that has something to do with the issue.

mkdir build
cd build
CXX=dpcpp cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_FLAGS="-O0" -D NODPL=1 ../

Here is the compiler version showing the update to latest version and the sycl-ls versions
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2021.3.0/linux/bin
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build#

root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# sycl-ls
0. ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.12.6.0.19_160000]
1. CPU : Intel(R) OpenCL 2.1 [2021.12.6.0.19_160000]
2. GPU : Intel(R) OpenCL HD Graphics 3.0 [21.23.20043]
3. GPU : Intel(R) Level-Zero 1.1 [1.1.20043]
4. HOST: SYCL host platform 1.2 [1.2]

However, the end result with matrixSize=128 is still a hang.
With matrixSize=100 it completes, although with long iterations.
With matrixSize=110 it hangs after one 9.2-second iteration.

I'm attaching the screen captures with the iterations showing the matrixSize for 100 and 110 cases.


jnorwood commented on August 26, 2024

I confirmed that the problem is associated with the disabled optimization. If I override the build optimization to -O2 with
make CXX_FLAGS="-O2", then the MatrixSize:128 case completes.
I normally build with -O0 due to the poor debugger support with -O2 optimization.

Running on device: Intel(R) Iris(TM) Pro Graphics 580 [0x193b]
MatrixSize:128
time:0.979357
time:0.807278
time:0.823998
time:0.853383
Success!
GFlops: 0.00519562


bashbaug commented on August 26, 2024

Thanks for investigating further. I can reproduce the excessive execution time using -O0 also.

I'm checking to see if there is a way to compile the host code with -O0 for easier debugging but to keep the device code (that executes on the GPU, and is leading to the excessive execution time) using a different optimization level.

Would this satisfy your use-case? I see you mentioned above:

"I normally build with -O0 due to the poor debugger support with -O2 optimization."


jnorwood commented on August 26, 2024

I already have the work-arounds of reducing MatrixSize and/or using Q{cpu_selector{}}.

With MatrixSize==128 and using gpu_selector I can wait for 4 minutes without executing a single iteration, so I presume something is hung.

Using cpu_selector, fig_15_3_single_task completes an iteration in about 0.4 sec.

There is a document on using gdb with the GPU, gpu_debug, which I'm linking here for reference. It mentions the heartbeat_interval, enable_hangcheck and preempt_timeout settings, which I haven't explicitly set.

I'll come back to this problem after finishing the DPC++ book examples and see if I can debug the GPU hang further.
