
data-parallel-cpp's Introduction

Data Parallel C++ Book Source Samples

This repository accompanies Data Parallel C++: Programming Accelerated Systems using C++ and SYCL by James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, and Xinmin Tian, second edition (Apress, 2023), and the first edition (Apress, 2020).

Cover 2nd Edition Cover 1st Edition

Purpose of this branch (main)

This branch (main) contains source code expanded from the Second Edition of the DPC++ book (available now!). We say 'expanded' because it includes code not listed in the book, and we will update it as needed to keep it useful, as we did after the first edition was published. We welcome feedback. The sycl121_original_publication branch contains the source code published in the first edition. The first edition's book source was based primarily on the older SYCL 1.2.1 specification, and many enhancements and changes had been added by the time the SYCL 2020 specification was published after our book. Since current toolchains that support SYCL are based on SYCL 2020, this main branch is intended to be compatible with recent compiler and toolchain releases.

The Second Edition of the DPC++ book, available now, is based on the updated code examples in this main branch.

Overview

Many of the samples in the book are snippets from the more complete files in this repository. The full files contain supporting code, such as header inclusions, that is not shown in every listing within the book. The complete listings are intended to compile and to be modifiable for experimentation.

Samples in this repository are updated to align with the most recent changes to the language and toolchains, and are more current than what is captured in the book text, due to the lag between finalization and actual publication of a print book. If experimenting with the code samples, start with the versions in this repository. DPC++ and SYCL are evolving to be more powerful and easier to use, and updates to the sample code in this repository are a good sign of forward progress!

Download the files as a zip using the green button, or clone the repository to your machine using Git.

How to Build the Samples

The samples in this repository are intended to compile with any modern C++ compiler with SYCL support. We have tested them with the open-source DPC++ project toolchain linked below, and with the 2023.2 release and newer of the oneAPI prebuilt icpx compilers based on the DPC++ open-source project. If you have an older toolchain installed, you may encounter compilation errors due to the evolution of features and extensions. Recent testing verified that AdaptiveCpp (previously hipSYCL) is able to support all these examples as well, with a few rare exceptions that should be resolved soon. We welcome any feedback regarding compatibility with any C++ compiler that has SYCL support.

Prerequisites

  1. An installed SYCL toolchain. See below for details on the tested DPC++ toolchain
  2. CMake 3.10 or newer (Linux) or CMake 3.25 or newer (Windows)
  3. Ninja or Make - to use the build steps described below

To build and use these examples, you will need an installed DPC++ toolchain. For one such toolchain, please visit:

https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/dpc-compiler.html

Alternatively, much of the toolchain can be built directly from:

https://github.com/intel/llvm

Some of the samples require other dependencies. To disable samples requiring these dependencies use the CMake variables described below.

Setting Up an Environment to Build the Samples

Set up environment variables if using a oneAPI / DPC++ implementation:

On Windows:

\path\to\inteloneapi\setvars.bat

On Linux:

source /path/to/inteloneapi/setvars.sh

Building the Samples:

Note: CMake supports different generators to create build files for different build systems. Some popular generators are Unix Makefiles or Ninja when building from the command line, and Visual Studio-based generators when building from a Windows IDE. The examples below generate build files for Unix Makefiles, but feel free to substitute a different generator, if preferred.

  1. Create build files using CMake. For example:

    mkdir build && cd build
    cmake -G "Unix Makefiles" ..
  2. Build with the generated build files:

    cmake --build . --target install --parallel

    Or, use the generated Makefiles directly:

    make install -j8

If your SYCL compiler is not detected automatically, or to explicitly specify a different SYCL compiler, use the CMAKE_CXX_COMPILER variable. For example:

cmake -G "Unix Makefiles" -DCMAKE_CXX_COMPILER=/path/to/your/sycl/compiler ..

CMake Variables:

The following CMake variables are supported. To specify one of these variables on the command line, use the CMake syntax -D<option name>=<value>. See your CMake documentation for more details.

Variable   Type   Description
NODPL      BOOL   Disable samples that require the oneAPI DPC++ Library (oneDPL). Default: FALSE
NODPCT     BOOL   Disable samples that require the DPC++ Compatibility Tool (dpct). Default: FALSE
NOL0       BOOL   Disable samples that require the oneAPI Level Zero Headers and Loader. Default: TRUE
WITHCUDA   BOOL   Enable CUDA device support for the samples. Default: FALSE
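
For example, to configure a build with the oneDPL-dependent samples disabled and CUDA device support enabled (illustrative values; adjust for your setup):

cmake -G "Unix Makefiles" -DNODPL=TRUE -DWITHCUDA=TRUE ..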

data-parallel-cpp's People

Contributors

bashbaug, breyerml, dm-vodopyanov, jamesreinders, jbrodman, jessicavakili, pennycook, piotrfusik, steffenlarsen, vinmierlak, xtian-github


data-parallel-cpp's Issues

fig_14_18-20_inclusive_scan sample fails to build on Windows

Hi,
I just compiled the samples on Windows with oneAPI beta 10 installed, without DPL, and all samples built correctly except one.
I'm on: Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29112 for x64

e:\a\data-parallel-CPP-main\build>ninja
[1/2] Building CXX object samples/Ch14_common_parallel_patterns/CMakeFiles/fig_14_18-20_inclusive_scan.dir/fig_14_18-20_inclusive_scan.cpp.obj
FAILED: samples/Ch14_common_parallel_patterns/CMakeFiles/fig_14_18-20_inclusive_scan.dir/fig_14_18-20_inclusive_scan.cpp.obj
C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\CLANG_~1.EXE    -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt   -fsycl -fsycl-unnamed-lambda -ferror-limit=1 -Wall -Wpedantic -std=gnu++17 -MD -MT samples/Ch14_common_parallel_patterns/CMakeFiles/fig_14_18-20_inclusive_scan.dir/fig_14_18-20_inclusive_scan.cpp.obj -MF samples\Ch14_common_parallel_patterns\CMakeFiles\fig_14_18-20_inclusive_scan.dir\fig_14_18-20_inclusive_scan.cpp.obj.d -o samples/Ch14_common_parallel_patterns/CMakeFiles/fig_14_18-20_inclusive_scan.dir/fig_14_18-20_inclusive_scan.cpp.obj -c ../samples/Ch14_common_parallel_patterns/fig_14_18-20_inclusive_scan.cpp
In file included from ../samples/Ch14_common_parallel_patterns/fig_14_18-20_inclusive_scan.cpp:5:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl.hpp:11:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl/ONEAPI/atomic.hpp:11:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl/ONEAPI/atomic_accessor.hpp:11:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl/ONEAPI/atomic_enums.hpp:12:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl/access/access.hpp:10:
In file included from C:\PROGRA~2\Intel\oneAPI\compiler\latest\windows\bin\..\include\sycl\CL/sycl/detail/common.hpp:109:
In file included from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\iostream:11:
In file included from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\istream:11:
In file included from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\ostream:11:
In file included from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\ios:11:
In file included from C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\xlocnum:12:
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\cmath:167:12: error: SYCL kernel cannot call a dllimport function
    return _CSTD log2f(_Xx);
           ^
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\yvals_core.h:1228:15: note: expanded from macro '_CSTD'
#define _CSTD ::
              ^
C:\Program Files (x86)\Windows Kits\10\include\10.0.20246.0\ucrt\corecrt_math.h:570:47: note: 'log2f' declared here
    _Check_return_ _ACRTIMP float     __cdecl log2f(_In_ float _X);
                                              ^
1 error generated.
ninja: build stopped: subcommand failed.

modified fig_3_11_depends_on doesn't work with multiple cores

I modified fig_3_11_depends_on to test it further, since the example didn't display anything that could show its execution. I am providing a modified version that uses event.wait() for each event. It produces the expected results.

I've left in the commented-out depends_on statements. If I uncomment those and comment out the event.wait() statements, the results are not as expected.
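
For reference, here is a minimal sketch (not the attached code; names and sizes are illustrative) of the two patterns being compared: waiting on each event explicitly versus expressing the ordering with depends_on.

#include <sycl/sycl.hpp>
using namespace sycl;

int main() {
  queue q;
  constexpr size_t N = 16;
  int* data = malloc_shared<int>(N, q);

  // Pattern 1: wait on each event explicitly before submitting the next kernel.
  event e1 = q.parallel_for(range{N}, [=](id<1> i) { data[i] = 1; });
  e1.wait();
  event e2 = q.parallel_for(range{N}, [=](id<1> i) { data[i] += 1; });
  e2.wait();

  // Pattern 2: express the ordering with depends_on and wait once at the end.
  event e3 = q.submit([&](handler& h) {
    h.depends_on(e2);
    h.parallel_for(range{N}, [=](id<1> i) { data[i] *= 2; });
  });
  e3.wait();

  free(data, q);
  return 0;
}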

Also, in general, I don't see an example of using depends_on with the execution blocks rearranged. I tried creating event objects at the entry of main, but I see no example of setting these events to an initial state, or of having the event objects updated without an assignment that would overwrite the object.

I'm going to attach my modified example
fig_3_11_depends_on.zip

I'm using the oneAPI Docker installation on Ubuntu 20.04, and building the examples with CMake-generated makefiles, but with -O0 optimization.

Reference for collective functions section at end of chapter 9

Hi,

At the end of chapter 9, there is discussion of some collective functions including "broadcast", "any_of"/"all_of", "shuffle"/"shuffle_up"/"shuffle_down" and "shuffle_xor".

None of these seem to come up in a search of the reference linked from Chapter 1 at http://tinyurl.com/dpcppref

Have these methods been moved / removed from the core standard? If not, is there a URL with further details on how these work?

I wasn't able to quite work out exactly what they all do from that section of the book, so was looking for some more description or examples.
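
For what it's worth, in the SYCL 2020 specification these operations were reworked into free "group functions" such as group_broadcast, any_of_group/all_of_group, select_from_group, shift_group_left/shift_group_right, and permute_group_by_xor. Below is a minimal sketch assuming a SYCL 2020 toolchain; the names and sizes are illustrative and not taken from the book.

#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  queue q;
  constexpr size_t N = 32;
  int* out = malloc_shared<int>(N, q);

  q.parallel_for(nd_range<1>{range<1>{N}, range<1>{N}}, [=](nd_item<1> it) {
     auto sg = it.get_sub_group();
     int v = static_cast<int>(it.get_global_id(0));
     int first = group_broadcast(sg, v);     // formerly "broadcast"
     bool any = any_of_group(sg, v > 16);    // formerly "any_of"
     int left = shift_group_left(sg, v, 1);  // formerly "shuffle_down"
     out[it.get_global_id(0)] = first + left + (any ? 1 : 0);
   }).wait();

  std::cout << "out[0] = " << out[0] << "\n";
  free(out, q);
  return 0;
}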

Build error in Ch18 examples

CMakeCache.txt
When I try to build the book's examples from this repo, with Ubuntu 20.04:

.../mkoneapi-book-examples/data-parallel-CPP$ mkdir build                                                                                                                                
.../mkoneapi-book-examples/data-parallel-CPP$ cd build                                                                                                                                   
.../data-parallel-CPP/build$ source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.0.17(1)-release
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dpl -- latest
:: tbb -- latest
:: vpl -- latest
:: oneAPI environment initialized ::
.../data-parallel-CPP/build$ cmake -G Ninja -DCMAKE_TOOLCHAIN_FILE=../dpcpp_toolchain.cmake ..
-- No build type selected, default to Release
-- The C compiler identification is Clang 12.0.0
-- The CXX compiler identification is Clang 12.0.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/intel/oneapi/compiler/2021.2.0/linux/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /space/mkoneapi-book-examples/data-parallel-CPP/build
build $ cmake -G Ninja -DCMAKE_TOOLCHAIN_FILE=../dpcpp_toolchain.cmake ..
build $ ninja

....


[219/250] Building CXX object samples/Ch18_using_libs/CMakeFiles/fig_18_15_pstl_usm.dir/fig_18_15_pstl_usm.cpp.o
FAILED: samples/Ch18_using_libs/CMakeFiles/fig_18_15_pstl_usm.dir/fig_18_15_pstl_usm.cpp.o 
/opt/intel/oneapi/compiler/2021.2.0/linux/bin/clang++   -O3 -DNDEBUG -fsycl -fsycl-unnamed-lambda -ferror-limit=1 -Wall -Wpedantic -std=gnu++17 -MD -MT samples/Ch18_using_libs/CMakeFiles/fig_18_15_pstl_usm.dir/fig_18_15_pstl_usm.cpp.o -MF samples/Ch18_using_libs/CMakeFiles/fig_18_15_pstl_usm.dir/fig_18_15_pstl_usm.cpp.o.d -o samples/Ch18_using_libs/CMakeFiles/fig_18_15_pstl_usm.dir/fig_18_15_pstl_usm.cpp.o -c ../samples/Ch18_using_libs/fig_18_15_pstl_usm.cpp
In file included from ../samples/Ch18_using_libs/fig_18_15_pstl_usm.cpp:6:
In file included from /opt/intel/oneapi/dpl/2021.2.0/linux/include/oneapi/dpl/execution:32:
In file included from /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/execution:32:
In file included from /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/pstl/glue_execution_defs.h:52:
In file included from /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/pstl/algorithm_impl.h:25:
In file included from /usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/pstl/parallel_backend.h:14:
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/pstl/parallel_backend_tbb.h:70:10: error: no member named 'task' in namespace 'tbb'
    tbb::task::self().group()->cancel_group_execution();
    ~~~~~^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
2 errors generated.

fig_15_3 example hangs on Iris Pro Graphics 580

I'm using a Skull Canyon NUC box with Iris Pro Graphics 580. Most of the examples run OK on this, but the fig_15_3 example hangs with the current matrixSize=128 setting. If I lower matrixSize to 96, it executes very slowly, over 6 seconds per iteration. At matrixSize=100 it seemingly stalls after a couple of iterations. I modified the code to add both async and queue exception catches, but no error is caught. I'll attach my code, the stack backtraces, and the system monitor showing the hang after a couple of iterations when matrixSize=100. It looks like the CPU goes to 100% and stays there, with only 3 threads running. This is the single-task matrix multiplication. The parallel versions in the following examples work OK on the GPU. All examples work OK on the CPU.

fig_15_3_call_stack

fig_15_3_cpu_100

fig_15_3_single_task_matrix_multiplication_mod.zip

Fig 1-1 & Fig 1-3

I am attempting to run these on Windows 10 using Intel's DPCPP integrated into Visual Studio 2019 (16.8.5). The project builds successfully from the CMake scripts using the following steps from the instructions, with these modifications:
(1) cmake -G "Visual Studio 16 2019" -DNODPL=1 -DCMAKE_TOOLCHAIN_FILE=../dpcpp_toolchain.cmake ..
(2) Open the resulting project in Visual Studio, and ensure the compiler for Fig 1-1 and 1-3 is set to Intel(R) oneAPI DPC++ Compiler.
(3) Build only Figure 1-1 or Figure 1-3.
(4) Set Figure 1-1 (or Figure 1-3) as the start up project.
(5) Run the project.

The program crashes at line 16 of Figure 1-1 (queue Q) with std::length_error. For additional details, the machine is Windows 10 Home 64-bit with an Intel Core i9 9th Generation processor. As a sanity check, I tested sample code from Intel, retrieved from https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-dpcpp-compiler/top.html, using Intel's DPCPP in Visual Studio 2019, and had no issues with the data structure itself. That code ran to completion. Are there additional steps, not documented in the instructions, that should be taken for this to run in this environment?

fig_6_8 example code references wrong array in kernel loop

The read_only_data array is allocated with malloc_shared, but instead the uninitialized data array is accessed in the kernel loop of fig_6_8.

fig_6_8

Aside from that, the prefetch does almost nothing to impact the execution time of this example, so a better example is needed. I added multi-loop time measurements and tried several modifications of the code to effectively create prefetches. I'll attach my test code.

fig_6_8_prefetch_memadvise_tests.zip

This shows the code added to use a read to cause the prefetch, along with the times for setup (to do the prefetches), followed by the times of multiple passes of writes to the data array. Note that the times are all about the same after the initial prefetches, so apparently the buffer sizes selected aren't really a good test case for use of prefetch.

fig_6_8_tests

fig_6_9_queries: need better documentation of possible get_info return types

The get_info return types are documented in the SYCL manual:
// device info of type described in Table 4.20 of SYCL Spec
and are listed in a device_traits.def file:
// types are in /opt/intel/oneapi/compiler/2021.3.0/linux/include/sycl/CL/sycl/info/device_traits.def
The types aren't easy to find in the debugger source browsing and aren't easy to obtain by any kind of introspection that I could find.
I ended up just assigning to auto variables, running a Linux demangling utility, __cxa_demangle, on typeid(var).name(), and printing out the return types. I don't believe MSVC has this particular demangling utility available.
__cxa_demangle doc
fig_6_9_queries_return_types.zip

Several of the get_info calls can throw exceptions.
fig_6_9_queries_return_types
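
A small sketch of the approach described above, assuming GCC or Clang on Linux (the demangled_type_name helper is made up for illustration): capture the get_info result in an auto variable and demangle its typeid name with abi::__cxa_demangle.

#include <sycl/sycl.hpp>
#include <cxxabi.h>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <typeinfo>

// Demangle the typeid name of a value (GCC/Clang only; MSVC has no __cxa_demangle).
template <typename T>
std::string demangled_type_name(const T&) {
  int status = 0;
  std::unique_ptr<char, void (*)(void*)> demangled(
      abi::__cxa_demangle(typeid(T).name(), nullptr, nullptr, &status), std::free);
  return (status == 0 && demangled) ? demangled.get() : typeid(T).name();
}

int main() {
  sycl::device d;  // default device
  auto name = d.get_info<sycl::info::device::name>();
  auto units = d.get_info<sycl::info::device::max_compute_units>();
  std::cout << "device::name returns " << demangled_type_name(name) << "\n";
  std::cout << "device::max_compute_units returns " << demangled_type_name(units) << "\n";
  return 0;
}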

fig_6_8_prefetch_memadvise, prefetch doesn't improve performance

I added measurements to fig_6_8_prefetch_memadvise and noticed that the prefetch didn't improve the performance.
So I created a simplified test comparing use of the prefetch against a simple read access of the first position in each block. The read access, of course, apparently does prefetch the memory so that subsequent use of it is faster. The prefetch, in all attempts I tried, has no effect. Below are the modified example results.
setup-time is the time to do a read access for the start of each block.
pfsetup-time is the time to issue prefetch hints for the start of each block.
time is the time to do accesses following the initial setup-time accesses.
pftime is the time to do accesses following the prefetch hints.
Note that the first "time" value is low, but the first pftime value does not look improved.
After the first time and pftime, all timings are about the same, as expected.

fig_6_8_prefetch_measure

I'm testing on NUC Skull Canyon box, building with debug enabled and -O0 optimization.
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# dpcpp --version
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)
root@4578405bdff6:/workspaces/data-parallel-CPP-main/build# sycl-ls
0. ACC : Intel(R) FPGA Emulation Platform for OpenCL(TM) 1.2 [2021.12.6.0.19_160000]
1. CPU : Intel(R) OpenCL 2.1 [2021.12.6.0.19_160000]
2. GPU : Intel(R) OpenCL HD Graphics 3.0 [21.23.20043]
3. GPU : Intel(R) Level-Zero 1.1 [1.1.20043]
4. HOST: SYCL host platform 1.2 [1.2]

I'm attaching the test program I created to test prefetch, which is a modified fig_6_8 example, built with cmake in Ubuntu 20.04

fig_6_8_prefetch_memadvise_test.zip
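
For reference, a minimal sketch of the queue::prefetch pattern being measured, assuming a SYCL 2020 toolchain (the block sizes and names are illustrative, not the book's fig_6_8 code):

#include <sycl/sycl.hpp>
using namespace sycl;

int main() {
  queue q;
  constexpr size_t BLOCK_SIZE = 1024, NUM_BLOCKS = 16;
  int* data = malloc_shared<int>(BLOCK_SIZE * NUM_BLOCKS, q);
  for (size_t i = 0; i < BLOCK_SIZE * NUM_BLOCKS; ++i) data[i] = 1;

  for (size_t b = 0; b < NUM_BLOCKS; ++b) {
    // Hint that the next block will be needed on the device...
    event e = q.prefetch(data + b * BLOCK_SIZE, BLOCK_SIZE * sizeof(int));
    // ...then run the kernel that touches that block, after the hint completes.
    q.parallel_for(range{BLOCK_SIZE}, e, [=](id<1> i) {
      data[b * BLOCK_SIZE + i] += 1;
    });
  }
  q.wait();
  free(data, q);
  return 0;
}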

Typo in comment in Figure 4-11

There's a comment:

// Return the offset of this item (if with_offset == true)

but the template parameter is spelled WithOffset.

debugging examples in Docker container within vscode on ubuntu

In addition to Intel's info on debugging in VS Code, there were a couple of modifications I made in devcontainer.json.

"runArgs": [ "--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined", "--device=/dev/dri" ],

Adding --device=/dev/dri was required to get access to the GPU devices.

I uncommented the mounts line:

// Uncomment to use the Docker CLI from inside the container. See https://aka.ms/vscode-remote/samples/docker-from-docker.

"mounts": [ "source=/var/run/docker.sock,target=/var/run/docker.sock,type=bind" ],

I changed my CMake build to remove optimization and add Debug mode. I'm also just using make instead of Ninja:

CXX=dpcpp cmake -D CMAKE_BUILD_TYPE=Debug -D CMAKE_CXX_FLAGS="-O0" -D NODPL=1 ../

I had to change the FPGA devices to FPGA emulator in some projects.

I also had to change from default_device to cpu_device in some projects. The default_device is a GPU, which doesn't work in all the programs.

I'm attaching a launch.json file and my modified devcontainer.json for the examples

vscodefiles.zip


Mistakes in the "Data Parallel C++" book and sample programs

First of all, I wanted to say that I really liked reading your book. It would have been a great help back when I started learning SYCL for my master thesis.

However, while reading, I found a few mistakes and wanted to share them with you (if this is the wrong place or inappropriate please let me know):

  • On page 167 in figure 6-8: you create the read_only_data array, initialize it, and explicitly tell the runtime that this array is used read-only in the kernel (using the mem_advise function). However, in the example code (as well as in the code samples) you never access the read_only_data array in the kernel, so it is never used. I guess the kernel, instead of being
Q.parallel_for(range{BLOCK_SIZE}, e, [=](id<1> i) {
    data[b * BLOCK_SIZE + i] += data[i];
});

should be

Q.parallel_for(range{BLOCK_SIZE}, e, [=](id<1> i) {
    data[b * BLOCK_SIZE + i] += read_only_data[i];
});

(see PR #5)

  • On page 232 in figure 9-13: the first time a lambda kernel function is explicitly named (<class MatrixMultiplication>). However, the naming of lambda kernels is first explained 15 pages later on (page 247 in figure 10-4).
  • On page 265 in figure 11-1: the first and third code lines are missing comment blocks (before vec Class declaration and vec Class Members).
  • On page 267 in figure 11-5: the first operator example has a superfluous comma and brace and a missing comma in its parameter list:
template <typename dataT, int numElements>
vec<dataT, numElements>
operatorOP(const dataT, &lhs)
  const vec<dataT, numElements> &rhs);

should be

template <typename dataT, int numElements>
vec<dataT, numElements>
operatorOP(const dataT &lhs,
  const vec<dataT, numElements> &rhs);
  • On page 272: the second-to-last line, "[...] will vectorize [...]", contains wrongly colored characters (link color).
  • On page 310 in figure 13-5: the links are not clickable (since the whole table isn't clickable/selectable). Other tables aren't clickable either; however, this one contains hyperlinks.
  • On page 330: the exclusive interval should read [0, i) (or [0, i[) instead of [0, i)].
  • On page 342/343 in figures 14-14 and 14-15: the offsets to the SYCL kernel were deprecated in the SYCL 2020 specification. I guess this happened after this book was finished. However, with the current DPC++ compiler (2021.1 (2020.10.0.1113)) the respective code samples don't work correctly anymore. To fix this, the offset must be applied by hand (see PR #6).
  • On page 399 figure 16-6: the assert statement
assert(vecA.size() == vecB.size() == vecC.size());

should be (already correct in the code samples)

assert(vecA.size() == vecB.size() && vecB.size() == vecC.size());
  • In the code sample for figure 16-5: the number of loaded bytes for the vector triad is saved as int. This causes an integer overflow (resulting in a negative bandwidth) when entering an array_size larger than 89'478'485. However, I had to use an array_size of 100'000'000 to get representative runtimes (~0.15s). Changing the type of the triad_byte variable to std::size_t should fix this issue (see PR #7).
  • On page 415: I couldn't find anything regarding the compiler macro D__SYCL_DISABLE_ID_TO_INT_CONV__. However, I found the compiler flag -fsycl-id-queries-fit-in-int that seems to fulfill the same purpose.
  • On page 536 Ep-4: there is an additional newline after the template declaration and before the struct.
  • In some code examples you use the intel:: namespace. However, to be able to compile the respective code samples, the ONEAPI or INTEL namespaces must be used. Are there efforts to introduce all this functionality in a unified intel:: namespace, and is that why you used intel:: in the example code in the book?

More personal opinion based points regarding the code examples:

  • On page 416 in figure 16-18: you used int vals[] = {300, 200, 100, 000}; Personally, I wouldn't use 000 but 0 (I guess you used 000 here for a better alignment with the other numbers). However, I don't believe that the majority of new C++ programmers know that number literals starting with a 0 are represented as octal numbers. So they will be surprised when playing around with the samples and changing, e.g., 000 to 012 and realizing that 012 != 12.
  • Some code examples are rather inconsistent compared to each other, e.g.:
    • You mentioned the sycl::buffer::get_access method on page 185, but only ever used it in one code example on page 405 (figure 16-10).
    • Sometimes you use the STL algorithms std::fill and std::iota and sometimes plain for-loops.
    • Sometimes you use item.get_global_id(0) and sometimes item.get_global_id()[0]. I know that the former is more or less syntactic sugar for the latter. However, they could possibly do different things, leaving the reader wondering why one version is sometimes preferred over the other.
    • Sometimes constructs are used that are more often found in C than in C++ (e.g., DBL_MAX on page 399 instead of std::numeric_limits<double>::max() or "normal" C malloc (not USM malloc) in some provided code samples).

And last but not least, one question out of curiosity:
on page 243 in figure 10-1 you mentioned that one disadvantage of lambda expressions is that they can't be templated. C++20 introduced templated lambdas (see https://en.cppreference.com/w/cpp/language/lambda), e.g.,:

auto glambda = []<class T>(T a, auto&& b) { return a < b; };

Do you know, whether there are any plans for future SYCL/DPC++ versions to also support templated lambdas?

fig_14_11_array_reduction, out-of-bounds on Shared USM caught by DeviceSanitizer

Error message

====ERROR: DeviceSanitizer: out-of-bounds-access on Shared USM                                                                                                                                            
READ of size 4 at kernel <_ZTSZZN4sycl3_V16detail16NDRangeReductionILNS1_9reduction8strategyE1EE3runINS1_9auto_nameELi1ENS0_3ext6oneapi12experimental10propertiesISt5tupleIJEEEEZNS1_22reduction_parallel_
forIS7_LS4_0ELi1ESE_JNS1_14reduction_implIiSt4plusIvELi1ELm16ELb0EPiEEZZ4mainENKUlRNS0_7handlerEE_clESM_EUlNS0_2idILi1EEERT_E_EEEvSM_NS0_5rangeIXT1_EEET2_DpT3_EUlSQ_DpRT0_E_SK_EEvSM_RSt10shared_ptrINS1_
10queue_implEENS0_8nd_rangeIXT0_EEERT1_RT3_RSV_ENKUlSQ_E_clISJ_EEDaSQ_EUlNS0_7nd_itemILi1EEEE_> LID(0, 0, 0) GID(0, 0, 0)                                                                                 
  #0 decltype(std::forward<int&>(fp) + std::forward<int&>(fp0)) std::plus<void>::operator()<int&, int&>(int&, int&) const /usr/lib64/gcc/x86_64-suse-linux/7/../../../../include/c++/7/bits/stl_function.h
:238 

I think the right parallel_for should be this:

  q.submit([&](handler& h) {
     // BEGIN CODE SNIP
     h.parallel_for(
         range{N},
         reduction(span<int, B>(histogram, B), plus<>()),
         [=](id<1> i, auto& histogram) {
           histogram[data[i] % B]++;
         });
     // END CODE SNIP
   }).wait();

book's fig_1_3_race example fails if gpu has max_sub_devices=0

I played around with the race example long enough to figure out that it would not fail on my GPU. I could make a race occur on a cpu_selector device by changing the memcpy kernel to a parallel_for kernel copy and inserting a decrementing delay count in its kernel code.

With my GPU as the default device, no amount of delay would create the race, so I'm presuming it is because my GPU is only capable of one thread, which I verified by printing partition_max_sub_devices.

default_selector: Selected device: Intel(R) Iris(TM) Pro Graphics 580 [0x193b]
-> Device vendor: Intel(R) Corporation
-> Device partition_max_sub_devices: 0

I suggest changing the race example to explicitly use a cpu selector and to insert some delays in the first kernel loop.
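
A minimal sketch of the suggested change (SYCL 2020 selector syntax; the first edition code used a cpu_selector object): construct the queue on the CPU explicitly so that more than one thread can run and the race becomes observable.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Force the CPU device rather than whatever the default selector picks.
  sycl::queue q{sycl::cpu_selector_v};
  std::cout << "Selected device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}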

host_accessor with read_write access mode locks up with += operation

In playing with fig_3_13 example modifications, I see that this code executes properly:

host_accessor host_accCrw(C, read_write);
for (int i = 0; i < N; i++) {
  host_accCrw[i] = host_accCrw[i] + 100000;
}

while this code apparently is not handled correctly, since it locks up in execution:

host_accessor host_accCrw(C, read_write);
for (int i = 0; i < N; i++) {
  host_accCrw[i] += 100000;
}

I'm using the oneAPI Docker images on Ubuntu 20.04.

I'll attach my modified code
fig_3_13_read_after_write.zip

This shows the hang in accessor.hpp line 1083 when the += operator is attempted.

figure_3_13_hang

Default build not finding dpcpp

On Ubuntu 20.04 (LTS) with Intel oneAPI beta10 the CMake build fails.

c++: error: unrecognized command line option ‘-fsycl’
c++: error: unrecognized command line option ‘-fsycl-unnamed-lambda’
c++: error: unrecognized command line option ‘-ferror-limit=1’

Commands used

git clone https://github.com/Apress/data-parallel-CPP.git
cd data-parallel-CPP
source /opt/intel/oneapi/setvars.sh
cmake -B  build
cmake --build build

Note that on the system /usr/bin/c++ links to /usr/bin/g++ via /etc/alternatives/c++

It looks like dpcpp is not being used.
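
One workaround consistent with the build instructions above is to point CMake at the SYCL compiler explicitly, for example (dpcpp was the compiler driver name in the beta toolchains; recent oneAPI releases use icpx):

cmake -B build -DCMAKE_CXX_COMPILER=dpcpp
cmake --build build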
