Comments (4)
This error usually indicates that the kernel launch failed because it requested too many resources (too much shared memory or an invalid thread-block size).
Could you recompile with the -debug flag appended to the HL_TARGET variable and post the last part of the output, the part corresponding to the crash?
In most cases where I've encountered this error, the compiler was unable to determine tight bounds for the scheduler. Such applications typically had a lot of boundary conditions, and the bounds derived during scheduling differed from the ones actually needed in the end, which made the problem quite difficult to fix.
from halideautogpu.
Thanks for the quick reply! I added -debug to HL_TARGET and get the following:
g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/ -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi -frtti -I./bin -Wall -O3 process.cpp bin/deepcamera.a bin/deepcamera_auto_schedule.a bin/deepcamera_simple_auto_schedule.a bin/deepcamera_auto_schedule_store.a bin/deepcamera_auto_schedule_no_fus.a -o bin/process -ldl -lpthread -lz -ltinfo -lpng16 -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Entering Pipeline deepcamera_auto_schedule
Input Buffer input: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
Output Buffer output: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
CUDA: halide_cuda_initialize_kernels (user_context: 0x0, state_ptr: 0x562fc7acdb48, ptx_src: 0x562fc7897c20, size: 42052
load_libcuda (user_context: 0x0)
Loaded CUDA runtime library: libcuda.so
Got device 0
Tesla V100-SXM2-16GB
total memory: 4095 MB
max threads per block: 1024
warp size: 32
max block size: 1024 1024 64
max grid size: 2147483647 65535 65535
max shared memory per block: 49152
max constant memory per block: 65536
compute capability 7.0
cuda cores: 80 x 0 = 0
cuCtxCreate 0 -> 0x562fc7ca01b0(3020)
cuModuleLoadData 0x562fc7897c20, 42052 -> 0x562fc842efb0
Time: 2.834470e-01 ms
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
halide_device_malloc: target device interface 0x562fc7ac6108
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x7ffe912f9890)
allocating buffer(0, 0x0, 0x7f6d0931c080, 1, uint16, {0, 1536, 1}, {0, 2560, 1536})
cuMemAlloc 7864320 -> 0x7f6ce7800000
Time: 1.600770e-01 ms
halide_copy_to_device 0x7ffe912f9890 host is dirty
c.extent[0] = 1536
c.extent[1] = 2560
CUDA: halide_cuda_buffer_copy (user_context: 0x0, src: 0x7ffe912f9890, dst: 0x7ffe912f9890)
from host to device, 0x7f6d0931c080 -> 0x7f6ce7800000, 7864320 bytes
cuMemcpyHtoD(0x7f6ce7800000, 0x7f6d0931c080, 7864320)
Time: 9.949730e-01 ms
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
halide_device_malloc: target device interface 0x562fc7ac6108
CUDA: halide_cuda_device_malloc (user_context: 0x0, buf: 0x7ffe912f9920)
allocating buffer(0, 0x0, 0x7f6d08b1b080, 0, uint16, {0, 2048, 1}, {0, 2048, 2048})
cuMemAlloc 8388608 -> 0x7f6cd2000000
Time: 1.766050e-01 ms
CUDA: halide_cuda_run (user_context: 0x0, entry: kernel_output_s0_y_y_o___block_id_y, blocks: 34x171x1, threads: 2048x2048x1, shmem: 19006078
Got context.
Got module 0x562fc842efb0
Got function 0x562fc843cee0
halide_cuda_run 0 4 [0x0 ...] 0
halide_cuda_run 1 4 [0x60000000000 ...] 0
halide_cuda_run 2 4 [0x80000000600 ...] 0
halide_cuda_run 3 4 [0x9ff00000800 ...] 0
halide_cuda_run 4 4 [0x9ff ...] 0
halide_cuda_run 5 4 [0x60000000000 ...] 0
halide_cuda_run 6 4 [0xa0000000600 ...] 0
halide_cuda_run 7 8 [0x7f6ce7800000 ...] 1
halide_cuda_run 8 8 [0x7f6cd2000000 ...] 1
halide_cuda_run translated arg7 [0x7f6ce7800000 ...]
halide_cuda_run translated arg8 [0x7f6cd2000000 ...]
Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE
Makefile:49: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)
The application I am compiling is here: https://github.com/dillonhuff/HalideAutoGPU/tree/dhuff_experiments/TACO_Benchmarks/deepcamera
So for some reason the thread-block dimensions appear to be equal to the entire image.
I've run into this issue a couple of times in the past, and unfortunately there's no easy fix. The bounds derived by the compiler during the scheduling step differ from the final ones, so schedules that appear valid can crash at runtime when the kernel launches.
In any case, I tried to debug this issue on the app you linked, and there are two quick fixes:
- Replace the downsample method with something similar to the one in the lens_blur app:
https://github.com/halide/Halide/blob/0bb61721124bbfed0067a5902c141dbf65367599/apps/lens_blur/lens_blur_generator.cpp#L60-L63
Alternatively, in the above application, replace gaussian_blur with something like this:
vector<Func> gauss_pyramid(Func l0) {
    vector<Func> gPyramid;
    vector<Func> gPyramid_clamped;
    gPyramid.resize(pyramid_levels);
    gPyramid_clamped.resize(pyramid_levels);
    gPyramid[0](x, y) = l0(x, y);
    gPyramid_clamped[0](x, y) = l0(x, y);
    Expr w = input.dim(0).extent(), h = input.dim(1).extent();
    for (int j = 1; j < pyramid_levels; j++) {
        gPyramid[j](x, y) = downsample(gPyramid[j - 1])(x, y);
        w /= 2;
        h /= 2;
        gPyramid_clamped[j] = BoundaryConditions::repeat_edge(gPyramid[j], {{0, w}, {0, h}});
    }
    return gPyramid_clamped;
}
- For the apps that crash, export HL_GPU_L2_COST=10 to tell the scheduler to fuse fewer stages and avoid the bounds explosion altogether.
In both cases the code seems to run fine, although I cannot really verify its correctness.
@savsiout I tried that and it worked. Thanks!