
halideautogpu's Issues

Do the benchmarks measure the time to transfer the images from the host (CPU) to the GPU and back?

Hello again! I have a question about what is included in the times reported by the ./run_tests.sh script. Do the times reported for each version of the application include the time required to transfer the input buffer from host (CPU) memory to device (GPU) memory and then to transfer the resulting output buffer from the device back to the host?

I assume that the times are computed by the benchmark code included with each app (for example):

Buffer<uint16_t> output(input.width(), input.height(), 3);
local_laplacian(input, levels, alpha/(levels-1), beta, output);
output.device_sync();
multi_way_bench({
    {"Manual", [&]() { local_laplacian(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
#ifndef NO_AUTO_SCHEDULE
    {"Nested auto-scheduled", [&]() { local_laplacian_auto_schedule_store(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"Auto-scheduled", [&]() { local_laplacian_auto_schedule(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"No-fusion auto-scheduled", [&]() { local_laplacian_auto_schedule_no_fus(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"Simple auto-scheduled", [&]() { local_laplacian_simple_auto_schedule(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }}
#endif
});

But I'm not very familiar with the Halide CUDA runtime, so I could not tell what is actually included in the function calls that are timed in the benchmark.
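For reference, my mental model of multi_way_bench (an assumption based on Halide's tools/benchmark.h, not verified against this repo) is that it simply times each named lambda with a wall clock, so only what the lambda itself does — the pipeline call plus device_sync() — is measured. A Halide-free sketch of that pattern:

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Time each named lambda with a wall clock, the way I assume multi_way_bench
// does: whatever the lambda omits (e.g. a copy_to_host call) is not measured.
std::map<std::string, double> time_all(
        const std::map<std::string, std::function<void()>> &impls,
        int iters = 10) {
    std::map<std::string, double> avg_ms;
    for (const auto &kv : impls) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++) kv.second();
        auto t1 = std::chrono::steady_clock::now();
        avg_ms[kv.first] =
            std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
    }
    return avg_ms;
}
```

If that model is right, then (as I understand Halide's lazy copies) the host-to-device transfer happens at most once, on the first pipeline call, and the device-to-host transfer is excluded entirely unless a lambda explicitly calls output.copy_to_host().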

Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE

@savsiout I've been trying to run some larger pipelines and have seen the following error several times, where the auto-scheduler runs to completion, but then the CUDA runtime seems to crash:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/deepcamera.a bin/deepcamera_auto_schedule.a bin/deepcamera_simple_auto_schedule.a bin/deepcamera_auto_schedule_store.a bin/deepcamera_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE
Makefile:48: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

Have you seen this error before and do you have any suggestions about how I could fix it? I can provide more details about the applications that crash if that would be helpful.
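In case it helps while I debug on my end: my current (unverified) guess is that the generated schedule ends up requesting an illegal launch configuration, since cuLaunchKernel returns CUDA_ERROR_INVALID_VALUE when, among other things, the block dimensions exceed the device limits. A tiny standalone check against the limits I believe apply to all recent compute capabilities:

```cpp
// Check a proposed CUDA block shape against the limits I believe hold for
// compute capability 3.0+: blockDim.x, blockDim.y <= 1024, blockDim.z <= 64,
// and at most 1024 threads per block in total. (An assumption; the real
// limits come from cudaGetDeviceProperties on the actual device.)
bool block_dims_valid(int tx, int ty, int tz) {
    if (tx < 1 || ty < 1 || tz < 1) return false;
    if (tx > 1024 || ty > 1024 || tz > 64) return false;
    return 1LL * tx * ty * tz <= 1024;   // total threads per block
}
```

So, for example, an 8x4 tile is fine, but if the auto-scheduler fused enough stages to push the implied thread extents past 1024 in total, the launch would fail exactly this way.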

Instructions for using this GPU auto-scheduler

Hey guys, I'm a PhD student at Stanford working with Pat Hanrahan and Mark Horowitz on custom image processing / ML hardware. We're looking for a good GPU auto-scheduling baseline and Andrew Adams pointed us to your paper: http://www.es.ele.tue.nl/~tbasten/papers/TACO_GPU_camera_ready.pdf (which I think this repo is for).

I'd like to run your GPU auto-scheduler, but I don't see the setup instructions for it. Do I need to do anything special to get it running / configure parameters for a given GPU?

Error when auto-scheduling with Float(16)

@savsiout hello again!

I'm trying to auto-schedule a Gaussian pyramid with 16-bit floats as the basic number type instead of 32-bit floats. My current code is here:
https://github.com/dillonhuff/HalideAutoGPU/blob/195ee850bae24ebffdfef3a9828340630f3045db/TACO_Benchmarks/gausspyramid_fp16/gausspyramid_generator.cpp#L23-L32

When I run this code through the auto-scheduler, it runs and seems to generate code, but when I try to run the generated code on a V100 GPU on AWS I get the following error:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/gausspyramid.a bin/gausspyramid_auto_schedule.a bin/gausspyramid_simple_auto_schedule.a bin/gausspyramid_auto_schedule_store.a bin/gausspyramid_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX
Makefile:51: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

When I use Float(32) I do not have this problem. Do you have any idea how I could fix this issue?
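While I wait, the workaround I'm considering (my own assumption, not something from this repo) is to keep Float(16) purely as a storage type and do the arithmetic in 32-bit, since fp16 storage with float compute avoids most fp16 PTX instructions. For my own reference, the bit-level conversion that 16-bit storage implies — a sketch only (truncating rounding, normals and signed zero only, no NaN/denormal handling):

```cpp
#include <cstdint>
#include <cstring>

// float -> IEEE binary16 bits. Sketch: truncating rounding, overflow clamped
// to +/-inf, underflow flushed to signed zero (no denormals).
uint16_t float_to_half(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    uint16_t sign = (x >> 16) & 0x8000;
    int32_t  exp  = (int32_t)((x >> 23) & 0xFF) - 127 + 15;  // rebias 8->5 bit
    uint32_t man  = x & 0x7FFFFF;
    if (exp >= 31) return sign | 0x7C00;   // overflow -> +/-inf
    if (exp <= 0)  return sign;            // underflow -> +/-0
    return (uint16_t)(sign | (exp << 10) | (man >> 13));
}

// IEEE binary16 bits -> float. Handles signed zero and normals; infinities
// and NaNs are not treated specially in this sketch.
float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t x = (exp == 0) ? sign  // zero (denormals flushed)
                            : sign | ((exp - 15 + 127) << 23) | (man << 13);
    float f;
    std::memcpy(&f, &x, sizeof f);
    return f;
}
```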

Thanks in advance!

Invalid Schedule error when trying a new application

Hey guys, I'm trying to add a new application (a simple Gaussian pyramid) to the repo and auto-schedule it. The application is on my fork here: https://github.com/dillonhuff/HalideAutoGPU/tree/dhuff_experiments/TACO_Benchmarks/gausspyramid

When I run make test I get the following error:

ubuntu@ip-172-31-72-207:~/HalideAutoGPU/TACO_Benchmarks/gausspyramid$ make test
g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -g gausspyramid_generator.cpp ../autoscheduler/SimpleAutoSchedule.cpp ../autoscheduler/AutoSchedule.cpp ../autoscheduler/DerivativeUtils.cpp ../../distrib/lib/libHalide.a ../../distrib/tools/GenGen.cpp -o bin/gausspyramid.generator  -ldl -lpthread -lz -ltinfo -lz -lrt -ldl -ltinfo -lpthread -lm -rdynamic
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid target=host-cuda-cuda_capability_61-no_runtime auto_schedule=false
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid_auto_schedule target=host-cuda-cuda_capability_61 auto_schedule=true -p ../autoscheduler/bin/libauto_schedule.so  -e static_library,h,schedule

================
Pipeline graph:
================
ds: {ds.update(0)}
ds.update(0): {f0}
ds$1: {ds$1.update(0)}
ds$1.update(0): {f1}
ds$2: {ds$2.update(0)}
ds$2.update(0): {f2}
ds$3: {ds$3.update(0)}
ds$3.update(0): {f3}
ds$4: {ds$4.update(0)}
ds$4.update(0): {f4}
ds$5: {ds$5.update(0)}
ds$5.update(0): {f5}
ds$6: {ds$6.update(0)}
ds$6.update(0): {output}
f0: {ds$1.update(0)}
f1: {ds$2.update(0)}
f2: {ds$3.update(0)}
f3: {ds$4.update(0)}
f4: {ds$5.update(0)}
f5: {ds$6.update(0)}
repeat_edge: {ds.update(0)}
================

================
Pipeline bounds:
================
ds -> {[-6, 25], [-6, 25]}
ds$1 -> {[-5, 26], [-5, 26]}
ds$2 -> {[-4, 27], [-4, 27]}
ds$3 -> {[-3, 28], [-3, 28]}
ds$4 -> {[-2, 29], [-2, 29]}
ds$5 -> {[-1, 30], [-1, 30]}
ds$6 -> {[0, 31], [0, 31]}
f0 -> {[-6, 25], [-6, 25]}
f1 -> {[-5, 26], [-5, 26]}
f2 -> {[-4, 27], [-4, 27]}
f3 -> {[-3, 28], [-3, 28]}
f4 -> {[-2, 29], [-2, 29]}
f5 -> {[-1, 30], [-1, 30]}
input -> {[0, 24], [0, 24]}
output -> {[0, 31], [0, 31]}
repeat_edge -> {[-7, 24], [-7, 24]}
===============
g name output
GROUP OF output
SH MEM 896.000000
 ACT THR 1024.000000f
 OCC 0.500000f
inlined f0
inlined f1
inlined f2
inlined f3
inlined f4
inlined f5
inlined repeat_edge
// Target: x86-64-linux-avx-avx2-cuda-cuda_capability_61-f16c-fma-sse41
// MachineParams: 32,16777216,4

// Delete this line if not using Generator
Pipeline pipeline = get_pipeline();

Var x_i("x_i");
Var x_o("x_o");
Var y_i("y_i");
Var y_o("y_o");

Func ds = pipeline.get_func(6);
Func ds_1 = pipeline.get_func(9);
Func ds_2 = pipeline.get_func(12);
Func ds_3 = pipeline.get_func(15);
Func ds_4 = pipeline.get_func(18);
Func ds_5 = pipeline.get_func(21);
Func ds_6 = pipeline.get_func(24);
Func output = pipeline.get_func(28);

{
    Var x = ds.args()[0];
    Var y = ds.args()[1];
    RVar reduce$x(ds.update(0).get_schedule().rvars()[0].var);
    RVar reduce$y(ds.update(0).get_schedule().rvars()[1].var);
    ds
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds.update(0)
        .reorder(reduce$x, reduce$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$x)
        .unroll(reduce$y);
}
{
    Var x = ds_1.args()[0];
    Var y = ds_1.args()[1];
    RVar reduce$1$x(ds_1.update(0).get_schedule().rvars()[0].var);
    RVar reduce$1$y(ds_1.update(0).get_schedule().rvars()[1].var);
    ds_1
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_1.update(0)
        .reorder(reduce$1$x, reduce$1$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$1$x)
        .unroll(reduce$1$y);
}
{
    Var x = ds_2.args()[0];
    Var y = ds_2.args()[1];
    RVar reduce$2$x(ds_2.update(0).get_schedule().rvars()[0].var);
    RVar reduce$2$y(ds_2.update(0).get_schedule().rvars()[1].var);
    ds_2
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_2.update(0)
        .reorder(reduce$2$x, reduce$2$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$2$x)
        .unroll(reduce$2$y);
}
{
    Var x = ds_3.args()[0];
    Var y = ds_3.args()[1];
    RVar reduce$3$x(ds_3.update(0).get_schedule().rvars()[0].var);
    RVar reduce$3$y(ds_3.update(0).get_schedule().rvars()[1].var);
    ds_3
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_3.update(0)
        .reorder(reduce$3$x, reduce$3$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$3$x)
        .unroll(reduce$3$y);
}
{
    Var x = ds_4.args()[0];
    Var y = ds_4.args()[1];
    RVar reduce$4$x(ds_4.update(0).get_schedule().rvars()[0].var);
    RVar reduce$4$y(ds_4.update(0).get_schedule().rvars()[1].var);
    ds_4
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_4.update(0)
        .reorder(reduce$4$x, reduce$4$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$4$x)
        .unroll(reduce$4$y);
}
{
    Var x = ds_5.args()[0];
    Var y = ds_5.args()[1];
    RVar reduce$5$x(ds_5.update(0).get_schedule().rvars()[0].var);
    RVar reduce$5$y(ds_5.update(0).get_schedule().rvars()[1].var);
    ds_5
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_5.update(0)
        .reorder(reduce$5$x, reduce$5$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$5$x)
        .unroll(reduce$5$y);
}
{
    Var x = ds_6.args()[0];
    Var y = ds_6.args()[1];
    RVar reduce$6$x(ds_6.update(0).get_schedule().rvars()[0].var);
    RVar reduce$6$y(ds_6.update(0).get_schedule().rvars()[1].var);
    ds_6
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_6.update(0)
        .reorder(reduce$6$x, reduce$6$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$6$x)
        .unroll(reduce$6$y);
}
{
    Var x = output.args()[0];
    Var y = output.args()[1];
    output
        .compute_root()
        .split(x, x_o, x_i, 8)
        .split(y, y_o, y_i, 4)
        .reorder(x_i, y_i, x_o, y_o)
        .gpu_threads(x_i)
        .gpu_threads(y_i)
        .gpu_blocks(y_o)
        .gpu_blocks(x_o);
}


TOTAL INLINES 7
HL_USE_SIMPLE_AUTOSCHEDULER=1 \
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid_simple_auto_schedule target=host-cuda-cuda_capability_61-no_runtime auto_schedule=false -e static_library,h
HL_AUTO_FOLDED_FUSION=1 \
bin/gausspyramid.generator  -g gausspyramid -o ./bin -f gausspyramid_auto_schedule_store target=host-cuda-cuda_capability_61 auto_schedule=true  -p ../autoscheduler/bin/libauto_schedule.so -e static_library,h,schedule 

================
Pipeline graph:
================
ds: {ds.update(0)}
ds.update(0): {f0}
ds$1: {ds$1.update(0)}
ds$1.update(0): {f1}
ds$2: {ds$2.update(0)}
ds$2.update(0): {f2}
ds$3: {ds$3.update(0)}
ds$3.update(0): {f3}
ds$4: {ds$4.update(0)}
ds$4.update(0): {f4}
ds$5: {ds$5.update(0)}
ds$5.update(0): {f5}
ds$6: {ds$6.update(0)}
ds$6.update(0): {output}
f0: {ds$1.update(0)}
f1: {ds$2.update(0)}
f2: {ds$3.update(0)}
f3: {ds$4.update(0)}
f4: {ds$5.update(0)}
f5: {ds$6.update(0)}
repeat_edge: {ds.update(0)}
================

================
Pipeline bounds:
================
ds -> {[-6, 25], [-6, 25]}
ds$1 -> {[-5, 26], [-5, 26]}
ds$2 -> {[-4, 27], [-4, 27]}
ds$3 -> {[-3, 28], [-3, 28]}
ds$4 -> {[-2, 29], [-2, 29]}
ds$5 -> {[-1, 30], [-1, 30]}
ds$6 -> {[0, 31], [0, 31]}
f0 -> {[-6, 25], [-6, 25]}
f1 -> {[-5, 26], [-5, 26]}
f2 -> {[-4, 27], [-4, 27]}
f3 -> {[-3, 28], [-3, 28]}
f4 -> {[-2, 29], [-2, 29]}
f5 -> {[-1, 30], [-1, 30]}
input -> {[0, 24], [0, 24]}
output -> {[0, 31], [0, 31]}
repeat_edge -> {[-7, 24], [-7, 24]}
===============
g name output
GROUP OF output
SH MEM 896.000000
 ACT THR 1024.000000f
 OCC 0.500000f
inlined f0
inlined f1
inlined f2
inlined f3
inlined f4
inlined f5
inlined repeat_edge
Stage ds$5 compute at ds$6 , reduce$6$x
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$x
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$4 compute at ds$5 , reduce$5$x
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$x
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$3 compute at ds$4 , reduce$4$x
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$x
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$2 compute at ds$3 , reduce$3$x
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$x
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$1 compute at ds$2 , reduce$2$x
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$x
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds compute at ds$1 , reduce$1$x
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$x
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$y
cons ds$6 prod ds$6
cons output prod ds$6
cons ds$5 prod ds$5
cons f5 prod ds$5
cons ds$4 prod ds$4
cons f4 prod ds$4
cons ds$3 prod ds$3
cons f3 prod ds$3
cons ds$2 prod ds$2
cons f2 prod ds$2
cons ds$1 prod ds$1
cons f1 prod ds$1
cons ds prod ds
cons f0 prod ds
// Target: x86-64-linux-avx-avx2-cuda-cuda_capability_61-f16c-fma-sse41
// MachineParams: 32,16777216,4

// Delete this line if not using Generator
Pipeline pipeline = get_pipeline();

Var x_i("x_i");
Var x_o("x_o");
Var y_i("y_i");
Var y_o("y_o");

Func ds = pipeline.get_func(6);
Func ds_1 = pipeline.get_func(9);
Func ds_2 = pipeline.get_func(12);
Func ds_3 = pipeline.get_func(15);
Func ds_4 = pipeline.get_func(18);
Func ds_5 = pipeline.get_func(21);
Func ds_6 = pipeline.get_func(24);
Func output = pipeline.get_func(28);

{
    Var x = ds.args()[0];
    Var y = ds.args()[1];
    RVar reduce$x(ds.update(0).get_schedule().rvars()[0].var);
    RVar reduce$y(ds.update(0).get_schedule().rvars()[1].var);
    ds
        .reorder(x, y)
        .compute_at(ds_1, x);
    ds.update(0)
        .reorder(reduce$x, reduce$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$x)
        .unroll(reduce$y);
}
{
    Var x = ds_1.args()[0];
    Var y = ds_1.args()[1];
    RVar reduce$1$x(ds_1.update(0).get_schedule().rvars()[0].var);
    RVar reduce$1$y(ds_1.update(0).get_schedule().rvars()[1].var);
    ds_1
        .reorder(x, y)
        .compute_at(ds_2, x);
    ds_1.update(0)
        .reorder(reduce$1$x, reduce$1$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$1$x)
        .unroll(reduce$1$y);
}
{
    Var x = ds_2.args()[0];
    Var y = ds_2.args()[1];
    RVar reduce$2$x(ds_2.update(0).get_schedule().rvars()[0].var);
    RVar reduce$2$y(ds_2.update(0).get_schedule().rvars()[1].var);
    ds_2
        .reorder(x, y)
        .compute_at(ds_3, x);
    ds_2.update(0)
        .reorder(reduce$2$x, reduce$2$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$2$x)
        .unroll(reduce$2$y);
}
{
    Var x = ds_3.args()[0];
    Var y = ds_3.args()[1];
    RVar reduce$3$x(ds_3.update(0).get_schedule().rvars()[0].var);
    RVar reduce$3$y(ds_3.update(0).get_schedule().rvars()[1].var);
    ds_3
        .reorder(x, y)
        .compute_at(ds_4, x);
    ds_3.update(0)
        .reorder(reduce$3$x, reduce$3$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$3$x)
        .unroll(reduce$3$y);
}
{
    Var x = ds_4.args()[0];
    Var y = ds_4.args()[1];
    RVar reduce$4$x(ds_4.update(0).get_schedule().rvars()[0].var);
    RVar reduce$4$y(ds_4.update(0).get_schedule().rvars()[1].var);
    ds_4
        .reorder(x, y)
        .compute_at(ds_5, x);
    ds_4.update(0)
        .reorder(reduce$4$x, reduce$4$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$4$x)
        .unroll(reduce$4$y);
}
{
    Var x = ds_5.args()[0];
    Var y = ds_5.args()[1];
    RVar reduce$5$x(ds_5.update(0).get_schedule().rvars()[0].var);
    RVar reduce$5$y(ds_5.update(0).get_schedule().rvars()[1].var);
    ds_5
        .reorder(x, y)
        .compute_at(ds_6, x);
    ds_5.update(0)
        .reorder(reduce$5$x, reduce$5$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$5$x)
        .unroll(reduce$5$y);
}
{
    Var x = ds_6.args()[0];
    Var y = ds_6.args()[1];
    RVar reduce$6$x(ds_6.update(0).get_schedule().rvars()[0].var);
    RVar reduce$6$y(ds_6.update(0).get_schedule().rvars()[1].var);
    ds_6
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_6.update(0)
        .reorder(reduce$6$x, reduce$6$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$6$x)
        .unroll(reduce$6$y);
}
{
    Var x = output.args()[0];
    Var y = output.args()[1];
    output
        .compute_root()
        .split(x, x_o, x_i, 8)
        .split(y, y_o, y_i, 4)
        .reorder(x_i, y_i, x_o, y_o)
        .gpu_threads(x_i)
        .gpu_threads(y_i)
        .gpu_blocks(y_o)
        .gpu_blocks(x_o);
}


TOTAL INLINES 7
Error at ../../distrib/tools/GenGen.cpp:4:
Invalid schedule: Loop over ds$5.s1.y.__thread_id_y cannot be inside of loop over ds$6.s1.x.__thread_id_x
Aborted (core dumped)
Makefile:29: recipe for target 'bin/gausspyramid_auto_schedule_store.a' failed
make: *** [bin/gausspyramid_auto_schedule_store.a] Error 134

Any idea what I'm doing wrong here?

In case it helps, I've noticed that replacing this code:
https://github.com/dillonhuff/HalideAutoGPU/blob/c9575bc42575303af62111d691db7e56c6b74a77/TACO_Benchmarks/gausspyramid/gausspyramid_generator.cpp#L31-L38

with this code:
https://github.com/dillonhuff/HalideAutoGPU/blob/c9575bc42575303af62111d691db7e56c6b74a77/TACO_Benchmarks/gausspyramid/gausspyramid_generator.cpp#L27-L29

seems to make the error go away.

Thanks!

Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX on the K80

Hey Savvas, I'm back with another question. I'm trying to run the AutoScheduler on a K80 using the same setup as I've been using for the V100. Now I am getting an error when I try to run the auto-scheduled code:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/downsample.a bin/downsample_auto_schedule.a bin/downsample_simple_auto_schedule.a bin/downsample_auto_schedule_store.a bin/downsample_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX
Makefile:51: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

nvidia-smi -i 0 gives:

ubuntu@ip-172-31-48-185:~/HalideAutoGPU/TACO_Benchmarks/downsample$ nvidia-smi -i 0
Wed Sep  2 22:41:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Do you have any ideas about how I can fix this?
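One thing I noticed while diffing my two setups, in case it's relevant (this is just a guess): the generator invocations bake in cuda_capability_61, which matches the V100-era setup, but the K80 is compute capability 3.7, and PTX built for a newer capability would explain CUDA_ERROR_INVALID_PTX. If the downsample Makefile follows the same pattern as the other apps' Makefiles (an assumption on my part), the fix might be as simple as lowering the target feature — cuda_capability_35 is the closest Halide feature I know of at or below 3.7:

```
bin/downsample.generator -g downsample -o ./bin -f downsample_auto_schedule \
    target=host-cuda-cuda_capability_35 auto_schedule=true \
    -p ../autoscheduler/bin/libauto_schedule.so -e static_library,h,schedule
```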
