
halideautogpu's Introduction

Halide

Halide is a programming language designed to make it easier to write high-performance image processing code on modern machines. Halide currently targets:

  • CPU architectures: X86, ARM, MIPS, Hexagon, PowerPC
  • Operating systems: Linux, Windows, Mac OS X, Android, iOS, Qualcomm QuRT
  • GPU Compute APIs: CUDA, OpenCL, OpenGL, OpenGL Compute Shaders, Apple Metal, Microsoft DirectX 12

Rather than being a standalone programming language, Halide is embedded in C++. This means you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API. You can then compile this representation to an object file, or JIT-compile it and run it in the same process.
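
A minimal sketch of this model (the JIT path shown here is one of the two options; the AOT path is noted in the comment):

#include "Halide.h"
using namespace Halide;

int main() {
    // Building the pipeline just constructs C++ objects in memory.
    Func gradient("gradient");
    Var x("x"), y("y");
    gradient(x, y) = x + y;

    // JIT-compile the pipeline and run it in this process.
    Buffer<int32_t> out = gradient.realize(800, 600);

    // Alternatively, emit an object file for ahead-of-time use:
    // gradient.compile_to_file("gradient", {}, "gradient");
    return 0;
}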

For more detail about what Halide is, see http://halide-lang.org.

For API documentation see http://halide-lang.org/docs

To see some example code, look in the tutorials directory.

If you've acquired a full source distribution and want to build Halide, see the notes below.

Build Status

Linux
linux build status

Building Halide

TL;DR

Have llvm-6.0 or greater installed and run make in the root directory of the repository (where this README is).

Acquiring LLVM

Building Halide requires at least LLVM 6.0, along with the matching version of Clang. llvm-config and clang must be somewhere in the path. If your OS does not have packages for llvm-6.0, you can find binaries for it at http://llvm.org/releases/download.html. Download an appropriate package and then either install it, or at least put the bin subdirectory in your path. (This works well on OS X and Ubuntu.)

If you want to build it yourself, first check it out from subversion:

% svn co https://llvm.org/svn/llvm-project/llvm/branches/release_60 llvm6.0
% svn co https://llvm.org/svn/llvm-project/cfe/branches/release_60 llvm6.0/tools/clang

Then build it like so:

% cd llvm6.0
% mkdir build
% cd build
% cmake -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Mips;PowerPC" -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_BUILD_TYPE=Release ..
% make -j8

then to point Halide to it:

export LLVM_CONFIG=<path to llvm>/build/bin/llvm-config
export CLANG=<path to llvm>/build/bin/clang

Building Halide with make

With LLVM_CONFIG and CLANG set (or llvm-config and clang in your path), you should be able to just run make in the root directory of the Halide source tree. make run_tests will run the JIT test suite, and make test_apps will make sure all the apps compile and run (but won't check their output).
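
That is, from the root of the source tree:

% make
% make run_tests
% make test_apps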

There is no make install yet. If you want to make an install package, run make distrib.

Building Halide out-of-tree with make

If you wish to build Halide in a separate directory, you can do that like so:

% cd ..
% mkdir halide_build
% cd halide_build
% make -f ../Halide/Makefile

Building Halide with cmake

If you wish to use cmake to build Halide, the build procedure is:

% mkdir cmake_build
% cd cmake_build
% cmake -DLLVM_DIR=/path-to-llvm-build/lib/cmake/llvm -DCMAKE_BUILD_TYPE=Release /path/to/halide
% make -j8

LLVM_DIR should be the folder in the LLVM installation or build tree that contains LLVMConfig.cmake.

Building Halide and LLVM on Windows

Acquire MSVC 2015 Update 3 or newer. Earlier versions may work but are not part of our tests. MSBuild and cmake should also be in your path. The instructions below assume Halide is checked out under C:\Code\Halide, and LLVM and Clang are checked out under C:\Code\llvm.

% mkdir C:\Code\llvm-build
% cd C:\Code\llvm-build
% cmake -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_BUILD_32_BITS=OFF -DCMAKE_BUILD_TYPE=Release ../llvm -G "Visual Studio 14 Win64"

For a 32-bit build use:

% cmake -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_TARGETS_TO_BUILD=X86;ARM;NVPTX;AArch64;Mips;Hexagon -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_BUILD_32_BITS=ON -DCMAKE_BUILD_TYPE=Release ../llvm -G "Visual Studio 14"

Then build it like so:

% MSBuild.exe /m /t:Build /p:Configuration=Release .\INSTALL.vcxproj

You can substitute Debug for Release in both commands if you want a debug build.

To configure and build Halide:

% mkdir C:\Code\halide-build
% cd C:\Code\halide-build
% cmake -DLLVM_DIR=../llvm-install/lib/cmake/llvm -DCMAKE_BUILD_TYPE=Release -G "Visual Studio 14 Win64" ../halide
% MSBuild.exe /m /t:Build /p:Configuration=Release .\ALL_BUILD.vcxproj

Building Halide and LLVM on Windows using mingw

The makefile method above should work from inside a "mingw64" shell (not the default shell) in an msys2 installation.

If all else fails...

Do what the build-bots do: https://buildbot.halide-lang.org/master/#/builders

If the column that best matches your system is red, then the breakage probably isn't specific to your setup. If it's green, then you can click the "stdio" links in the latest build to see what commands the build bots run, and what the output was.

Some useful environment variables

HL_TARGET=... will set Halide's AOT compilation target.

HL_JIT_TARGET=... will set Halide's JIT compilation target.

HL_DEBUG_CODEGEN=1 will print out pseudocode for what Halide is compiling. Higher numbers will print more detail.

HL_NUM_THREADS=... specifies the size of the thread pool. This has no effect on OS X or iOS, where we just use Grand Central Dispatch.

HL_TRACE_FILE=... specifies a binary target file to dump tracing data into (ignored unless at least one trace_ feature is enabled in HL_TARGET or HL_JIT_TARGET). The output can be parsed programmatically by starting from the code in utils/HalideTraceViz.cpp.
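
For example, to JIT-compile for CUDA with codegen logging enabled (./bin/my_test is a hypothetical test binary, not part of the repository):

HL_JIT_TARGET=host-cuda HL_DEBUG_CODEGEN=1 ./bin/my_test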

Using Halide on OSX

Precompiled Halide distributions are built using Xcode's command-line tools with Apple clang 500.2.76. This means that we link against libc++ instead of libstdc++. You may need to adjust compiler options accordingly if you're using an older Xcode which does not default to libc++.

For parallelism, Halide automatically uses Apple's Grand Central Dispatch, so it is not possible to control the number of threads used without overriding the parallel runtime entirely.
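
If you do need full control, a minimal sketch of a serial replacement for the parallel runtime looks like this (the halide_do_par_for signature is taken from HalideRuntime.h; verify it against your Halide version). Defining this symbol in your application overrides the weakly-linked default in Halide's runtime:

#include "HalideRuntime.h"

// Replace Halide's thread pool with a serial loop.
extern "C" int halide_do_par_for(void *user_context, halide_task_t task,
                                 int min, int size, uint8_t *closure) {
    for (int i = min; i < min + size; i++) {
        int result = task(user_context, i, closure);
        if (result != 0) {
            return result;  // propagate the first failing task's error
        }
    }
    return 0;
}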

Halide OpenGL/GLSL backend

Halide's OpenGL backend offloads image processing operations to the GPU by generating GLSL-based fragment shaders.

Compared to other GPU-based processing options such as CUDA and OpenCL, OpenGL has two main advantages: it is available on basically every desktop computer and mobile device, and it is generally well supported across different hardware vendors.

The main disadvantage of OpenGL as an image processing framework is that the computational capabilities of fragment shaders are quite restricted. In general, the processing model provided by OpenGL is best suited to filters where each output pixel can be expressed as a simple function of the input pixels. This covers a wide range of interesting operations, such as point-wise filters and convolutions, but some common image processing operations, such as histograms and recursive filters, are notoriously hard to express in GLSL.
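
For instance, a point-wise brightening filter of the kind that maps naturally onto a fragment shader might be defined like this (a hedged sketch using the standard Halide API):

ImageParam in(UInt(8), 3);
Var x, y, c;
Func brighter("brighter");
// Each output pixel depends only on the corresponding input pixel.
Expr value = cast<float>(in(x, y, c)) * 1.5f;
brighter(x, y, c) = cast<uint8_t>(min(value, 255.0f));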

Writing OpenGL-Based Filters

To enable code generation for OpenGL, include opengl in the target specifier passed to Halide. Since OpenGL shaders are limited in their computational power, you must also specify a CPU target for those parts of the filter that cannot or should not be computed on the GPU. Examples of valid target specifiers are

host-opengl
x86-opengl-debug

Adding debug, as in the second example, adds additional logging output and is highly recommended during development.

By default, filters compiled for OpenGL targets run completely on the CPU. Execution on the GPU must be enabled for individual Funcs by appropriate scheduling calls.

GLSL fragment shaders implicitly iterate over two spatial dimensions x,y and the color channel. Due to the way color channels are handled in GLSL, only filters for which the color index is a compile-time constant can be scheduled. The main consequence is that the range of color variables must be explicitly specified for both input and output buffers before scheduling:

ImageParam input(UInt(8), 3);
Func f;
Var x, y, c;
f(x, y, c) = ...;

input.dim(2).set_bounds(0, 3);   // specify color range for input
f.bound(c, 0, 3);                // and output
f.glsl(x, y, c);

JIT Compilation

For JIT compilation, Halide attempts to load the system libraries for OpenGL and creates a new context to use for each module. Windows is not yet supported.
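
A hedged usage sketch for JIT-running the filter scheduled above (width and height are hypothetical output dimensions, and the ImageParam must first be bound with input.set(...)):

Target t = get_host_target().with_feature(Target::OpenGL);
Buffer<uint8_t> result = f.realize(width, height, 3, t);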

Examples for JIT execution of OpenGL-based filters can be found in test/opengl.

AOT Compilation

When AOT (ahead-of-time) compilation is used, Halide generates OpenGL-enabled object files that can be linked to and called from a host application. In general, this is fairly straightforward, but a few things must be taken care of.

On Linux, OS X, and Android, Halide creates its own OpenGL context unless the current thread already has an active context. On other platforms you have to link implementations of the following two functions with your Halide code:

extern "C" int halide_opengl_create_context(void *) {
    return 0;  // if successful
}

extern "C" void *halide_opengl_get_proc_addr(void *, const char *name) {
    ...
}

Halide allocates and deletes textures as necessary. Applications may manage the textures by hand by setting the buffer_t::dev field; this is most useful for reusing image data that is already stored in textures. Some rudimentary checks are performed to ensure that externally allocated textures have the correct format, but in general that's the responsibility of the application.

It is possible to render directly to the current framebuffer; to do this, set the dev field of the output buffer to the value returned by halide_opengl_output_client_bound. The example in apps/HelloAndroidGL demonstrates this technique.

Some operating systems can delete the OpenGL context of suspended applications. If this happens, Halide needs to re-initialize itself with the new context after the application resumes. Call halide_opengl_context_lost to reset Halide's OpenGL state after this has happened.

Limitations

The current implementation of the OpenGL backend targets the common subset of OpenGL 2.0 and OpenGL ES 2.0 which is widely available on both mobile devices and traditional computers. As a consequence, only a subset of the Halide language can be scheduled to run using OpenGL. Some important limitations are:

  • Reductions cannot be implemented in GLSL and must be run on the CPU.

  • OpenGL ES 2.0 only supports uint8 buffers.

    Support for floating point textures is available, but requires OpenGL (ES) 3.0 or the texture_float extension, which may not work on all mobile devices.

  • OpenGL ES 2.0 has very limited support for integer arithmetic. For maximum compatibility, consider doing all computations using floating point, even when using integer textures.

  • Only 2D images with 3 or 4 color channels can be scheduled. Images with one or two channels require OpenGL (ES) 3.0 or the texture_rg extension.

  • Not all builtin functions provided by Halide are currently supported; for example, fast_log, fast_exp, fast_pow, reinterpret, bit operations, random_float, and random_int cannot be used in GLSL code.

The maximum texture size in OpenGL is GL_MAX_TEXTURE_SIZE, which is often smaller than the image of interest; on mobile devices, for example, GL_MAX_TEXTURE_SIZE is commonly 2048. Tiling must be used to process larger images.

Planned features:

  • Support for half-float textures and arithmetic

  • Support for integer textures and arithmetic

(Note that OpenGL Compute Shaders are supported with a separate OpenGLCompute backend.)

Halide for Hexagon HVX

Halide supports offloading work to the Qualcomm Hexagon DSP on Qualcomm Snapdragon 820 devices or newer. The Hexagon DSP provides a set of 64- and 128-byte vector instructions - the Hexagon Vector eXtensions (HVX). HVX is well suited to image processing, and Halide for Hexagon HVX will generate the appropriate HVX vector instructions from a program authored in Halide.

Halide can be used to compile Hexagon object files directly, by using a target such as hexagon-32-qurt-hvx_64 or hexagon-32-qurt-hvx_128.

Halide can also be used to offload parts of a pipeline to Hexagon using the hexagon scheduling directive (see the sketch below). To enable the hexagon scheduling directive, include the hvx_64 or hvx_128 target features in your target. The currently supported combinations are the HVX target features with an x86 Linux host (to use the simulator) or with an ARM Android target (to use Hexagon DSP hardware). For examples of using the hexagon scheduling directive on both the simulator and a Hexagon DSP, see the blur example app.
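
A minimal sketch of the directive (standard Halide API; the pipeline, vector width, and target string here are illustrative assumptions, not the blur app's exact schedule):

#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam input(UInt(16), 2);
    Var x("x"), y("y");
    Func blur("blur");
    blur(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;

    // Offload the loop nest rooted at y to the Hexagon DSP, and
    // vectorize by 64 uint16 lanes (128 bytes, matching hvx_128).
    blur.hexagon(y).vectorize(x, 64);

    Target target("arm-64-android-hvx_128");
    blur.compile_to_file("blur_hvx", {input}, "blur_hvx", target);
    return 0;
}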

To build and run an example app using the Hexagon target,

  1. Obtain and build LLVM and Clang v5.0 or later from llvm.org
  2. Download and install the Hexagon SDK and version 8.0 Hexagon Tools
  3. Build and run an example for Hexagon HVX

1. Obtain and build LLVM and Clang v5.0 or later from llvm.org

The Hexagon backend is currently under development, so it's best to use trunk LLVM. The instructions below are the same as those above for building Clang/LLVM, but check out trunk instead of a release branch.

cd <path to llvm>
svn co http://llvm.org/svn/llvm-project/llvm/trunk .
svn co http://llvm.org/svn/llvm-project/cfe/trunk ./tools/clang
# Or:
#    git clone http://llvm.org/git/llvm.git .
#    git clone http://llvm.org/git/clang.git tools/clang
mkdir build
cd build
cmake -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_TARGETS_TO_BUILD="X86;ARM;NVPTX;AArch64;Mips;PowerPC;Hexagon" -DLLVM_ENABLE_ASSERTIONS=ON -DCMAKE_BUILD_TYPE=Release ..
make -j8
export LLVM_CONFIG=<path to llvm>/build/bin/llvm-config
export CLANG=<path to llvm>/build/bin/clang

2. Download and install the Hexagon SDK and version 8.0 Hexagon Tools

Go to https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

  1. Select the Hexagon Series 600 Software and download version 3.0 for Linux.
  2. Untar the installer.
  3. Run the extracted installer to install the Hexagon SDK and Hexagon Tools, installing the Hexagon SDK into /location/of/SDK/Hexagon_SDK/3.0 and the Hexagon Tools into /location/of/SDK/Hexagon_Tools/8.0.
  4. Set an environment variable to point to the SDK installation location
export SDK_LOC=/location/of/SDK

3. Build and run an example for Hexagon HVX

In addition to running Hexagon code on device, Halide also supports running Hexagon code on the simulator from the Hexagon tools.

To build and run the blur example in Halide/apps/blur on the simulator:

cd apps/blur
export HL_HEXAGON_SIM_REMOTE=../../src/runtime/hexagon_remote/bin/v60/hexagon_sim_remote
export HL_HEXAGON_TOOLS=$SDK_LOC/Hexagon_Tools/8.0/Tools/
LD_LIBRARY_PATH=../../src/runtime/hexagon_remote/bin/host/:$HL_HEXAGON_TOOLS/lib/iss/:. HL_TARGET=host-hvx_128 make test

To build and run the blur example in Halide/apps/blur on Android:

To build the example for Android, first ensure that you have a standalone toolchain created from the NDK using the make-standalone-toolchain.sh script:

export ANDROID_NDK_HOME=$SDK_LOC/Hexagon_SDK/3.0/tools/android-ndk-r10d/
export ANDROID_ARM64_TOOLCHAIN=<path to put new arm64 toolchain>
$ANDROID_NDK_HOME/build/tools/make-standalone-toolchain.sh --arch=arm64 --platform=android-21 --install-dir=$ANDROID_ARM64_TOOLCHAIN

Now build and run the blur example using the script to run it on device:

export HL_HEXAGON_TOOLS=$SDK_LOC/Hexagon_Tools/8.0/Tools/
HL_TARGET=arm-64-android-hvx_128 ./adb_run_on_device.sh


halideautogpu's Issues

Error when auto-scheduling when using Float(16)

@savsiout hello again!

I'm trying to auto-schedule a Gaussian pyramid with 16-bit floats as the basic number type instead of 32-bit floats. My current code is here:
https://github.com/dillonhuff/HalideAutoGPU/blob/195ee850bae24ebffdfef3a9828340630f3045db/TACO_Benchmarks/gausspyramid_fp16/gausspyramid_generator.cpp#L23-L32

When I run this code through the auto-scheduler it runs and seems to generate code, but when I try to run the generated code on a V100 GPU on AWS I get the following error:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/gausspyramid.a bin/gausspyramid_auto_schedule.a bin/gausspyramid_simple_auto_schedule.a bin/gausspyramid_auto_schedule_store.a bin/gausspyramid_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX
Makefile:51: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

When I use Float(32) I do not have this problem. Do you have any idea how I could fix this issue?

Thanks in advance!

Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE

@savsiout I've been trying to run some larger pipelines and have seen the following error several times, where the auto-scheduler runs to completion, but then the CUDA runtime seems to crash:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/deepcamera.a bin/deepcamera_auto_schedule.a bin/deepcamera_simple_auto_schedule.a bin/deepcamera_auto_schedule_store.a bin/deepcamera_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuLaunchKernel failed: CUDA_ERROR_INVALID_VALUE
Makefile:48: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

Have you seen this error before and do you have any suggestions about how I could fix it? I can provide more details about the applications that crash if that would be helpful.

Instructions about using this GPU auto-scheduler

Hey guys, I'm a PhD student at Stanford working with Pat Hanrahan and Mark Horowitz on custom image processing / ML hardware. We're looking for a good GPU auto-scheduling baseline and Andrew Adams pointed us to your paper: http://www.es.ele.tue.nl/~tbasten/papers/TACO_GPU_camera_ready.pdf (which I think this repo is for).

I'd like to run your GPU auto-scheduler, but I don't see the setup instructions for it. Do I need to do anything special to get it running / configure parameters for a given GPU?

Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX on the K80

Hey Savvas, I'm back with another question. I'm trying to run the AutoScheduler on a K80 using the same setup as I've been using for the V100. Now I am getting an error when I try to run the auto-scheduled code:

g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -I./bin -Wall -O3 process.cpp bin/downsample.a bin/downsample_auto_schedule.a bin/downsample_simple_auto_schedule.a bin/downsample_auto_schedule_store.a bin/downsample_auto_schedule_no_fus.a -o bin/process  -ldl -lpthread -lz -ltinfo -lpng16  -ljpeg -I/usr/include/libpng16 -I/usr/include/libpng16/..   
./bin/process ../images/gray.png 8 1 1 10 ./bin/out.png
Error: CUDA: cuModuleLoadData failed: CUDA_ERROR_INVALID_PTX
Makefile:51: recipe for target 'bin/out.png' failed
make: *** [bin/out.png] Aborted (core dumped)

nvidia-smi -i 0 gives:

ubuntu@ip-172-31-48-185:~/HalideAutoGPU/TACO_Benchmarks/downsample$ nvidia-smi -i 0
Wed Sep  2 22:41:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Do you have any ideas about how I can fix this?

Do the benchmarks measure the time to transfer the images from the host (CPU) to the GPU and back

Hello again! I have a question about what is included in the times reported by the ./run_tests.sh script. Do the times reported for each version of the application include the time required to transfer the input buffer from host (CPU) memory to device (GPU) memory and then to transfer the resulting output buffer from the device back to the host?

I assume that the times are computed by the benchmark code included with each app (for example):

Buffer<uint16_t> output(input.width(), input.height(), 3);
local_laplacian(input, levels, alpha/(levels-1), beta, output);
output.device_sync();
multi_way_bench({
    {"Manual", [&]() { local_laplacian(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
#ifndef NO_AUTO_SCHEDULE
    {"Nested auto-scheduled", [&]() { local_laplacian_auto_schedule_store(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"Auto-scheduled", [&]() { local_laplacian_auto_schedule(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"No-fusion auto-scheduled", [&]() { local_laplacian_auto_schedule_no_fus(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }},
    {"Simple auto-scheduled", [&]() { local_laplacian_simple_auto_schedule(input, levels, alpha/(levels-1), beta, output); output.device_sync(); }}
#endif
});

But I'm not very familiar with the Halide CUDA runtime, so I could not tell what is actually included in the function calls that are timed in the benchmark.

Invalid Schedule error when trying a new application

Hey guys, I'm trying to add a new application (a simple gaussian pyramid) to the repo and auto-schedule it. The application is on my fork here: https://github.com/dillonhuff/HalideAutoGPU/tree/dhuff_experiments/TACO_Benchmarks/gausspyramid

When I run make test I get the following error:

ubuntu@ip-172-31-72-207:~/HalideAutoGPU/TACO_Benchmarks/gausspyramid$ make test
g++ -Dcuda_alloc -std=c++11 -I ../../distrib/include/ -I ../../distrib/tools/ -I ../support/  -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -Wno-unknown-warning-option -Wno-psabi  -frtti -g gausspyramid_generator.cpp ../autoscheduler/SimpleAutoSchedule.cpp ../autoscheduler/AutoSchedule.cpp ../autoscheduler/DerivativeUtils.cpp ../../distrib/lib/libHalide.a ../../distrib/tools/GenGen.cpp -o bin/gausspyramid.generator  -ldl -lpthread -lz -ltinfo -lz -lrt -ldl -ltinfo -lpthread -lm -rdynamic
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid target=host-cuda-cuda_capability_61-no_runtime auto_schedule=false
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid_auto_schedule target=host-cuda-cuda_capability_61 auto_schedule=true -p ../autoscheduler/bin/libauto_schedule.so  -e static_library,h,schedule

================
Pipeline graph:
================
ds: {ds.update(0)}
ds.update(0): {f0}
ds$1: {ds$1.update(0)}
ds$1.update(0): {f1}
ds$2: {ds$2.update(0)}
ds$2.update(0): {f2}
ds$3: {ds$3.update(0)}
ds$3.update(0): {f3}
ds$4: {ds$4.update(0)}
ds$4.update(0): {f4}
ds$5: {ds$5.update(0)}
ds$5.update(0): {f5}
ds$6: {ds$6.update(0)}
ds$6.update(0): {output}
f0: {ds$1.update(0)}
f1: {ds$2.update(0)}
f2: {ds$3.update(0)}
f3: {ds$4.update(0)}
f4: {ds$5.update(0)}
f5: {ds$6.update(0)}
repeat_edge: {ds.update(0)}
================

================
Pipeline bounds:
================
ds -> {[-6, 25], [-6, 25]}
ds$1 -> {[-5, 26], [-5, 26]}
ds$2 -> {[-4, 27], [-4, 27]}
ds$3 -> {[-3, 28], [-3, 28]}
ds$4 -> {[-2, 29], [-2, 29]}
ds$5 -> {[-1, 30], [-1, 30]}
ds$6 -> {[0, 31], [0, 31]}
f0 -> {[-6, 25], [-6, 25]}
f1 -> {[-5, 26], [-5, 26]}
f2 -> {[-4, 27], [-4, 27]}
f3 -> {[-3, 28], [-3, 28]}
f4 -> {[-2, 29], [-2, 29]}
f5 -> {[-1, 30], [-1, 30]}
input -> {[0, 24], [0, 24]}
output -> {[0, 31], [0, 31]}
repeat_edge -> {[-7, 24], [-7, 24]}
===============
g name output
GROUP OF output
SH MEM 896.000000
 ACT THR 1024.000000f
 OCC 0.500000f
inlined f0
inlined f1
inlined f2
inlined f3
inlined f4
inlined f5
inlined repeat_edge
// Target: x86-64-linux-avx-avx2-cuda-cuda_capability_61-f16c-fma-sse41
// MachineParams: 32,16777216,4

// Delete this line if not using Generator
Pipeline pipeline = get_pipeline();

Var x_i("x_i");
Var x_o("x_o");
Var y_i("y_i");
Var y_o("y_o");

Func ds = pipeline.get_func(6);
Func ds_1 = pipeline.get_func(9);
Func ds_2 = pipeline.get_func(12);
Func ds_3 = pipeline.get_func(15);
Func ds_4 = pipeline.get_func(18);
Func ds_5 = pipeline.get_func(21);
Func ds_6 = pipeline.get_func(24);
Func output = pipeline.get_func(28);

{
    Var x = ds.args()[0];
    Var y = ds.args()[1];
    RVar reduce$x(ds.update(0).get_schedule().rvars()[0].var);
    RVar reduce$y(ds.update(0).get_schedule().rvars()[1].var);
    ds
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds.update(0)
        .reorder(reduce$x, reduce$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$x)
        .unroll(reduce$y);
}
{
    Var x = ds_1.args()[0];
    Var y = ds_1.args()[1];
    RVar reduce$1$x(ds_1.update(0).get_schedule().rvars()[0].var);
    RVar reduce$1$y(ds_1.update(0).get_schedule().rvars()[1].var);
    ds_1
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_1.update(0)
        .reorder(reduce$1$x, reduce$1$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$1$x)
        .unroll(reduce$1$y);
}
{
    Var x = ds_2.args()[0];
    Var y = ds_2.args()[1];
    RVar reduce$2$x(ds_2.update(0).get_schedule().rvars()[0].var);
    RVar reduce$2$y(ds_2.update(0).get_schedule().rvars()[1].var);
    ds_2
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_2.update(0)
        .reorder(reduce$2$x, reduce$2$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$2$x)
        .unroll(reduce$2$y);
}
{
    Var x = ds_3.args()[0];
    Var y = ds_3.args()[1];
    RVar reduce$3$x(ds_3.update(0).get_schedule().rvars()[0].var);
    RVar reduce$3$y(ds_3.update(0).get_schedule().rvars()[1].var);
    ds_3
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_3.update(0)
        .reorder(reduce$3$x, reduce$3$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$3$x)
        .unroll(reduce$3$y);
}
{
    Var x = ds_4.args()[0];
    Var y = ds_4.args()[1];
    RVar reduce$4$x(ds_4.update(0).get_schedule().rvars()[0].var);
    RVar reduce$4$y(ds_4.update(0).get_schedule().rvars()[1].var);
    ds_4
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_4.update(0)
        .reorder(reduce$4$x, reduce$4$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$4$x)
        .unroll(reduce$4$y);
}
{
    Var x = ds_5.args()[0];
    Var y = ds_5.args()[1];
    RVar reduce$5$x(ds_5.update(0).get_schedule().rvars()[0].var);
    RVar reduce$5$y(ds_5.update(0).get_schedule().rvars()[1].var);
    ds_5
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_5.update(0)
        .reorder(reduce$5$x, reduce$5$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$5$x)
        .unroll(reduce$5$y);
}
{
    Var x = ds_6.args()[0];
    Var y = ds_6.args()[1];
    RVar reduce$6$x(ds_6.update(0).get_schedule().rvars()[0].var);
    RVar reduce$6$y(ds_6.update(0).get_schedule().rvars()[1].var);
    ds_6
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_6.update(0)
        .reorder(reduce$6$x, reduce$6$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$6$x)
        .unroll(reduce$6$y);
}
{
    Var x = output.args()[0];
    Var y = output.args()[1];
    output
        .compute_root()
        .split(x, x_o, x_i, 8)
        .split(y, y_o, y_i, 4)
        .reorder(x_i, y_i, x_o, y_o)
        .gpu_threads(x_i)
        .gpu_threads(y_i)
        .gpu_blocks(y_o)
        .gpu_blocks(x_o);
}


TOTAL INLINES 7
HL_USE_SIMPLE_AUTOSCHEDULER=1 \
bin/gausspyramid.generator -g gausspyramid -o ./bin -f gausspyramid_simple_auto_schedule target=host-cuda-cuda_capability_61-no_runtime auto_schedule=false -e static_library,h
HL_AUTO_FOLDED_FUSION=1 \
bin/gausspyramid.generator  -g gausspyramid -o ./bin -f gausspyramid_auto_schedule_store target=host-cuda-cuda_capability_61 auto_schedule=true  -p ../autoscheduler/bin/libauto_schedule.so -e static_library,h,schedule 

================
Pipeline graph:
================
ds: {ds.update(0)}
ds.update(0): {f0}
ds$1: {ds$1.update(0)}
ds$1.update(0): {f1}
ds$2: {ds$2.update(0)}
ds$2.update(0): {f2}
ds$3: {ds$3.update(0)}
ds$3.update(0): {f3}
ds$4: {ds$4.update(0)}
ds$4.update(0): {f4}
ds$5: {ds$5.update(0)}
ds$5.update(0): {f5}
ds$6: {ds$6.update(0)}
ds$6.update(0): {output}
f0: {ds$1.update(0)}
f1: {ds$2.update(0)}
f2: {ds$3.update(0)}
f3: {ds$4.update(0)}
f4: {ds$5.update(0)}
f5: {ds$6.update(0)}
repeat_edge: {ds.update(0)}
================

================
Pipeline bounds:
================
ds -> {[-6, 25], [-6, 25]}
ds$1 -> {[-5, 26], [-5, 26]}
ds$2 -> {[-4, 27], [-4, 27]}
ds$3 -> {[-3, 28], [-3, 28]}
ds$4 -> {[-2, 29], [-2, 29]}
ds$5 -> {[-1, 30], [-1, 30]}
ds$6 -> {[0, 31], [0, 31]}
f0 -> {[-6, 25], [-6, 25]}
f1 -> {[-5, 26], [-5, 26]}
f2 -> {[-4, 27], [-4, 27]}
f3 -> {[-3, 28], [-3, 28]}
f4 -> {[-2, 29], [-2, 29]}
f5 -> {[-1, 30], [-1, 30]}
input -> {[0, 24], [0, 24]}
output -> {[0, 31], [0, 31]}
repeat_edge -> {[-7, 24], [-7, 24]}
===============
g name output
GROUP OF output
SH MEM 896.000000
 ACT THR 1024.000000f
 OCC 0.500000f
inlined f0
inlined f1
inlined f2
inlined f3
inlined f4
inlined f5
inlined repeat_edge
Stage ds$5 compute at ds$6 , reduce$6$x
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$x
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$5 compute at ds$6 , reduce$6$y
Stage ds$4 compute at ds$5 , reduce$5$x
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$x
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$4 compute at ds$5 , reduce$5$y
Stage ds$3 compute at ds$4 , reduce$4$x
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$x
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$3 compute at ds$4 , reduce$4$y
Stage ds$2 compute at ds$3 , reduce$3$x
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$x
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$2 compute at ds$3 , reduce$3$y
Stage ds$1 compute at ds$2 , reduce$2$x
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$x
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds$1 compute at ds$2 , reduce$2$y
Stage ds compute at ds$1 , reduce$1$x
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$x
Stage ds compute at ds$1 , reduce$1$y
Stage ds compute at ds$1 , reduce$1$y
cons ds$6 prod ds$6
cons output prod ds$6
cons ds$5 prod ds$5
cons f5 prod ds$5
cons ds$4 prod ds$4
cons f4 prod ds$4
cons ds$3 prod ds$3
cons f3 prod ds$3
cons ds$2 prod ds$2
cons f2 prod ds$2
cons ds$1 prod ds$1
cons f1 prod ds$1
cons ds prod ds
cons f0 prod ds
// Target: x86-64-linux-avx-avx2-cuda-cuda_capability_61-f16c-fma-sse41
// MachineParams: 32,16777216,4

// Delete this line if not using Generator
Pipeline pipeline = get_pipeline();

Var x_i("x_i");
Var x_o("x_o");
Var y_i("y_i");
Var y_o("y_o");

Func ds = pipeline.get_func(6);
Func ds_1 = pipeline.get_func(9);
Func ds_2 = pipeline.get_func(12);
Func ds_3 = pipeline.get_func(15);
Func ds_4 = pipeline.get_func(18);
Func ds_5 = pipeline.get_func(21);
Func ds_6 = pipeline.get_func(24);
Func output = pipeline.get_func(28);

{
    Var x = ds.args()[0];
    Var y = ds.args()[1];
    RVar reduce$x(ds.update(0).get_schedule().rvars()[0].var);
    RVar reduce$y(ds.update(0).get_schedule().rvars()[1].var);
    ds
        .reorder(x, y)
        .compute_at(ds_1, x);
    ds.update(0)
        .reorder(reduce$x, reduce$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$x)
        .unroll(reduce$y);
}
{
    Var x = ds_1.args()[0];
    Var y = ds_1.args()[1];
    RVar reduce$1$x(ds_1.update(0).get_schedule().rvars()[0].var);
    RVar reduce$1$y(ds_1.update(0).get_schedule().rvars()[1].var);
    ds_1
        .reorder(x, y)
        .compute_at(ds_2, x);
    ds_1.update(0)
        .reorder(reduce$1$x, reduce$1$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$1$x)
        .unroll(reduce$1$y);
}
{
    Var x = ds_2.args()[0];
    Var y = ds_2.args()[1];
    RVar reduce$2$x(ds_2.update(0).get_schedule().rvars()[0].var);
    RVar reduce$2$y(ds_2.update(0).get_schedule().rvars()[1].var);
    ds_2
        .reorder(x, y)
        .compute_at(ds_3, x);
    ds_2.update(0)
        .reorder(reduce$2$x, reduce$2$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$2$x)
        .unroll(reduce$2$y);
}
{
    Var x = ds_3.args()[0];
    Var y = ds_3.args()[1];
    RVar reduce$3$x(ds_3.update(0).get_schedule().rvars()[0].var);
    RVar reduce$3$y(ds_3.update(0).get_schedule().rvars()[1].var);
    ds_3
        .reorder(x, y)
        .compute_at(ds_4, x);
    ds_3.update(0)
        .reorder(reduce$3$x, reduce$3$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$3$x)
        .unroll(reduce$3$y);
}
{
    Var x = ds_4.args()[0];
    Var y = ds_4.args()[1];
    RVar reduce$4$x(ds_4.update(0).get_schedule().rvars()[0].var);
    RVar reduce$4$y(ds_4.update(0).get_schedule().rvars()[1].var);
    ds_4
        .reorder(x, y)
        .compute_at(ds_5, x);
    ds_4.update(0)
        .reorder(reduce$4$x, reduce$4$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$4$x)
        .unroll(reduce$4$y);
}
{
    Var x = ds_5.args()[0];
    Var y = ds_5.args()[1];
    RVar reduce$5$x(ds_5.update(0).get_schedule().rvars()[0].var);
    RVar reduce$5$y(ds_5.update(0).get_schedule().rvars()[1].var);
    ds_5
        .reorder(x, y)
        .compute_at(ds_6, x);
    ds_5.update(0)
        .reorder(reduce$5$x, reduce$5$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$5$x)
        .unroll(reduce$5$y);
}
{
    Var x = ds_6.args()[0];
    Var y = ds_6.args()[1];
    RVar reduce$6$x(ds_6.update(0).get_schedule().rvars()[0].var);
    RVar reduce$6$y(ds_6.update(0).get_schedule().rvars()[1].var);
    ds_6
        .reorder(x, y)
        .compute_at(output, x_o)
        .gpu_threads(x)
        .gpu_threads(y);
    ds_6.update(0)
        .reorder(reduce$6$x, reduce$6$y, x, y)
        .gpu_threads(x)
        .gpu_threads(y)
        .unroll(reduce$6$x)
        .unroll(reduce$6$y);
}
{
    Var x = output.args()[0];
    Var y = output.args()[1];
    output
        .compute_root()
        .split(x, x_o, x_i, 8)
        .split(y, y_o, y_i, 4)
        .reorder(x_i, y_i, x_o, y_o)
        .gpu_threads(x_i)
        .gpu_threads(y_i)
        .gpu_blocks(y_o)
        .gpu_blocks(x_o);
}


TOTAL INLINES 7
Error at ../../distrib/tools/GenGen.cpp:4:
Invalid schedule: Loop over ds$5.s1.y.__thread_id_y cannot be inside of loop over ds$6.s1.x.__thread_id_x
Aborted (core dumped)
Makefile:29: recipe for target 'bin/gausspyramid_auto_schedule_store.a' failed
make: *** [bin/gausspyramid_auto_schedule_store.a] Error 134

Any idea what I'm doing wrong here?

In case it helps, I've noticed that replacing this code:
https://github.com/dillonhuff/HalideAutoGPU/blob/c9575bc42575303af62111d691db7e56c6b74a77/TACO_Benchmarks/gausspyramid/gausspyramid_generator.cpp#L31-L38

with this code:
https://github.com/dillonhuff/HalideAutoGPU/blob/c9575bc42575303af62111d691db7e56c6b74a77/TACO_Benchmarks/gausspyramid/gausspyramid_generator.cpp#L27-L29

seems to remove the error.

Thanks!
