
mlir-air's Issues

Multi Core DMA Matrix Scalar Add Example Fails

From branch debugging_matrix_scalar_add, I am working on getting my example multi_core_dma working (this file). The single core version works in this branch, but when I increase the herd size from 1x1 to 2x2, the example no longer works.

To be specific, to replicate, run:

cd programming_examples/matrix_scalar_add/multi_core_dma
make

When I inspect programming_examples/matrix_scalar_add/multi_core_dma/build/air_project/npu.air.mlir, it looks like all of the data is going to only one core instead of being distributed across all of the cores.

      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 2 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 3 : i64, metadata = @airMemcpyId9} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 4 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 5 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 6 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 7 : i64, metadata = @airMemcpyId10} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}

Multi Core Channel Matrix Scalar Add Example Fails

From branch debugging_matrix_scalar_add, I am working on getting my example multi_core_channel working (this file). The single core version works in this branch, but when I increase the herd size from 1x1 to 2x2, the example no longer works.

To be specific, to replicate, run:

cd programming_examples/matrix_scalar_add/multi_core_channel
make

When I inspect programming_examples/matrix_scalar_add/multi_core_channel/build/air_project/npu.air.mlir, it looks like all of the data is going to only one core instead of being distributed across all of the cores.

      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 16][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 0 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 1 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}

Missing `test_library.h` building runtime library for target aarch64

Description

AIR reports a missing header when building the runtime library for target aarch64. The test_library.h header is available under /runtime_lib/aarch64/test_lib/include, but for some reason the compiler only looks in /runtime_lib/x86_64/test_lib/include. I was able to work around this by setting CPLUS_INCLUDE_PATH, but there is probably a hard-coded path pointing to x86_64.

Tool commit points

  • MLIR-AIR @c3a9b505f06936a3e4c81c221ca9fac2a7d6dbad
  • MLIR-AIE @d21ca563e0c0fd100a4bbd98d194e770ce33bd79

Repeat this issue

Compile AIR with runtime lib target aarch64:

cmake .. \
    -GNinja \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_INSTALL_PREFIX="../${INSTALL_DIR}" \
    -DArch=arm64 \
    -DgccVer=10.2.0 \
    -DCMAKE_USE_TOOLCHAIN=FALSE \
    -DCMAKE_USE_TOOLCHAIN_AIRHOST=TRUE \
    -DLLVM_DIR=${LLVM_DIR}/build/lib/cmake/llvm \
    -DMLIR_DIR=${LLVM_DIR}/build/lib/cmake/mlir \
    -DAIE_DIR=${MLIR_AIE_DIR}/build/lib/cmake/aie \
    -Dpybind11_DIR=${PYTHON_ROOT}/pybind11/share/cmake/pybind11 \
    -DAIR_RUNTIME_TARGETS:STRING="aarch64" \
    -Daarch64_TOOLCHAIN_FILE=/home/niansong/mlir-air/cmake/modules/toolchain_aarch64.cmake \
    -DBUILD_SHARED_LIBS=OFF \
    -DLLVM_USE_LINKER=lld \
    -DXILINX_XAIE_INCLUDE_DIR=/home/niansong/mlir-air/install/runtime_lib/aarch64/xaiengine/include \
    -DXILINX_XAIE_LIBS=/home/niansong/mlir-air/install/runtime_lib/aarch64/xaiengine/lib \
    -DCMAKE_MODULE_PATH=${CMAKEMODULES_DIR}/ \
    |& tee cmake.log

During the build there's an error message:

1 error generated.
[21/30] Building CXX object airhost/CMakeFiles/airhost.dir/queue.cpp.o
FAILED: airhost/CMakeFiles/airhost.dir/queue.cpp.o
/usr/bin/clang++-10 --sysroot=/group/xrlabs/platforms/vck190-pynq-v2.7/sysroot -DLIBXAIENGINEV2 -I/proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/runtime_lib/airhost/include -I/proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/utils/mlir-aie/include -I/proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/utils/mlir-aie/build/include/../runtime_lib/x86_64/test_lib/include -I/group/xrlabs/platforms/vck190-pynq-v2.7/sysroot/opt/xaienginev2/include --sysroot=/group/xrlabs/platforms/vck190-pynq-v2.7/sysroot --target=aarch64-linux-gnu --gcc-toolchain=/group/xrlabs/platforms/vck190-pynq-v2.7/sysroot/usr -fuse-ld=lld-10 -Wno-unused-command-line-argument -std=gnu++14 -fPIC -MD -MT airhost/CMakeFiles/airhost.dir/queue.cpp.o -MF airhost/CMakeFiles/airhost.dir/queue.cpp.o.d -o airhost/CMakeFiles/airhost.dir/queue.cpp.o -c /proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/runtime_lib/airhost/queue.cpp
In file included from /proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/runtime_lib/airhost/queue.cpp:20:
/proj/rdi/staff/erweiw/niansong_end_of_internship/mlir-air/runtime_lib/airhost/include/air_host_impl.h:11:10: fatal error: 'test_library.h' file not found
#include "test_library.h"
         ^~~~~~~~~~~~~~~~
1 error generated.

Confirm that verifiers are in place to verify allocations to L1 / L2 memory are only made inside Herds / Segments

air.herds are intentionally isolated from above to allow parallel compilation.

  • Memory allocations to L1 memory should only be made within a herd; those allocations must not be yielded outside of a herd (that would break the execution model).
  • Memory allocations to L2 memory should only be made within a segment; those allocations must not be yielded outside of a segment (that would break the execution model).

Can we check that verifiers enforce this behaviour, and that the documentation communicates it?

github runner sets max locked memory too low for some tests

Some RyzenAI self-hosted runners fail on tests using large buffers with an allocation error:

[XRT] ERROR: Failed to allocate host memory buffer (mmap(len=16777216, prot=3, flags=8193, offset=4294967296) failed (err=11): Resource temporarily unavailable), make sure host bank is enabled (see xbutil configure --host-mem)
terminate called after throwing an instance of 'xrt_core::system_error'
  what():  mmap(len=16777216, prot=3, flags=8193, offset=4294967296) failed (err=11): Resource temporarily unavailable

but the same error does not occur when running the tests manually, even on the same machine with the same binaries.

The problem is that the "max locked memory" limit is too low when running the tests under the github runner. As an example, on one github runner machine the limit seen by a user is:

$ ulimit -l
3694888

But the value reported in a workflow is 8192. Buffer allocation in the driver may fail as a result, with mmap returning errno 11, "Resource temporarily unavailable". The fix is to increase the limit in the workflow script. Because a normal user is not able to increase the limit, one workaround is to add the following to the workflow script:

sudo prlimit -lunlimited --pid $$

with a corresponding line in the sudoers file to allow the command, e.g.:

%github ALL=(ALL) NOPASSWD: /usr/bin/prlimit *
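
Because this failure mode is confusing, it may also be worth guarding against it at the start of the test suite. Below is a minimal Python sketch of such a check (the 16 MiB threshold is only an example, taken from the mmap error above):

# Minimal sketch: fail fast if the locked-memory limit is too low for the
# test buffers, instead of hitting an obscure mmap() error inside XRT.
import resource

REQUIRED_BYTES = 16 * 1024 * 1024  # example value, from mmap(len=16777216) above

soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)  # reported in bytes
if soft != resource.RLIM_INFINITY and soft < REQUIRED_BYTES:
    raise SystemExit(
        f"RLIMIT_MEMLOCK soft limit is {soft} bytes, need {REQUIRED_BYTES}; "
        "raise it in the workflow (note: `ulimit -l` reports KiB, not bytes)"
    )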

x86_64-petalinux-linux/bin/ld: cannot find -lgcc_s

When I run make petalinux_build, I encounter this problem:

make[1]: Entering directory '/home/pynq/projects/mlir-air/platforms/xilinx_vck190_air/petalinux/build/tmp/work/versal_generic-xilinx-linux/linux-xlnx/5.10+gitAUTOINC+568989d441-r0/linux-versal_generic-standard-build'
GEN Makefile
HOSTCC scripts/basic/fixdep
/opt/petalinux/2021.2/components/yocto/buildtools_extended/sysroots/x86_64-petalinux-linux/usr/bin/../lib/gcc/x86_64-petalinux-linux/10.2.0/../../../../x86_64-petalinux-linux/bin/ld: cannot find -lgcc_s

I'm not sure how to solve this. Which library should I use, aarch64 or x86_64? Thank you!

dispatch_packet_t': expected expression

Hello everyone,

When building mlir-air-pcie I get the following error while compiling mlir-air/runtime_lib/airhost/memory.cpp:

/people/gioi152/src/tools/xilinx/mlir-air/runtime_lib/airhost/memory.cpp:270:41: error: unexpected type name 'dispatch_packet_t': expected expression
      uint64_t signal_offset = offsetof(dispatch_packet_t, completion_signal);
                                        ^
/people/gioi152/src/tools/xilinx/mlir-air/runtime_lib/airhost/memory.cpp:270:60: error: use of undeclared identifier 'completion_signal'
      uint64_t signal_offset = offsetof(dispatch_packet_t, completion_signal);
                                                           ^
/people/gioi152/src/tools/xilinx/mlir-air/runtime_lib/airhost/memory.cpp:327:41: error: unexpected type name 'dispatch_packet_t': expected expression
      uint64_t signal_offset = offsetof(dispatch_packet_t, completion_signal);
                                        ^
/people/gioi152/src/tools/xilinx/mlir-air/runtime_lib/airhost/memory.cpp:327:60: error: use of undeclared identifier 'completion_signal'
      uint64_t signal_offset = offsetof(dispatch_packet_t, completion_signal);
                                                           ^
4 errors generated.

Am I missing any header?

Also, not a big problem, but I need to pass a fourth parameter to build-mlir-air-pcie.sh (the location of the libXAIE just built and installed) instead of three (as shown here), or cmake won't be able to find it.

Access memory allocated in segment in the herd

I'm trying to make an example where I allocate a buffer in the segment (in L2 memory) and then use it as a target for the dma_memcpy_nd in the herd within that segment. I have written an example for that in a branch called alloc-check-example.

In the current form of the code, where I try to send the allocated memory to the herd through an operand/argument, I get the following error:

  File "mlir-air/install-xrt/python/air/dialects/_air_ops_ext.py", line 105, in <listcomp>
    operand_types = [s.type for s in sizes] * 2 + [o.type for o in operands]
AttributeError: 'AllocOp' object has no attribute 'type'
make: *** [Makefile:9: run] Error 1

If I don't try to send it from the segment to the herd as a herd operand, I get this error instead:

air._mlir_libs._site_initialize.<locals>.MLIRError: Unable to parse module assembly:
error: "-":21:11: 'air.dma_memcpy_nd' op using value defined outside the region
 note: "-":21:11: see current operation: "air.dma_memcpy_nd"(<<UNKNOWN SSA VALUE>>, %arg4, %2, %3, %4, %5, %6, %7) <{operandSegmentSizes = array<i32: 0, 1, 0, 0, 0, 1, 2, 2, 2>}> : (memref<16x8xi32, 1 : i32>, memref<32x16xi32>, index, index, index, index, index, index) -> ()
 note: "-":11:9: required by region isolation constraints
make: *** [Makefile:9: run] Error 1

I believe one of these two things should work, but I'm not sure which one (or both).
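
For the first error, one thing that may be worth trying (an untested sketch; it assumes the herd builder expects MLIR Values, which carry a `.type` attribute, while an `AllocOp` object does not) is to pass the alloc's result value, rather than the op object, as the herd operand:

# Hypothetical sketch: hand the herd the SSA value produced by the alloc,
# not the AllocOp object itself; a Value has the `.type` attribute that the
# list comprehension in _air_ops_ext.py appears to expect.
# `l2_memref_type` is a placeholder for the L2 memref type in the example.
from air.dialects.memref import AllocOp

buf = AllocOp(l2_memref_type, [], [])  # L2 buffer created in the segment
herd_operands = [buf.result]           # a Value, which has `.type`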

Limit on ChannelGet/ChannelPut operations?

I was working on #642 (which is not ready to merge) when I started to see some behavior I didn't understand, so I tried to make a minimal example, which likewise shows some unexpected behavior.

The example I came up with uses somewhat ridiculous things, like tiny 1x1 data tiles, to make it really obvious what is going on in the output. I found that my example fails for a 32x16 image and even an 8x8 image, but it succeeds for a 4x4 image.

My example is here. It's not supposed to pass the python test harness; I've just been looking at the output to see what it's doing.

As usual, you can run with:

make clean && make

When it fails, it seems like no output is received (the output keeps its original value of 0xFFFFFFFF in the test harness). When it succeeds, each value in the output image increases by 2, e.g. for the 4x4 image:

0000 0002 0004 0006 
0008 000a 000c 000e 
0010 0012 0014 0016 
0018 001a 001c 001e
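
For reference, that successful 4x4 output is just twice the row-major element index, which a test could check with numpy (a small sketch, assuming the result is read back as a 4x4 array named `output`):

# Small sketch: the passing 4x4 output above equals 2 * (row-major index).
import numpy as np

expected = 2 * np.arange(16, dtype=np.uint32).reshape(4, 4)
# assert np.array_equal(output, expected)  # `output` read back from the device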

Because it only works for small images, I'm guessing I'm running into a limit on the number of channel operations/copies allowed at some point, but I have not yet confirmed this theory.

In submitting this issue I'm hoping to discover:

  • Is this expected behavior when using many ChannelGet/ChannelPut ops? If so, what is the limit, and is there a way to catch it before a programmer runs into trouble?
  • If it's not expected behavior, maybe a bug fix?

AIR lowering pipeline failed for mmult with l2 tile size 64,64,64, l1 tile size 32,32,32

Description

This issue happens when calling aircc.py to lower AIR IR generated from linalg.

Tool commit points

  • MLIR-AIR @c3a9b505f06936a3e4c81c221ca9fac2a7d6dbad
  • MLIR-AIE @d21ca563e0c0fd100a4bbd98d194e770ce33bd79

Repeat the issue

Input: air.mlir

module {
  func.func @forward(%arg0: memref<128x128xi32>, %arg1: memref<128x128xi32>, %arg2: memref<128x128xi32>) {
    linalg.matmul ins(%arg0, %arg1 : memref<128x128xi32>, memref<128x128xi32>) outs(%arg2 : memref<128x128xi32>)
    return
  }
}

Compilation command:

air-opt air.mlir \
	-o air.opt.mlir \
	-buffer-results-to-out-params \
	-air-linalg-codegen='l2-tile-size=64,64,64 l2-promote=true l1-tile-size=32,32,32 l1-promote=true' \
	-air-par-to-herd \
	-air-copy-to-dma \
	-canonicalize -cse

aircc.py \
    -row-offset=3 \
    -col-offset=5 \
    ./air.opt.mlir \
    -o air.mlir.a \
    --host-target=aarch64-linux-gnu \
    --sysroot=${SYSROOT}

Error message:

loc("-":28:11): error: block with no terminator, has 
"scf.for"(%1, %2, %3) ({
^bb0(%arg7: index):
  "scf.for"(%1, %3, %0) ({
  ^bb0(%arg8: index):
    "scf.for"(%1, %3, %0) ({
    ^bb0(%arg9: index):
      "scf.for"(%1, %3, %0) ({
      ^bb0(%arg10: index):
        %4 = "memref.load"(<<UNKNOWN SSA VALUE>>, %arg8, %arg10) : (memref<32x32xi32, 2>, index, index) -> i32
        %5 = "memref.load"(<<UNKNOWN SSA VALUE>>, %arg10, %arg9) : (memref<32x32xi32, 2>, index, index) -> i32
        %6 = "memref.load"(<<UNKNOWN SSA VALUE>>, %arg8, %arg9) : (memref<32x32xi32, 2>, index, index) -> i32
        %7 = "arith.muli"(%4, %5) : (i32, i32) -> i32
        %8 = "arith.addi"(%6, %7) : (i32, i32) -> i32
        "memref.store"(%8, <<UNKNOWN SSA VALUE>>, %arg8, %arg9) : (i32, memref<32x32xi32, 2>, index, index) -> ()
        "scf.yield"() : () -> ()
      }) : (index, index, index) -> ()
      "scf.yield"() : () -> ()
    }) : (index, index, index) -> ()
    "scf.yield"() : () -> ()
  }) : (index, index, index) -> ()
  "memref.dealloc"(<<UNKNOWN SSA VALUE>>) : (memref<32x32xi32, 2>) -> ()
  "memref.dealloc"(<<UNKNOWN SSA VALUE>>) : (memref<32x32xi32, 2>) -> ()
  "memref.dealloc"(<<UNKNOWN SSA VALUE>>) : (memref<32x32xi32, 2>) -> ()
  "scf.yield"() : () -> ()
}) : (index, index, index) -> ()
Traceback (most recent call last):
  File "/home/niansong/mlir-air/install/bin/aircc.py", line 13, in <module>
    main()
  File "/home/niansong/mlir-air/install/python/air/compiler/aircc/main.py", line 316, in main
    run(module)
  File "/home/niansong/mlir-air/install/python/air/compiler/aircc/main.py", line 138, in run
    run_passes('builtin.module('+pass_pipeline+')', air_to_aie_module, opts,
  File "/home/niansong/mlir-air/install/python/air/compiler/aircc/main.py", line 77, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module)
RuntimeError: Failure while executing pass pipeline.

VCK190 pynq platform build failure

Hi, I am not sure if this is me or a real build issue. Can you give some advice on the error below?

This occurs toward the end of 'make pynq'; all stages up to this point seem to work correctly. Here is where it stops:

Platform created:
./platform_repo/xilinx_vck190_air/export/xilinx_vck190_air/xilinx_vck190_air.xpfm
make[1]: *** No rule to make target '../petalinux/images/linux/sdk.sh', needed by 'prep_sysroot'. Stop.
make[1]: Leaving directory '/enc/sandbox/mlir-air/platforms/xilinx_vck190_air/aie_platform'
make: *** [Makefile:42: platform] Error 2

I am using Ubuntu 20 LTS & v2021.2 tools as requested.

I am going through the makefile to see if I can figure it out. Will update if I see why.

Worker-to-Worker Channel Example

This is part of my effort to write examples using channels in a variety of ways (#648).

I've been having some issues with getting worker-to-worker (core-to-core within a herd) data movement with channels to work. It's quite possible I just have a bug in my own code that I have not yet found, or that my own assumptions aren't accurate. If someone could take a look to see if it's reasonable, and maybe look into a fix if it's not my bug, that would be great!

My example code is here. You can recreate the issue on the worker2worker branch by:

cd programming_examples/channel_examples/worker_to_worker
make

Failure to build since rocm/hsa addition

Before the patch #367, I could build mlir-air without any rocm/hsa. Now, my build fails with

/home/jamesn/labs/mlir-air/python/../runtime_lib/airhost/include/air_queue.h:12:10: fatal error: 'hsa/hsa.h' file not found
#include "hsa/hsa.h"

I don't want to build the runtime; I am only using the compiler (the mlir subdirectory). Can we have an option to build just the compiler?

Herd parameters allow general behaviour, but current lowering does not support this.

Current behaviour allows general parameters to be passed into the herd (herds are isolated from above), but we don't have a path to lowering these at the moment.

AIE RTPs are one way of lowering these.

The proposed AIE logical dialect would also support core.tasks with general parameters (which could be lowered from herd parameters).

  • We should add a verifier that indicates that the lowering is not currently complete (rejecting code with extra arguments).
  • We should add a lowering path to enable scalar parameters to be passed into the herd.

[xilinx_vck5000_air] make error: "exceeds the maximum 16G DDR capacity" of AXI NOC IP

Hi,

I'm trying to build the AIR platform for the VCK5000 with Vivado 2022.1. I'm hitting the following error when running make all under mlir-air/platforms/xilinx_vck5000_air:

ERROR: [IP_Flow 19-5481] Logical NOC instance '/axi_noc_0' has a total of '32G' DDR memory assigned, this exceeds the maximum '16G' DDR capacity for this IP. Please review your NOC DDR memory configuration.

I'm not sure what I'm missing. I'm using the VCK5000 v1.0 board files. Please find the attached log file for reference.

Thanks!

vivado.log

Multiple launches, herd with single core

I'm working on an example where I have a 2D matrix of input data, which I break into four data tiles, and then I attempt to process one data tile per compute tile in a variety of ways using AIR constructs. I am sanity-checking my programs by making each compute core add a unique tile_num to each value in the data tile it modifies, so I can reassure myself that the compute tile I think is doing the work is actually the compute tile doing the work.

Anyways, I am trying to compose an example of this scenario that uses four launches, where the herd size is 1x1. My first attempt is here, where I have a while loop within the herd, because I hear the kernel will be persistent across launches.

Anyways, even with that persistence, I'd like to somehow parameterize the herd with the launch indices so I can calculate a unique tile_num per launch (see the sketch below). Is this something that is possible to do? If not, how do I reassure myself that one data tile is being processed per launch?
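
Since the launch induction variables are ordinary index values, one approach might be to thread them down to the herd as operands. The sketch below is hypothetical and untested: the launch/segment/herd decorators are the ones used in the programming examples, arg0/arg1 stand for the enclosing function's memref arguments, and the body-callback argument order (indices, then sizes, then operands) is an assumption:

# Hypothetical sketch: pass the launch indices down to the herd as operands,
# so each 1x1 herd can compute a unique tile_num per launch.
@launch(sizes=[2, 2], operands=[arg0, arg1])
def launch_body(lx, ly, lsx, lsy, a, b):
    @segment(name="seg", operands=[lx, ly, a, b])
    def segment_body(seg_lx, seg_ly, c, d):
        @herd(name="addherd", sizes=[1, 1], operands=[seg_lx, seg_ly, c, d])
        def herd_body(tx, ty, sx, sy, launch_x, launch_y, e, f):
            ...  # tile_num could be derived from launch_x / launch_y here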

`air-dependency` pass failed on mmult code generated from Triton

Description

It seems like there's an empty vector in the ::AIRDependency::createPartialMemref function that caused this issue.

Tool commit points

  • MLIR-AIR @c3a9b505f06936a3e4c81c221ca9fac2a7d6dbad
  • MLIR-AIE @d21ca563e0c0fd100a4bbd98d194e770ce33bd79

Repeat this issue

Input: mmult.triton.air.mlir

#map = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @matmul_kernel(%arg0: memref<*xi32>, %arg1: memref<*xi32>, %arg2: memref<*xi32>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32, %arg10: i32, %arg11: i32, %arg12: i32, %arg13: i32, %arg14: i32) {
    %c0_i32 = arith.constant 0 : i32
    %c128_i32 = arith.constant 128 : i32
    %c128 = arith.constant 128 : index
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<128x128xi32>
    linalg.fill ins(%c0_i32 : i32) outs(%alloc : memref<128x128xi32>)
    %c1_i32 = arith.constant 1 : i32
    %c0_i32_0 = arith.constant 0 : i32
    %c-1_i32 = arith.constant -1 : i32
    %0 = arith.cmpi sgt, %c128_i32, %c0_i32_0 : i32
    %1 = arith.select %0, %c-1_i32, %c1_i32 : i32
    %2 = arith.addi %1, %arg4 : i32
    %3 = arith.divsi %2, %c128_i32 : i32
    %4 = arith.addi %c1_i32, %3 : i32
    %5 = arith.subi %c0_i32_0, %arg4 : i32
    %6 = arith.divsi %5, %c128_i32 : i32
    %7 = arith.subi %c0_i32_0, %6 : i32
    %8 = arith.cmpi slt, %arg4, %c0_i32_0 : i32
    %9 = arith.cmpi sgt, %arg4, %c0_i32_0 : i32
    %10 = arith.cmpi slt, %c128_i32, %c0_i32_0 : i32
    %11 = arith.cmpi sgt, %c128_i32, %c0_i32_0 : i32
    %12 = arith.andi %8, %10 : i1
    %13 = arith.andi %9, %11 : i1
    %14 = arith.ori %12, %13 : i1
    %15 = arith.select %14, %4, %7 : i32
    %c1_i32_1 = arith.constant 1 : i32
    %c0_i32_2 = arith.constant 0 : i32
    %c-1_i32_3 = arith.constant -1 : i32
    %16 = arith.cmpi slt, %15, %c0_i32_2 : i32
    %17 = arith.select %16, %c1_i32_1, %c-1_i32_3 : i32
    %18 = arith.subi %17, %arg12 : i32
    %19 = arith.divsi %18, %15 : i32
    %20 = arith.subi %c-1_i32_3, %19 : i32
    %21 = arith.divsi %arg12, %15 : i32
    %22 = arith.cmpi slt, %arg12, %c0_i32_2 : i32
    %23 = arith.cmpi sgt, %arg12, %c0_i32_2 : i32
    %24 = arith.cmpi slt, %15, %c0_i32_2 : i32
    %25 = arith.cmpi sgt, %15, %c0_i32_2 : i32
    %26 = arith.andi %22, %25 : i1
    %27 = arith.andi %23, %24 : i1
    %28 = arith.ori %26, %27 : i1
    %29 = arith.select %28, %20, %21 : i32
    %30 = arith.remsi %arg12, %15 : i32
    %31 = arith.muli %29, %c128_i32 : i32
    %32 = arith.muli %30, %c128_i32 : i32
    %33 = arith.index_cast %31 : i32 to index
    %34 = arith.index_cast %arg6 : i32 to index
    %35 = arith.muli %33, %34 : index
    %36 = arith.index_cast %arg7 : i32 to index
    %37 = arith.index_cast %arg8 : i32 to index
    %38 = arith.index_cast %32 : i32 to index
    %39 = arith.index_cast %arg9 : i32 to index
    %40 = arith.muli %38, %39 : index
    %reinterpret_cast = memref.reinterpret_cast %arg0 to offset: [%35], sizes: [128, 128], strides: [%34, %36] : memref<*xi32> to memref<128x128xi32, strided<[?, ?], offset: ?>>
    %reinterpret_cast_4 = memref.reinterpret_cast %arg1 to offset: [%40], sizes: [128, 128], strides: [%37, %39] : memref<*xi32> to memref<128x128xi32, strided<[?, ?], offset: ?>>
    %alloc_5 = memref.alloc() : memref<128x128xi32>
    %41 = arith.index_cast %arg5 : i32 to index
    %42 = arith.minsi %41, %c128 : index
    %subview = memref.subview %reinterpret_cast[0, 0] [128, %42] [1, 1] : memref<128x128xi32, strided<[?, ?], offset: ?>> to memref<128x?xi32, strided<[?, ?], offset: ?>>
    %subview_6 = memref.subview %alloc_5[0, 0] [128, %42] [1, 1] : memref<128x128xi32> to memref<128x?xi32, strided<[128, 1]>>
    %43 = arith.cmpi slt, %42, %c128 : index
    scf.if %43 {
      linalg.fill ins(%c0_i32 : i32) outs(%alloc_5 : memref<128x128xi32>)
    }
    linalg.copy {cast = #linalg.type_fn<cast_signed>} ins(%subview : memref<128x?xi32, strided<[?, ?], offset: ?>>) outs(%subview_6 : memref<128x?xi32, strided<[128, 1]>>)
    %alloc_7 = memref.alloc() : memref<128x128xi32>
    %44 = arith.index_cast %arg5 : i32 to index
    %45 = arith.minsi %44, %c128 : index
    %subview_8 = memref.subview %reinterpret_cast_4[0, 0] [%45, 128] [1, 1] : memref<128x128xi32, strided<[?, ?], offset: ?>> to memref<?x128xi32, strided<[?, ?], offset: ?>>
    %subview_9 = memref.subview %alloc_7[0, 0] [%45, 128] [1, 1] : memref<128x128xi32> to memref<?x128xi32, strided<[128, 1]>>
    %46 = arith.cmpi slt, %45, %c128 : index
    scf.if %46 {
      linalg.fill ins(%c0_i32 : i32) outs(%alloc_7 : memref<128x128xi32>)
    }
    linalg.copy {cast = #linalg.type_fn<cast_signed>} ins(%subview_8 : memref<?x128xi32, strided<[?, ?], offset: ?>>) outs(%subview_9 : memref<?x128xi32, strided<[128, 1]>>)
    %alloc_10 = memref.alloc() {alignment = 64 : i64} : memref<128x128xi32>
    %alloc_11 = memref.alloc() {alignment = 64 : i64} : memref<128x128xi32>
    memref.copy %alloc_10, %alloc_11 : memref<128x128xi32> to memref<128x128xi32>
    memref.dealloc %alloc_10 : memref<128x128xi32>
    linalg.matmul ins(%alloc_5, %alloc_7 : memref<128x128xi32>, memref<128x128xi32>) outs(%alloc_11 : memref<128x128xi32>)
    memref.dealloc %alloc_7 : memref<128x128xi32>
    memref.dealloc %alloc_5 : memref<128x128xi32>
    %alloc_12 = memref.alloc() {alignment = 64 : i64} : memref<128x128xi32>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc_11, %alloc : memref<128x128xi32>, memref<128x128xi32>) outs(%alloc_12 : memref<128x128xi32>) {
    ^bb0(%in: i32, %in_17: i32, %out: i32):
      %68 = arith.addi %in, %in_17 : i32
      linalg.yield %68 : i32
    }
    memref.dealloc %alloc_11 : memref<128x128xi32>
    %alloc_13 = memref.alloc() {alignment = 64 : i64} : memref<128x128xi32>
    linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel", "parallel"]} ins(%alloc, %alloc_12 : memref<128x128xi32>, memref<128x128xi32>) outs(%alloc_13 : memref<128x128xi32>) {
    ^bb0(%in: i32, %in_17: i32, %out: i32):
      %68 = arith.addi %in, %in_17 : i32
      linalg.yield %68 : i32
    }
    memref.dealloc %alloc_12 : memref<128x128xi32>
    memref.dealloc %alloc : memref<128x128xi32>
    %47 = arith.muli %29, %c128_i32 : i32
    %48 = arith.muli %30, %c128_i32 : i32
    %49 = arith.index_cast %arg10 : i32 to index
    %50 = arith.index_cast %47 : i32 to index
    %51 = arith.muli %50, %49 : index
    %52 = arith.index_cast %arg11 : i32 to index
    %53 = arith.index_cast %48 : i32 to index
    %54 = arith.muli %53, %52 : index
    %55 = arith.addi %51, %54 : index
    %reinterpret_cast_14 = memref.reinterpret_cast %arg2 to offset: [%55], sizes: [128, 128], strides: [%49, %52] : memref<*xi32> to memref<128x128xi32, strided<[?, ?], offset: ?>>
    %56 = arith.index_cast %47 : i32 to index
    %57 = arith.addi %56, %c128 : index
    %58 = arith.index_cast %arg3 : i32 to index
    %59 = arith.minsi %57, %58 : index
    %60 = arith.subi %59, %56 : index
    %61 = arith.index_cast %48 : i32 to index
    %62 = arith.addi %61, %c128 : index
    %63 = arith.index_cast %arg4 : i32 to index
    %64 = arith.minsi %62, %63 : index
    %65 = arith.subi %64, %61 : index
    %66 = arith.minsi %60, %c128 : index
    %67 = arith.minsi %65, %c128 : index
    %subview_15 = memref.subview %alloc_13[0, 0] [%66, %67] [1, 1] : memref<128x128xi32> to memref<?x?xi32, strided<[128, 1]>>
    %subview_16 = memref.subview %reinterpret_cast_14[0, 0] [%66, %67] [1, 1] : memref<128x128xi32, strided<[?, ?], offset: ?>> to memref<?x?xi32, strided<[?, ?], offset: ?>>
    %cast = memref.cast %subview_15 : memref<?x?xi32, strided<[128, 1]>> to memref<?x?xi32, strided<[?, ?], offset: ?>>
    linalg.copy {cast = #linalg.type_fn<cast_signed>} ins(%subview_15 : memref<?x?xi32, strided<[128, 1]>>) outs(%subview_16 : memref<?x?xi32, strided<[?, ?], offset: ?>>)
    memref.dealloc %alloc_13 : memref<128x128xi32>
    return
  }
  func.func @kernel(%arg0: memref<128x128xi32>, %arg1: memref<128x128xi32>, %arg2: memref<128x128xi32>, %arg3: i32, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: i32, %arg10: i32, %arg11: i32, %arg12: i32, %arg13: i32, %arg14: i32) {
    %cast = memref.cast %arg0 : memref<128x128xi32> to memref<*xi32>
    %cast_0 = memref.cast %arg1 : memref<128x128xi32> to memref<*xi32>
    %cast_1 = memref.cast %arg2 : memref<128x128xi32> to memref<*xi32>
    call @matmul_kernel(%cast, %cast_0, %cast_1, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14) : (memref<*xi32>, memref<*xi32>, memref<*xi32>, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32) -> ()
    return
  }
}

Compilation command:

air-opt mmult.triton.air.mlir \
    -buffer-results-to-out-params \
    -air-linalg-codegen \
    -air-par-to-herd \
    -air-copy-to-dma \
    -air-dependency \
    -canonicalize -cse \
    -o mmult.air.mlir

Error message and stack trace

air-opt: /home/niansong/mlir-air/llvm/llvm/include/llvm/ADT/SmallVector.h:294: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: air-opt mmult.triton.air.mlir -buffer-results-to-out-params -air-linalg-codegen -air-par-to-herd -air-copy-to-dma -air-dependency -canonicalize -cse -o mmult.air.mlir
 #0 0x000055cbc9a8c007 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/niansong/mlir-air/install/bin/air-opt+0x2854007)
 #1 0x000055cbc9a89e5e llvm::sys::RunSignalHandlers() (/home/niansong/mlir-air/install/bin/air-opt+0x2851e5e)
 #2 0x000055cbc9a8c80f SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f54362a1420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #4 0x00007f5435d3400b raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
 #5 0x00007f5435d13859 abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:81:7
 #6 0x00007f5435d13729 get_sysdep_segment_value /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:509:8
 #7 0x00007f5435d13729 _nl_load_domain /build/glibc-SzIz7B/glibc-2.31/intl/loadmsgcat.c:970:34
 #8 0x00007f5435d24fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)
 #9 0x000055cbc8049ac9 llvm::SmallVectorTemplateCommon<mlir::Value, void>::operator[](unsigned long) AIRDependencyScheduleOpt.cpp:0:0
#10 0x000055cbc815728d (anonymous namespace)::AIRDependency::createPartialMemref(mlir::Value, unsigned int, llvm::SmallVector<mlir::Value, 2u>) AIRDependency.cpp:0:0
#11 0x000055cbc815778c void (anonymous namespace)::AIRDependency::traceDeps<xilinx::air::ExecuteOp>(llvm::SmallVector<(anonymous namespace)::AIRDependency::partialMemref, 1u>, xilinx::air::ExecuteOp, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>) AIRDependency.cpp:0:0
#12 0x000055cbc81569e3 (anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)::operator()(mlir::Operation*) const AIRDependency.cpp:0:0
#13 0x000055cbc815470d void llvm::function_ref<void (mlir::Operation*)>::callback_fn<(anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)>(long, mlir::Operation*) AIRDependency.cpp:0:0
#14 0x000055cbc84d5dce mlir::detail::walk(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) (/home/niansong/mlir-air/install/bin/air-opt+0x129ddce)
#15 0x000055cbc81546b2 std::enable_if<llvm::is_one_of<mlir::Operation*, mlir::Operation*, mlir::Region*, mlir::Block*>::value, void>::type mlir::detail::walk<(mlir::WalkOrder)1, (anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*), mlir::Operation*, void>(mlir::Operation*, (anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)&&) AIRDependency.cpp:0:0
#16 0x000055cbc815465d std::enable_if<llvm::function_traits<std::decay<(anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)>::type>::num_args == 1, void>::type mlir::Operation::walk<(mlir::WalkOrder)1, (anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*), void>((anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)&&) AIRDependency.cpp:0:0
#17 0x000055cbc8148ba0 std::enable_if<llvm::function_traits<std::decay<(anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)>::type>::num_args == 1, void>::type mlir::OpState::walk<(mlir::WalkOrder)1, (anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*), void>((anonymous namespace)::AIRDependency::runOnOperation()::'lambda0'(mlir::Operation*)&&) AIRDependency.cpp:0:0
#18 0x000055cbc81472de (anonymous namespace)::AIRDependency::runOnOperation() AIRDependency.cpp:0:0
#19 0x000055cbc8388c9f mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/home/niansong/mlir-air/install/bin/air-opt+0x1150c9f)
#20 0x000055cbc83892c9 mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/home/niansong/mlir-air/install/bin/air-opt+0x11512c9)
#21 0x000055cbc838b446 mlir::PassManager::run(mlir::Operation*) (/home/niansong/mlir-air/install/bin/air-opt+0x1153446)
#22 0x000055cbc8385b86 performActions(llvm::raw_ostream&, bool, bool, std::shared_ptr<llvm::SourceMgr> const&, mlir::MLIRContext*, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, bool, bool) MlirOptMain.cpp:0:0
#23 0x000055cbc838585d mlir::LogicalResult llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>::callback_fn<mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, mlir::DialectRegistry&, bool, bool, bool, bool, bool, bool, bool)::$_0>(long, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&) MlirOptMain.cpp:0:0
#24 0x000055cbc840e4c8 mlir::splitAndProcessBuffer(std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::raw_ostream&)>, llvm::raw_ostream&, bool, bool) (/home/niansong/mlir-air/install/bin/air-opt+0x11d64c8)
#25 0x000055cbc8383dfe mlir::MlirOptMain(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer>>, llvm::function_ref<mlir::LogicalResult (mlir::PassManager&)>, mlir::DialectRegistry&, bool, bool, bool, bool, bool, bool, bool) (/home/niansong/mlir-air/install/bin/air-opt+0x114bdfe)
#26 0x000055cbc838429f mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&, bool) (/home/niansong/mlir-air/install/bin/air-opt+0x114c29f)
#27 0x000055cbc7fc1a3a main (/home/niansong/mlir-air/install/bin/air-opt+0xd89a3a)
#28 0x00007f5435d15083 __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:342:3
#29 0x000055cbc7fc173e _start (/home/niansong/mlir-air/install/bin/air-opt+0xd8973e)
./compile.sh: line 9: 643162 Aborted                 air-opt mmult.triton.air.mlir -buffer-results-to-out-params -air-linalg-codegen -air-par-to-herd -air-copy-to-dma -air-dependency -canonicalize -cse -o mmult.air.mlir

Launch MLIR code on VCK5000?

On the MLIR-AIE side, everything seems to be clear as long as it is run on a VCK190. I understood that to make it run on a VCK5000, MLIR-AIR is needed. I installed it as detailed here with the most recent ROCm runtime.

Following the aircc doc, there does not seem to be a clear path to getting MLIR code up and running on a VCK5000. I tried to play with the tests and a few examples, such as SPARTA, but I am not seeing that going anywhere.

Perhaps there is a guide or a tutorial, like the one for MLIR-AIE here, showing how to take MLIR code written in the AIE/AIR dialects and get it running on a VCK5000?

Need to replace DimTuple

I'm trying to update the iree-amd-aie project to the latest version of mlir-aie. That seems to have removed or renamed the DimTuple attr, which used to be defined here https://github.com/Xilinx/mlir-aie/blob/c0341aa3d525827a21f25b7423b18f4359a34cf3/include/aie/Dialect/AIE/IR/AIEAttrs.td but is still used in this project here:

std::vector<AIE::DimTupleAttr> dims =

std::vector<AIE::DimTupleAttr>

Toolchain file seems to need to be passed in twice in cmake?

With the docker image containers.xilinx.com/acdc/build:2.0, the current util build script, targeting x86, fails with:

CMake Error at utils/llvm/build/lib/cmake/llvm/HandleLLVMOptions.cmake:320 (message):
  Host compiler does not support '-fuse-ld=lld'
Call Stack (most recent call first):
  CMakeLists.txt:73 (include)

It seems that this issue can be solved by passing the toolchain file into cmake twice:

    -DCMAKE_TOOLCHAIN_FILE=`pwd`/../cmake/modules/toolchain_x86_64.cmake \
    -Dx86_64_TOOLCHAIN_FILE=`pwd`/../cmake/modules/toolchain_x86_64.cmake \

Python Multiple Segment Examples

I've written a couple of multi-segment examples as part of the programming examples generally (and specifically for channels #648) in this PR #663.

Right now, all 3 examples using 2 segments fail during compilation with a segfault. I have not looked further into the issue, but I hope to do some debugging myself next week.

Edit: I'm putting work on this on hold for a bit, if anyone wants to pick this up.

Worker-to-Self Channel Example

I am working on writing a worker-to-worker data transfer example for channels (as part of the grouping of examples that exercise various features of channels, #648).

Draft PR is here: #653

I am basing it off the code in the channel_size example (PR waiting to be merged here: #642)

The channel_size example works well for me. As an intermediate step to adding worker-to-worker communication to that example, I tried to have each worker send data to itself over a channel. That is the version of the code that is pushed in the draft PR #653. The particular file of interest is this one.

When I run with this intermediate step, I get the following error:

Using aiecc.py from:  /scratch/ehunhoff/mlir-air/mlir-aie/install/bin/..
Running: builtin.module(air-insert-launch-and-segment-around-herd,func.func(air-lower-herd-parallel),air-dma-to-channel,canonicalize,cse,air-specialize-channel-wrap-and-stride,func.func(air-renumber-dma),func.func(convert-linalg-to-loops),air-place-herds{num-rows=6 num-cols=4 row-anchor=2 col-anchor=0})
Running: builtin.module(air-to-aie{emit-while-loop=false row-offset=2 col-offset=0 device=npu1_4col})
python3: /scratch/ehunhoff/mlir-air/mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:956: void xilinx::air::simpleDMAChannelAllocation(std::vector<MemcpyBundleAsFlow> &, ShimDMAAllocator &, MemTileDMAAllocator &, TileDMAAllocator &): Assertion `core' failed.
Aborted (core dumped)
make: *** [Makefile:7: run] Error 134

My question is:

  • Is a worker allowed to put/get data to/from a channel to itself?
  • Or is this a bug (either in my example code or the air compiler)?

AllocOps and load/store ops in Launch and Segment

As part of the channel examples, and prompted by an interesting discussion on allocation, I wanted to see if I could explicitly allocate L3 memory in a launch and L2 memory in a segment. To this end, I wrote the programming_examples/channel_examples/hierarchical_alloc example in this PR: #661

Currently, it fails to fully compile with an error like this:

Traceback (most recent call last):
  File "mlir-air/programming_examples/channel_examples/hierarchical_alloc/run.py", line 86, in <module>
    test_main(build_module, verbose=args.verbose)
  File "mlir-air/programming_examples/channel_examples/hierarchical_alloc/run.py", line 45, in test_main
    addone = backend.compile_and_load(mlir_module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 222, in compile_and_load
    c = self.compile(module)
  File "mlir-air/install-xrt/python/air/backend/xrt.py", line 117, in compile
    aircc.run(air_module, aircc_options)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 449, in run
    run_passes(air_to_npu_passes, air_to_npu_module, opts, air_to_npu_file)
  File "mlir-air/install-xrt/python/air/compiler/aircc/main.py", line 113, in run_passes
    PassManager.parse(pass_pipeline).run(mlir_module.operation)
air._mlir_libs._site_initialize.<locals>.MLIRError: Failure while executing pass pipeline:
error: "-":145:20: failed to legalize operation 'airrt.alloc' marked as erased
 note: "-":145:20: see current operation: %16 = "airrt.alloc"() : () -> memref<1x4xi32, 1 : i32>
 note: "-":161:16: found live user of result #0: %5 = memref.load %2[%c0, %c0] : memref<1x4xi32, 1 : i32>
make: *** [Makefile:9: run] Error 1

I'm not fully confident this example is a reasonable thing to ask the air compiler to handle, but I think it might be. If it is not, let me know, and I will either change or erase the example!

Illegal Allocation Catch in Verifier

Recently, when I tried to do an L2 allocation from within a herd, I got a segfault. This makes sense, as I believe only the following allocations are legal:

  • L1 in herd
  • L2 in segment
  • L3 in launch??

It would be more user-friendly if any illegal allocations outside of the above were caught in a verifier.

Running Matrix Scalar Add Examples with `aircc --experimental-passes`

This is low priority, but ideally I would like to run the Matrix Scalar Add examples with the experimental_passes aircc.py option set. However, the experimental passes break both of the currently working examples, single_core_dma and single_core_channel. For single_core_dma, the output is wrong. For single_core_channel, there is a segfault.

To replicate, set experimental_passes=True in this file (on the minimal-matrix-scalar-add branch).

For single_core_dma:

cd programming_examples/matrix_scalar_add/single_core_dma
make clean
make

For single_core_channel:

cd programming_examples/matrix_scalar_add/single_core_channel
make clean
make

I investigated a bit which passes might be causing the problems; if I just comment out the first two of the experimental passes (defined here), both examples still work with the remaining passes:

    #"air-dependency",
    #"air-dependency-schedule-opt",

Single Core DMA/Channel Matrix Scalar Add Examples Broken

I'm working on my draft PR #621

I rebased my branch to master after #620 was merged in. However, after that rebase, the two examples that previously worked (single_core_dma and single_core_channel) no longer work.

You can replicate the working versions in branch debugging-matrix-scalar-add (which diverges from main at bcbfed5c instead of HEAD=45592176) with:

cd programming_examples/matrix_scalar_add/single_core_dma
make

and

cd programming_examples/matrix_scalar_add/single_core_channel
make

You can replicate the failing tests in branch minimal-matrix-scalar-add with the same commands.

For the single_core_dma example, the files aie.air.mlir and placed.air.mlir are identical between the passing/failing cases. The file npu.air.mlir has the following diff:

$ diff broken_single_core_dma_build/air_project/npu.air.mlir working_single_core_dma_build/air_project/npu.air.mlir 
63,64c63,64
<       aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 512][1, 1, 8, 16][0, 0, 32]) {id = 2 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
<       aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 528][1, 1, 8, 16][0, 0, 32]) {id = 3 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
---
>       aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 2 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
>       aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 3 : i64, metadata = @airMemcpyId3} : memref<32x16xi32>
67,68c67,68
<       aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 512][1, 1, 8, 16][0, 0, 32]) {id = 6 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
<       aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 528][1, 1, 8, 16][0, 0, 32]) {id = 7 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
---
>       aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 0][1, 1, 8, 16][0, 0, 32]) {id = 6 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>
>       aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 16, 16][1, 1, 8, 16][0, 0, 32]) {id = 7 : i64, metadata = @airMemcpyId4} : memref<32x16xi32>

The diff for the single_core_channel example is essentially the same.

Let me know if more information is needed!

Issue while building, -fno-rtti is passed although boost requires rtti

Hi, I have this weird issue while building the repo following the instructions on this page:

FAILED: mlir/lib/CAPI/CMakeFiles/obj.AIRCAPI.dir/Runner.cpp.o 
/net/media/scratch/fournier/llvm-install/llvm-15/bin/clang++ -DGTEST_HAS_RTTI=0 -D_DEBUG -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/net/media/scratch/fournier/llvm-for-mlir-aie/llvm/include -I/net/media/scratch/fournier/llvm-for-mlir-aie/build-Debug/include -I/net/media/scratch/fournier/llvm-for-mlir-aie/mlir/include -I/net/media/scratch/fournier/llvm-for-mlir-aie/build-Debug/tools/mlir/include -I/net/media/scratch/fournier/mlir-aie/include -I/net/media/scratch/fournier/mlir-aie/build/include -I/net/media/scratch/fournier/mlir-air/mlir/include -I/net/media/scratch/fournier/mlir-air/build/mlir/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -std=gnu++17   -D_DEBUG -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS  -fno-exceptions -fno-rtti -UNDEBUG -MD -MT mlir/lib/CAPI/CMakeFiles/obj.AIRCAPI.dir/Runner.cpp.o -MF mlir/lib/CAPI/CMakeFiles/obj.AIRCAPI.dir/Runner.cpp.o.d -o mlir/lib/CAPI/CMakeFiles/obj.AIRCAPI.dir/Runner.cpp.o -c /net/media/scratch/fournier/mlir-air/mlir/lib/CAPI/Runner.cpp
In file included from /net/media/scratch/fournier/mlir-air/mlir/lib/CAPI/Runner.cpp:11:
In file included from /net/media/scratch/fournier/mlir-air/mlir/include/air/Util/Runner.h:12:
In file included from /net/media/scratch/fournier/mlir-air/mlir/include/air/Util/Dependency.h:34:
In file included from /usr/include/boost/graph/graphviz.hpp:25:
/usr/include/boost/property_map/dynamic_property_map.hpp:150:28: error: use of typeid requires -frtti
    if (in_value.type() == typeid(value_type)) {
                           ^
/usr/include/boost/property_map/dynamic_property_map.hpp:191:56: error: use of typeid requires -frtti
  virtual const std::type_info& key()   const { return typeid(key_type); }
                                                       ^
/usr/include/boost/property_map/dynamic_property_map.hpp:192:56: error: use of typeid requires -frtti
  virtual const std::type_info& value() const { return typeid(value_type); }
                                                       ^
/usr/include/boost/property_map/dynamic_property_map.hpp:286:29: error: use of typeid requires -frtti
    if (i->second->key() == typeid(key)) {
                            ^
/usr/include/boost/property_map/dynamic_property_map.hpp:308:29: error: use of typeid requires -frtti
    if (i->second->key() == typeid(key))
                            ^
/usr/include/boost/property_map/dynamic_property_map.hpp:321:29: error: use of typeid requires -frtti
    if (i->second->key() == typeid(key))
                            ^
/usr/include/boost/property_map/dynamic_property_map.hpp:334:29: error: use of typeid requires -frtti
    if (i->second->key() == typeid(key))

As you can see, boost requires RTTI, but the compiler command line contains -fno-rtti. Do you have an idea what could be causing this? When I grep for rtti in the repo, I only find hits in the cmake files that are in sandbox, and they don't appear related... Thanks for any help.

Unrecognized architecture 'aie'

I have set up all the tools for the first time and can get pretty far. However, when I try to compile any of the examples, it complains about unrecognized architecture 'aie'.

What did I miss in my setup or is this a new bug?

I just cloned this repo earlier today; I have built up to commit 8ea962a. I am using Ubuntu 20.04 LTS & v2021.2 tools. I have it set to use the sysroot from the remnants of the pynq pre-build.

These are the make results for the beefmaker example project:

clang ../../../install/runtime_lib/test_library.cpp --target=aarch64-linux-gnu --sysroot=../../../platforms/xilinx_vck190_air/petalinux/sysroot/sysroots/cortexa72-cortexa53-xilinx-linux/ -g  -I../../../platforms/xilinx_vck190_air/petalinux/sysroot/sysroots/cortexa72-cortexa53-xilinx-linux//opt/xaienginev2/include -std=c++17 -I/sandbox-hdd/mlir-air/install/bin//../runtime_lib/airhost/include -I../../../install/runtime_lib -DAIR_LIBXAIE_ENABLE -DLIBXAIENGINEV2 -c -o test_library.o
xchesscc -p me -P /tools/Xilinx/Vitis/2021.2/aietools/data/cervino/lib -c chess/beefmaker_kernel.cc
aircc.py -o beefmaker.air.a --host-target=aarch64-linux-gnu -xbridge --sysroot=../../../platforms/xilinx_vck190_air/petalinux/sysroot/sysroots/cortexa72-cortexa53-xilinx-linux/ air.mlir
Compiling partitions: ['partition_0']
Found Vitis at /enc/tools/Xilinx/Vitis/2021.2
 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00 0/1 1 Workeropt: unrecognized architecture 'aie' provided.
Error encountered while running: opt --opaque-pointers=0 --passes=default<O2> -inline-threshold=10 -S air_project/partition_0/input.ll -o air_project/partition_0/input.opt.ll
 Error ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00 0/1 1 Worker
Error encountered while running: aiecc.py --sysroot ../../../platforms/xilinx_vck190_air/petalinux/sysroot/sysroots/cortexa72-cortexa53-xilinx-linux/ --host-target aarch64-linux-gnu --tmpdir air_project/partition_0 --aie-generate-xaiev2 --xbridge --no-xchesscc air_project/aiecc.partition_0.mlir
make: *** [Makefile:18: beefmaker.air.a] Error 1

Form Herds from Multiple scf.forall nests.

In order to support code generated from loop peeling, we'd like to look at forming herds from multiple scf.forall statements (rather than targeting just one), so that L1 allocations can stay local within a herd definition.

Channel Examples

Channels are a key abstraction of mlir-air, but there are few examples of how to use them. This issue is a place to discuss which examples are needed to show how channels work, and which of those examples are implemented.

Examples to Create

  • Worker2Worker (core to core within a herd) using L1 -> L1 data
  • Herd2Herd (core in herd0 to core in herd1 using L1 data) #636
  • Segment2Segment (#665)
    • Create basic example with two segments (something even simpler that doesn't use channels, because this is largely untested) (#663)
    • L1-L1 (worker to worker, where the workers are in herds in the different segments) (#663)
    • L2-L2 (from segment to segment directly)
  • Channel bundling (using sizes/indices) #642
  • Broadcast
    • Broadcast to multiple workers in same herd
    • Broadcast to multiple herds
  • Hierarchical (launch -> segment -> herd) (#661)

Discussion Topics

Is Launch2Launch Desirable?

  • Some trouble creating multiple launches, so this may be a little early in terms of creating a launch2launch channel communication (see issue #627)
  • What information would you need to schedule launches in order to do this?

Synchronous vs Asynchronous

  • It might one day be good to have some examples where the user explicitly sets async tokens on channel operations, but this capability isn't implemented yet in the air dependency pass.

Placement

  • An example doing something specific with channels based on placement of resources?

Build instructions clarification in docs

The instructions in https://xilinx.github.io/mlir-air/building.html appear to have a typo:

git clone https://github.com/stephenneuendorffer/aie-rt
cd aie-rt
git checkout phoenix_v2023.2
cd driver/src
make -f Makefile.Linux CFLAGS="-D__AIEAMDAIR__"
sudo cp -r ../include /opt/aiengine/
sudo cp libxaiengine.so* /opt/xaiengine/lib/
export LD_LIBRARY_PATH=/opt/xaiengine/lib:${LD_LIBRARY_PATH}

/opt/aiengine vs. /opt/xaiengine?

error: undefined reference due to --no-allow-shlib-undefined: AmdairBackend

Hi,
I've encountered the error stated in the title of this issue when I try to compile the test 13_mb_add_one. If I check the symbol table of libxaiengine.so, I see that AmdAirBackend is defined but AmdairBackend is undefined.

I built the libxaiengine library from https://github.com/stephenneuendorffer/aie-rt, branch phoenix_v2023.2, following the instructions in the mlir-air documentation: https://xilinx.github.io/mlir-air/building.html.

Edit: found the solution, there was a problem with the definition of a variable in the source code because of a name mismatch. Will post the fix as a pull request in the relevant repository soon.

CMakefile hard codes reference to clang/clang++ 12

Hello everyone!

In cmake/modules/toolchain_x86.cmake there is a hardcoded reference to LLVM 12:

# specify the compiler
set(CLANG_VER 12)
set(CMAKE_C_COMPILER clang-${CLANG_VER})
set(CMAKE_CXX_COMPILER clang++-${CLANG_VER})
set(CMAKE_ASM_COMPILER clang-${CLANG_VER})
set(CMAKE_STRIP llvm-strip-${CLANG_VER})
set(CLANG_LLD lld-${CLANG_VER} CACHE STRING "" FORCE)

However, the LLVM version downloaded might be different (LLVM 17 as I write), and then the build cannot find the compilers. One option is to update the clang version variable; however, clang++-17 doesn't exist (only clang++ and clang-17), so that doesn't work either. I resolved this temporarily by removing any reference to the CLANG version in the above variables. Do we need to check for a specific version of LLVM? Can we use the llvm-config of the LLVM we built?

Thanks,

Roberto

Update ODS to add 'name' member to 'xilinx::airrt::EventType' and other types

It seems that some ODS information needs to be updated because of recent LLVM changes. See also this related issue.

I tried to compile the nod.ai SHARK runtime with the iree-amd-aie plugin enabled. The latter uses this project. I got the following error:

In file included from /home/fharwath/wd/shark/SRT/third_party/llvm-project/mlir/include/mlir/IR/Types.h:12:
/home/fharwath/wd/shark/SRT/third_party/llvm-project/mlir/include/mlir/IR/TypeSupport.h:54:28: error: no member named 'name' in 'xilinx::airrt::EventType'
   54 |                         T::name);
      |                         ~~~^

This is caused by LLVM commit 3dbac2c007c1, "[mlir] Expose type and attribute names in the MLIRContext and abstract type/attr classes".

Assertion failure running air-to-aie pass

Running air-to-aie on a transform test, from the mlir-air/mlir/test directory:

 $ air-opt --air-to-aie  Transform/AIRDependency/matmul_parallel.mlir

Maybe this is not a sensible pass to run on this input, but I'm reporting it just in case you don't think users should be reaching assertion failures like this. Here's the dump:

/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:1045: std::vector::reference std::vector<int>::operator[](std::vector::size_type) [_Tp = int, _Alloc = std::allocator<int>]: Assertion '__n < this->size()' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ../../build/bin/air-opt --air-to-aie Transform/AIRDependency/matmul_parallel.mlir
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  air-opt   0x00005563621ba8f7

`XRTBackend` Implementation of `AirBackend` is Confusing

I was trying to use the compile() and load() methods of XRTBackend while doing some debugging recently. I realized the load() method takes a module: air.ir.Module as an argument, which is then never used.

This is confusing.

The abstract base class (AirBackend) is flexible, because we can specify a unique CompiledArtifact for XRTBackend. I think this confusion would be fixed if the CompiledArtifact for the XRTBackend were something like the following sketch:

from dataclasses import dataclass

@dataclass
class XRTArtifact:
    xclbin: str  # path to the generated xclbin
    insts: str   # path to the instruction file

e.g., the compiled artifacts are a pair of files (or file paths) pointing to the xclbin and instruction file.
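
To make the proposal concrete, here is a rough sketch of how the two methods might then fit together (hypothetical; the names come from the issue text, not from the current API):

# Hypothetical flow under the proposed change: compile() produces an
# XRTArtifact, and load() consumes that artifact instead of an unused module.
artifact = backend.compile(air_module)  # -> XRTArtifact(xclbin=..., insts=...)
invoker = backend.load(artifact)        # no air.ir.Module argument needed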

I'm happy to make a PR for this, if others think this is a reasonable change. If there is some history behind the current format that needs to be taken into account, I'm happy to hear it!

Error configuring elfutils

The AIRBIN script requires elfutils. When attempting to build it on one of the machines, we get the following error from configure regarding zstd:

./configure: line 7060: syntax error near unexpected token `ZSTD_COMPRESS,libzstd'
./configure: line 7060: `      PKG_CHECK_MODULES(ZSTD_COMPRESS,libzstd >= 1.4.0,'

The workaround was to copy a configuration file from a working machine. I want to note this here so we can look into it later and remember the temporary fix.
