
tpp-mlir's Introduction


A platform for making deep learning work everywhere.


To Our Users

First off, we’d like to thank you for choosing PlaidML. Whether you’re a new user or a multi-year veteran, we greatly appreciate you for the time you’ve spent tinkering around with our source code, sending us feedback, and improving our codebase. PlaidML would truly not be the same without you.

The feedback we have received from our users indicates an ever-increasing need for performance, programmability, and portability. During the past few months, we have been restructuring PlaidML to address those needs. Below is a summary of the biggest changes:

  • We’ve adopted MLIR, an extensible compiler infrastructure that has gained industry-wide adoption since its release in early 2019. MLIR makes it easier to integrate new software and hardware into our compiler stack, as well as making it easier to write optimizations for our compiler.
  • We’ve worked extensively on Stripe, our low-level intermediate representation within PlaidML. Stripe contains optimizations that greatly improve the performance of our compiler. While our work on Stripe began before we decided to use MLIR, we are in the process of fully integrating Stripe into MLIR.
  • We created our C++/Python embedded domain-specific language (EDSL) to improve the programmability of PlaidML.

Today, we're announcing a new branch of PlaidML: plaidml-v1. This will act as our development branch going forward and will allow us to more rapidly prototype the changes we're making without breaking our existing user base. As a precaution, please note that certain features, tests, and hardware targets may be broken in plaidml-v1, as it is a research project. Right now plaidml-v1 only supports Intel and AMD CPUs with AVX2 and AVX512 support.

You can continue to use code on the master branch or from our releases on PyPI. For your convenience, the contents of our master branch will be released as version 0.7.0. There is no further development on that branch; plaidml-v1 is a research project.


PlaidML is an advanced and portable tensor compiler for enabling deep learning on laptops, embedded devices, or other devices where the available computing hardware is not well supported or the available software stack contains unpalatable license restrictions.

PlaidML sits underneath common machine learning frameworks, enabling users to access any hardware supported by PlaidML. PlaidML supports Keras, ONNX, and nGraph.

As a component within the nGraph Compiler stack, PlaidML further extends the capabilities of specialized deep-learning hardware (especially GPUs), and makes it both easier and faster to access or make use of subgraph-level optimizations that would otherwise be bounded by the compute limitations of the device.

As a component under Keras, PlaidML can accelerate training workloads with customized or automatically-generated Tile code. It works especially well on GPUs, and it doesn't require use of CUDA/cuDNN on Nvidia hardware, while achieving comparable performance.

PlaidML works on all major operating systems: Linux, macOS, and Windows.

Building PlaidML from source

Due to its use of conda, PlaidML runs on all major Linux distributions.

export PLAIDML_WORKSPACE_DIR=[choose a directory of your choice]

# setting up miniconda env
cd ${PLAIDML_WORKSPACE_DIR}
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh
bash Miniconda3-py37_4.12.0-Linux-x86_64.sh -p ${PLAIDML_WORKSPACE_DIR}/miniconda3
eval "$(${PLAIDML_WORKSPACE_DIR}/miniconda3/bin/conda shell.bash hook)"
conda activate

# clone plaidml-v1 and set up env
git clone https://github.com/plaidml/plaidml.git --recursive -b plaidml-v1
cd plaidml
conda env create -f environment.yml -p .cenv/
conda activate .cenv/

# we might need to go into .cenv/bin and create a sym-link 
cd .cenv/bin/
ln -s ninja ninja-build
cd ../../

# preparing PlaidML build
./configure

# building PlaidML
cd build-x86_64/Release
ninja && PYTHONPATH=$PWD python plaidml/plaidml_setup.py

Demos and Related Projects

Plaidbench

Plaidbench is a performance testing suite designed to help users compare the performance of different cards and different frameworks.

cd build-x86_64/Release
ninja plaidbench_py && PYTHONPATH=$PWD KMP_AFFINITY=granularity=fine,verbose,compact,1,0 OMP_NUM_THREADS=8 python plaidbench/plaidbench.py -n128 keras resnet50

The command above is suited for 8-core Intel/AMD CPUs with hyper-threading enabled. For example, on an Intel i9-11900K we expect around 8.5 ms latency.

Reporting Issues

Open a ticket on GitHub.

CI & Validation

Validated Hardware

A comprehensive set of tests for each release is run against the hardware targets listed below.

  • AMD CPUs with AVX2 and AVX512
  • Intel CPUs with AVX2 and AVX512

Validated Networks

We support all of the Keras application networks from current Keras 2.x versions. Validated networks are tested for performance and correctness as part of our continuous integration system.

  • CNNs
    • Inception v3
    • ResNet50
    • VGG19
    • VGG16
    • Xception
    • DenseNet

tpp-mlir's People

Contributors

adam-smnk, alheinecke, chelini, devjiu, gmngeoffrey, hfp, jmgorius, kavithatipturmadhu, nhasabni, nicolasvasilache, rengolin, rolfmorel, shahidact, stephenneuendorffer, yifeizh2


tpp-mlir's Issues

Add a resnet-like test/benchmark

We need to break Resnet down into blocks and add those blocks as individual tests.

Later layers of a Resnet 50 have the "bottleneck", which replaces two 3x3 convolutions with { 1x1, 3x3, 1x1 }. This allows us to test BRGEMMs for the 1x1 and GEMMs for the 3x3. We can also use the first convolution to test a 7x7 filter.

The tasks in this issue are:

  1. Get the first bottleneck into a self-contained MLIR file in test/Benchmarks
    1. Add tpp-opt checks for the appropriate TPP operations
  2. Create fake dense weights as global memrefs and leave only the inputs as parameters
    1. Add tpp-run checks for output
  3. Add the MLIR file as a benchmark in benchmarks

Repeat for the 7x7 layer. We may also want to add the fully-connected layer at the end.

Once benchmarks are working with XSMM calls, implement an XSMM function that reproduces a bottleneck for comparison with the MLIR file.

Move our local `Transform` dialect out of `Linalgx` into `Dialects`

We'll always have transforms that don't belong upstream, so our local Transform dialect will stay in our repo, but we'll soon remove the Linalgx dialect (once #118 is done). Having a proper dialect directory for the transforms makes sense.

If we have generic transforms that could be accepted upstream, we can do that separately from upstreaming Linalgx, so these two dialects shouldn't be tied to the same directory.

Move the pack operation in the critical path in BF16 MLP example

Currently, the pack operation is inside the outer scf loop of the MLP example, which is on the critical path of execution. The pack operation needs to be executed once on the bigger tensor, and subviews must be used to read smaller tensor slices inside the scf.for loop.

Remove the usage of `sparse-compiler` in tests

The following tests are using the option -sparse-compiler for tpp-opt. Identify why we're using it and remove it if it is no longer relevant.

Here are the tests using it:

test/BF16/matmul-pbf16.mlir
test/BF16/mlp-all-bf16.mlir
test/BF16/mlp-single-layer-blocked-bf16.mlir
test/BF16/mlp-single-layer-bf16.mlir
test/Integration/relayout-gemm.mlir
test/Integration/conv-on-loops.mlir
test/Integration/conv-on-tensor.mlir
test/Integration/subview-on-tensor.mlir
test/Integration/relayout-more-interesting.mlir
test/Integration/conv-mapping-to-gemm.mlir

Weakly typed libxsmm runtime implementation

Libxsmm runtime utils are currently strongly typed, i.e., there's an invoke and dispatch call defined per datatype (f32/bf16). This needs to be modified so that the runtime is weakly typed, with a datatype argument per call.
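A minimal C++ sketch of the idea, assuming a hypothetical tpp_dispatch_matmul entry point and hypothetical per-type helpers standing in for the existing strongly typed calls:

#include <cstdint>

// Hypothetical datatype tag passed by compiler-generated code.
enum class TppDataType : int32_t { F32 = 0, BF16 = 1 };

// Stand-ins for the existing strongly typed helpers (declarations only in this sketch).
void *dispatch_matmul_f32(int64_t m, int64_t n, int64_t k);
void *dispatch_matmul_bf16(int64_t m, int64_t n, int64_t k);

// Single weakly typed entry point: the datatype becomes an argument instead
// of being baked into the symbol name.
extern "C" void *tpp_dispatch_matmul(int32_t dtype, int64_t m, int64_t n,
                                     int64_t k) {
  switch (static_cast<TppDataType>(dtype)) {
  case TppDataType::F32:
    return dispatch_matmul_f32(m, n, k);
  case TppDataType::BF16:
    return dispatch_matmul_bf16(m, n, k);
  }
  return nullptr; // unknown datatype
}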

Return a handle for `map_linalg_to_tpp`

This allows us to compose better:

// collect all the generics
%0 = transform.structured.match ops{["linalg.generic"]} in %arg1
// annotate generics
%1 = transform.structured.map_linalg_to_tpp in %0
// grep only the relus
%2 = transform.structured.match ops{["linalg.generic"]} attributes{library_call = "tpp.relu"} in %1
%3 = do something with the relus

Crash when building vector print in `tpp-run`

As reported in #45:

********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60..
FAIL: TPP_OPT :: Benchmarks/simple-gemm.mlir (69 of 71)
******************** TEST 'TPP_OPT :: Benchmarks/simple-gemm.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/asiemien/tpp-sandbox/build/bin/tpp-opt /home/asiemien/tpp-sandbox/test/Benchmarks/simple-gemm.mlir -map-linalg-to-tpp  -one-shot-bufferize="bufferize-function-boundaries allow-return-allocs function-boundary-type-conversion=identity-layout-map"  -drop-equivalent-buffer-results -finalizing-bufferize -canonicalize  -convert-linalg-to-tpp -convert-tpp-to-xsmm  -convert-xsmm-to-func |  /home/asiemien/tpp-sandbox/build/bin/tpp-run   -e entry -entry-point-result=void   -shared-libs=/home/asiemien/llvm-project/build/./lib/libmlir_c_runner_utils.so,/home/asiemien/tpp-sandbox/build/lib//libtpp_c_runner_utils.so |  /home/asiemien/llvm-project/build/bin/FileCheck /home/asiemien/tpp-sandbox/test/Benchmarks/simple-gemm.mlir
: 'RUN: at line 12';   /home/asiemien/tpp-sandbox/build/bin/tpp-opt /home/asiemien/tpp-sandbox/test/Benchmarks/simple-gemm.mlir -map-linalg-to-tpp  -one-shot-bufferize="bufferize-function-boundaries allow-return-allocs function-boundary-type-conversion=identity-layout-map"  -drop-equivalent-buffer-results -finalizing-bufferize -canonicalize  -convert-linalg-to-tpp | /home/asiemien/llvm-project/build/bin/FileCheck -check-prefix=TPP /home/asiemien/tpp-sandbox/test/Benchmarks/simple-gemm.mlir
--
Exit Code: 2

Command Output (stderr):
--
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /home/asiemien/tpp-sandbox/build/bin/tpp-run -e entry -entry-point-result=void -shared-libs=/home/asiemien/llvm-project/build/./lib/libmlir_c_runner_utils.so,/home/asiemien/tpp-sandbox/build/lib//libtpp_c_runner_utils.so
 #0 0x0000559cc3da24f3 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/asiemien/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:13
 #1 0x0000559cc3da0790 llvm::sys::RunSignalHandlers() /home/asiemien/llvm-project/llvm/lib/Support/Signals.cpp:105:18
 #2 0x0000559cc3da2b3f SignalHandler(int) /home/asiemien/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #3 0x00007f84ce8f5420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
 #4 0x0000559cc4da54d7 std::__uniq_ptr_impl<mlir::MLIRContextImpl, std::default_delete<mlir::MLIRContextImpl>>::_M_ptr() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:154:42
 #5 0x0000559cc4da54d7 std::unique_ptr<mlir::MLIRContextImpl, std::default_delete<mlir::MLIRContextImpl>>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:361:21
 #6 0x0000559cc4da54d7 std::unique_ptr<mlir::MLIRContextImpl, std::default_delete<mlir::MLIRContextImpl>>::operator*() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:347:10
 #7 0x0000559cc4da54d7 mlir::MLIRContext::getImpl() /home/asiemien/llvm-project/mlir/include/mlir/IR/MLIRContext.h:197:39
 #8 0x0000559cc4da54d7 mlir::RegisteredOperationName::lookup(llvm::StringRef, mlir::MLIRContext*) /home/asiemien/llvm-project/mlir/lib/IR/MLIRContext.cpp:750:21
 #9 0x0000559cc3d69bca mlir::RegisteredOperationName mlir::OpBuilder::getCheckRegisteredInfo<mlir::vector::PrintOp>(mlir::MLIRContext*) /home/asiemien/llvm-project/mlir/include/mlir/IR/Builders.h:443:5
#10 0x0000559cc3d69bca mlir::vector::PrintOp mlir::OpBuilder::create<mlir::vector::PrintOp, mlir::vector::TransferReadOp&>(mlir::Location, mlir::vector::TransferReadOp&) /home/asiemien/llvm-project/mlir/include/mlir/IR/Builders.h:458:20
#11 0x0000559cc3d69bca prepareMLIRKernel(mlir::Operation*) /home/asiemien/tpp-sandbox/build/../tpp-run/tpp-run.cpp:254:48
#12 0x0000559cc4dcda9d mlir::LogicalResult::failed() const /home/asiemien/llvm-project/mlir/include/mlir/Support/LogicalResult.h:44:33
#13 0x0000559cc4dcda9d mlir::failed(mlir::LogicalResult) /home/asiemien/llvm-project/mlir/include/mlir/Support/LogicalResult.h:72:58
#14 0x0000559cc4dcda9d mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) /home/asiemien/llvm-project/mlir/lib/ExecutionEngine/JitRunner.cpp:368:9
#15 0x0000559cc3d133c7 std::vector<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>, std::allocator<std::unique_ptr<mlir::DialectExtensionBase, std::default_delete<mlir::DialectExtensionBase>>>>::~vector() /usr/include/c++/9/bits/stl_vector.h:677:15
#16 0x0000559cc3d133c7 mlir::DialectRegistry::~DialectRegistry() /home/asiemien/llvm-project/mlir/include/mlir/IR/DialectRegistry.h:109:7
#17 0x0000559cc3d133c7 main /home/asiemien/tpp-sandbox/build/../tpp-run/tpp-run.cpp:287:19
#18 0x00007f84ce38f083 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24083)
#19 0x0000559cc3d64f1e _start (/home/asiemien/tpp-sandbox/build/bin/tpp-run+0x2dff1e)

I have seen similar errors before, but I thought I had fixed them all. Since this code is in constant flux at the moment (work in progress, tracked by #74), I may have already fixed it locally, too.

I'll test this again after the next merge, which should make it much easier to fix this problem, as I'm factoring out everything into smaller methods of a benchmark class.

@adam-smnk

More copies in `mlp-packed.mlir`

Upstream commit llvm/llvm-project@1b99f3a made bufferization more conservative. The copies are not needed, as we access different chunks in each iteration of the outermost loop.

// Minimal reproducer: tpp-opt -empty-tensor-to-alloc-tensor -one-shot-bufferize="allow-return-allocs bufferize-function-boundaries"
 #map0 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>
 #map1 = affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>
 #map2 = affine_map<(d0, d1, d2, d3) -> (d1, d2)>
 module @predict_function {
  func.func @main(%arg0: tensor<128x256xf32>, %arg1: tensor<256x512xf32>, %arg2: tensor<512xf32>, %arg3: tensor<512x1024xf32>, %arg4: tensor<1024xf32>, %arg5: tensor<1024x2048xf32>, %arg6: tensor<2048xf32>, %arg7: tensor<2048x1024xf32>, %arg8: tensor<1024xf32>, %arg9: tensor<128x1024xf32>, %arg10: tensor<128x2048xf32>, %arg11: tensor<128x1024xf32>, %arg12: tensor<128x512xf32>) -> tensor<4x32x32x32xf32> {
    %cst = arith.constant 1.000000e+00 : f32
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %c32 = arith.constant 32 : index
    %c0 = arith.constant 0 : index
    %c4 = arith.constant 4 : index
    %0 = tensor.empty() : tensor<4x64x32x32xf32>
    %inserted = tensor.insert %cst into %0[%c0, %c0, %c0, %c0] : tensor<4x64x32x32xf32>
    %1 = tensor.empty() : tensor<4x32x32x32xf32>
    %2 = scf.for %arg13 = %c0 to %c4 step %c1 iter_args(%arg14 = %1) -> (tensor<4x32x32x32xf32>) {
      %3 = tensor.empty() : tensor<32x32x32xf32>
      %extracted_slice = tensor.extract_slice %inserted[%arg13, 0, 0, 0] [1, 64, 32, 32] [1, 1, 1, 1] : tensor<4x64x32x32xf32> to tensor<64x32x32xf32>
      %4 = scf.for %arg15 = %c0 to %c64 step %c1 iter_args(%arg16 = %extracted_slice) -> (tensor<64x32x32xf32>) {
        %6 = tensor.empty() : tensor<32x32x32xf32>
        %extracted_slice_1 = tensor.extract_slice %arg16[%arg15, 0, 0] [1, 32, 32] [1, 1, 1] : tensor<64x32x32xf32> to tensor<32x32xf32>
        %7 = linalg.generic {indexing_maps = [#map0, #map1, #map2], iterator_types = ["reduction", "parallel", "parallel", "reduction"]} ins(%3, %6 : tensor<32x32x32xf32>, tensor<32x32x32xf32>) outs(%extracted_slice_1 : tensor<32x32xf32>) {
        ^bb0(%in: f32, %in_3: f32, %out: f32):
          %8 = arith.mulf %in, %in_3 : f32
          %9 = arith.addf %out, %8 : f32
          linalg.yield %9 : f32
        } -> tensor<32x32xf32>
        %inserted_slice_2 = tensor.insert_slice %7 into %arg16[%arg15, 0, 0] [1, 32, 32] [1, 1, 1] : tensor<32x32xf32> into tensor<64x32x32xf32>
        scf.yield %inserted_slice_2 : tensor<64x32x32xf32>
      }
      %extracted_slice_0 = tensor.extract_slice %arg14[%arg13, 0, 0, 0] [1, 32, 32, 32] [1, 1, 1, 1] : tensor<4x32x32x32xf32> to tensor<32x32x32xf32>
      %5 = scf.for %arg15 = %c0 to %c32 step %c1 iter_args(%arg16 = %extracted_slice_0) -> (tensor<32x32x32xf32>) {
        %6 = tensor.empty() : tensor<64x32x32xf32>
       %extracted_slice_1 = tensor.extract_slice %arg16[%arg15, 0, 0] [1, 32, 32] [1, 1, 1] : tensor<32x32x32xf32> to tensor<32x32xf32>
        %7 = linalg.generic {indexing_maps = [#map0, #map1, #map2], iterator_types = ["reduction", "parallel", "parallel", "reduction"]} ins(%4, %6 : tensor<64x32x32xf32>, tensor<64x32x32xf32>) outs(%extracted_slice_1 : tensor<32x32xf32>) {
        ^bb0(%in: f32, %in_3: f32, %out: f32):
          %8 = arith.mulf %in, %in_3 : f32
          %9 = arith.addf %out, %8 : f32
          linalg.yield %9 : f32
        } -> tensor<32x32xf32>
        %inserted_slice_2 = tensor.insert_slice %7 into %arg16[%arg15, 0, 0] [1, 32, 32] [1, 1, 1] : tensor<32x32xf32> into tensor<32x32x32xf32>
        scf.yield %inserted_slice_2 : tensor<32x32x32xf32>
      }
      %inserted_slice = tensor.insert_slice %5 into %arg14[%arg13, 0, 0, 0] [1, 32, 32, 32] [1, 1, 1, 1] : tensor<32x32x32xf32> into tensor<4x32x32x32xf32>
      scf.yield %inserted_slice : tensor<4x32x32x32xf32>
    }
    return %2 : tensor<4x32x32x32xf32>
  }
}

The problem is `%extracted_slice = tensor.extract_slice %inserted[%arg13, 0, 0, 0] [1, 64, 32, 32] [1, 1, 1, 1] : tensor<4x64x32x32xf32> to tensor<64x32x32xf32>`. Bufferization does not understand that we are reading different chunks of %inserted, so it has to assume we are reading the entire buffer. To fix the issue, we need to pass %inserted as a block argument into the loop.

[RFC] Drop support for pass-based optimization

As you know, our optimization flow is moving toward transform. Today we have two optimization flows: 1. transform, 2. pass-based. I would like to remove support for the pass-based one and use transform only. This would allow us to avoid duplicate tests and drop a quite significant amount of C++.
Example:
convolution2d_nchw_fchw.mlir is a subset of transform-pack.mlir
-decompose-conv-to-matmul-or-brgemm="enable-brgemm=true block-factors=32,32" is a quite involved pass that can be expressed as a composition of transform ops: https://github.com/plaidml/tpp-sandbox/blob/38529869d79938299a12d2e327b925a85d8e35f1/test/TPP/transform/transform-convolutions.mlir#L38

What do you think?

BF16 brgemm crash

BRGEMM throws a segfault when used with bf16 operands. The same example works just fine when the datatype is f32. The segfault is inside libxsmm's call.

Add multi-dimensional support to `tpp-run` memref return

Currently, tpp-run only supports 2D outputs because we have a simple loop for printing the values (using vectors the size of the inner dimension, iterating through the outer dimension).

The idea is to get the rank of the output memref and set up the scf.for loop accordingly. We still need to keep the vector small, or there are some codegen failures in MLIR when the vector size is too large.

Needs #74

Standalone/stdx-ops.mlir crash

FAIL: STANDALONE_OPT :: Standalone/stdx-ops.mlir (64 of 64)
******************** TEST 'STANDALONE_OPT :: Standalone/stdx-ops.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/lorenzo/tpp-sandbox/build/bin/standalone-opt /home/lorenzo/tpp-sandbox/test/Standalone/stdx-ops.mlir | /home/lorenzo/llvm-project/build/bin/FileCheck /home/lorenzo/tpp-sandbox/test/Standalone/stdx-ops.mlir
--
Exit Code: 2

Command Output (stderr):
--
realloc(): invalid pointer
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /home/lorenzo/tpp-sandbox/build/bin/standalone-opt /home/lorenzo/tpp-sandbox/test/Standalone/stdx-ops.mlir
1.	MLIR Parser: custom op parser 'func.func'
2.	MLIR Parser: custom op parser 'stdx.closure'
3.	MLIR Parser: custom op parser 'stdx.yield'
 #0 0x000055cb0c56a79a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/lorenzo/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:11
 #1 0x000055cb0c56a94b PrintStackTraceSignalHandler(void*) /home/lorenzo/llvm-project/llvm/lib/Support/Unix/Signals.inc:636:1
 #2 0x000055cb0c568f96 llvm::sys::RunSignalHandlers() /home/lorenzo/llvm-project/llvm/lib/Support/Signals.cpp:104:5
 #3 0x000055cb0c56b075 SignalHandler(int) /home/lorenzo/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007f44396af520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007f4439703a7c __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #6 0x00007f4439703a7c __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #7 0x00007f4439703a7c pthread_kill ./nptl/pthread_kill.c:89:10
 #8 0x00007f44396af476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007f44396957f3 abort ./stdlib/abort.c:81:7
#10 0x00007f44396f66f6 __libc_message ./libio/../sysdeps/posix/libc_fatal.c:155:5
#11 0x00007f443970dd7c ./malloc/malloc.c:5668:3
#12 0x00007f4439712b2c __libc_realloc ./malloc/malloc.c:3444:7
#13 0x000055cb0c4b075d llvm::safe_realloc(void*, unsigned long) /home/lorenzo/llvm-project/llvm/include/llvm/Support/MemAlloc.h:53:9
#14 0x000055cb0c4b0e32 llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long) /home/lorenzo/llvm-project/llvm/lib/Support/SmallVector.cpp:151:13
#15 0x000055cb06504368 llvm::SmallVectorTemplateCommon<mlir::Type, void>::grow_pod(unsigned long, unsigned long) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:142:3
#16 0x000055cb06504322 llvm::SmallVectorTemplateBase<mlir::Type, true>::grow(unsigned long) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:529:71
#17 0x000055cb0650fabc mlir::Type const* llvm::SmallVectorTemplateCommon<mlir::Type, void>::reserveForParamAndGetAddressImpl<llvm::SmallVectorTemplateBase<mlir::Type, true>>(llvm::SmallVectorTemplateBase<mlir::Type, true>*, mlir::Type const&, unsigned long) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:247:12
#18 0x000055cb0650fa45 llvm::SmallVectorTemplateBase<mlir::Type, true>::reserveForParamAndGetAddress(mlir::Type&, unsigned long) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:540:5
#19 0x000055cb064f73a6 llvm::SmallVectorTemplateBase<mlir::Type, true>::push_back(mlir::Type) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:566:23
#20 0x000055cb0650e0ea mlir::Type& llvm::SmallVectorTemplateBase<mlir::Type, true>::growAndEmplaceBack<>() /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:560:5
#21 0x000055cb0650e06f mlir::Type& llvm::SmallVectorImpl<mlir::Type>::emplace_back<>() /home/lorenzo/llvm-project/llvm/include/llvm/ADT/SmallVector.h:943:7
#22 0x000055cb0650e010 mlir::AsmParser::parseTypeList(llvm::SmallVectorImpl<mlir::Type>&)::'lambda'()::operator()() const /home/lorenzo/llvm-project/mlir/include/mlir/IR/OpImplementation.h:1123:41
#23 0x000055cb0650dfd5 mlir::ParseResult llvm::function_ref<mlir::ParseResult ()>::callback_fn<mlir::AsmParser::parseTypeList(llvm::SmallVectorImpl<mlir::Type>&)::'lambda'()>(long) /home/lorenzo/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
#24 0x000055cb0c1846f9 llvm::function_ref<mlir::ParseResult ()>::operator()() const /home/lorenzo/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
#25 0x000055cb0c17125e mlir::detail::Parser::parseCommaSeparatedList(mlir::AsmParser::Delimiter, llvm::function_ref<mlir::ParseResult ()>, llvm::StringRef) /home/lorenzo/llvm-project/mlir/lib/AsmParser/Parser.cpp:103:7
#26 0x000055cb0c1890cf mlir::detail::AsmParserImpl<mlir::OpAsmParser>::parseCommaSeparatedList(mlir::AsmParser::Delimiter, llvm::function_ref<mlir::ParseResult ()>, llvm::StringRef) /home/lorenzo/llvm-project/mlir/lib/AsmParser/AsmParserImpl.h:291:19
#27 0x000055cb084c460c mlir::stdx::YieldOp::parse(mlir::OpAsmParser&, mlir::OperationState&) /home/lorenzo/tpp-sandbox/build/include/Standalone/Dialect/Stdx/StdxOps.cpp.inc:346:3

bf16 support

#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
  func.func private @printMemrefBF16(memref<*xbf16>) attributes { llvm.emit_c_interface }

 func.func @matmultpp(%A: tensor<4x8xbf16>, 
          %B: tensor<8x4xbf16>, %C: tensor<4x4xbf16>) -> tensor<4x4xbf16> attributes {llvm.emit_c_interface} {
    %D = linalg.generic {indexing_maps = [#map0, #map1, #map2], 
                         iterator_types = ["parallel", "parallel", "reduction"]} 
    ins(%A, %B: tensor<4x8xbf16>, tensor<8x4xbf16>) outs(%C: tensor<4x4xbf16>) {
      ^bb0(%a: bf16, %b: bf16, %c: bf16):
        %0 = arith.mulf %a, %b : bf16
        %1 = arith.addf %c, %0 : bf16
        linalg.yield %1 : bf16
    } -> tensor<4x4xbf16>
    return %D : tensor<4x4xbf16>
  }

  func.func @entry() {
    %c0 = arith.constant 0 : index

    // Initialize various matrices, dense for stress testing,
    // and sparse to verify correct nonzero structure.
    %da = arith.constant dense<[
        [ 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1 ],
        [ 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2, 8.2 ],
        [ 1.3, 2.3, 3.3, 4.3, 5.3, 6.3, 7.3, 8.3 ],
        [ 1.4, 2.4, 3.4, 4.4, 5.4, 6.4, 7.4, 8.4 ]
    ]> : tensor<4x8xbf16>
    %db = arith.constant dense<[
        [ 10.1, 11.1, 12.1, 13.1 ],
        [ 10.2, 11.2, 12.2, 13.2 ],
        [ 10.3, 11.3, 12.3, 13.3 ],
        [ 10.4, 11.4, 12.4, 13.4 ],
        [ 10.5, 11.5, 12.5, 13.5 ],
        [ 10.6, 11.6, 12.6, 13.6 ],
        [ 10.7, 11.7, 12.7, 13.7 ],
        [ 10.8, 11.8, 12.8, 13.8 ]
    ]> : tensor<8x4xbf16>

    // Call kernel.
    %C = arith.constant dense<0.0> : tensor<4x4xbf16>
    %0 = call @matmultpp(%da, %db, %C)
       : (tensor<4x8xbf16>, tensor<8x4xbf16>, tensor<4x4xbf16>) -> tensor<4x4xbf16>

    //
    // CHECK:    ( ( 388.76, 425.56, 462.36, 499.16 ),
    // CHECK-SAME: ( 397.12, 434.72, 472.32, 509.92 ),
    // CHECK-SAME: ( 405.48, 443.88, 482.28, 520.68 ),
    // CHECK-SAME: ( 413.84, 453.04, 492.24, 531.44 ) )
    //
    %m0 = bufferization.to_memref %0 : memref<4x4xbf16>
    %U = memref.cast %m0 : memref<4x4xbf16> to memref<*xbf16>
    call @printMemrefBF16(%U) : (memref<*xbf16>) -> ()

    return
  }
}

Fix match `maxf(x, N)` to relu

As exposed in #128, a linalg.generic with arith.maxf(x, N) (with N any non-zero number) will map to ReLU, which is wrong.

To map to ReLU, N must be 0.0.
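A minimal sketch of the intended check, assuming MLIR's standard matchers; the helper name is hypothetical and only illustrates the condition the matching pass should enforce:

#include "llvm/ADT/APFloat.h"
#include "mlir/IR/Matchers.h"
#include "mlir/IR/Value.h"

// Hypothetical helper: only treat maxf(x, cst) as ReLU when cst is 0.0.
static bool isZeroFloatConstant(mlir::Value v) {
  llvm::APFloat cst(0.0);
  return mlir::matchPattern(v, mlir::m_ConstantFloat(&cst)) && cst.isZero();
}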

Perf dialect

With the new runner, we need to time the execution of the kernel only, but because the runner calls the JIT compiler, we can't separate compilation from execution, or the preparation of tensors from the kernel. If we had a simple dialect that allowed us to start/stop a timer and collect some statistics, we could have that in the input IR (or even add it as a pass), and then we wouldn't need any callbacks, etc.

We'll need a set of ops like:

  • %t = perf.start_timer: perf.timer_t for some definition of timer_t (could be a unique integer)
  • %delta = perf.stop_timer(%t): f64 for the delta from a specific timer; closes the timer (i.e., it can't take another stop, %t is dead)
  • perf.push(%list: memref<1000xf64>, %delta: f64): void for adding the current delta to a list of results
  • %mean = perf.mean(%list: memref<1000xf64>): f64 for the mean of those 1000 values
  • %dev = perf.stdev(%list: memref<1000xf64>): f64 for the standard deviation of those 1000 values

And a simple runtime implementation for those, calling std::chrono or something.
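A minimal sketch of what that runtime could look like in C++; all names are hypothetical, since the dialect and runtime interface are still to be defined:

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

using TimePoint = std::chrono::high_resolution_clock::time_point;

// perf.timer_t modelled as an index into a list of start times.
static std::vector<TimePoint> timers;

extern "C" int64_t perf_start_timer() {
  timers.push_back(std::chrono::high_resolution_clock::now());
  return static_cast<int64_t>(timers.size() - 1);
}

extern "C" double perf_stop_timer(int64_t id) {
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> delta =
      end - timers[static_cast<std::size_t>(id)];
  return delta.count(); // seconds; the timer id is dead after this call
}

extern "C" double perf_mean(const double *deltas, int64_t n) {
  double sum = 0.0;
  for (int64_t i = 0; i < n; ++i)
    sum += deltas[i];
  return sum / static_cast<double>(n);
}

extern "C" double perf_stdev(const double *deltas, int64_t n) {
  double mean = perf_mean(deltas, n);
  double acc = 0.0;
  for (int64_t i = 0; i < n; ++i)
    acc += (deltas[i] - mean) * (deltas[i] - mean);
  return std::sqrt(acc / static_cast<double>(n));
}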

If we get this right, this could be a simple dialect that could easily be upstreamed to MLIR directly without much effort.

Replace `mathx.relu` with `arith.maxf`

We have a mathx dialect with a relu operation, but we don't really need it; we can just use arith.maxf with 0.0.

It'd also be good to transform other linalg.generic patterns, like cmp(gt, 0.0)+select, into arith.maxf, so that we can detect and use them in our transforms.

One day, these will all be tcp.clamp, but for now, we have to make do with what we have.

After replacing it in our passes and tests, we should delete the mathx dialect from our tree.

More integration tests when mapping from linalg to tpps

Test1:

#map = affine_map<(d0, d1, d2, d3) -> (d3)>
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
func.func @main(%arg0: memref<64xf32, strided<[?], offset: ?>>) {
  // fill arg0
  %alloc = memref.alloc() {alignment = 128 : i64} : memref<12x56x56x64xf32>
  linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"], library_call = "tpp.identity"} ins(%arg0 : memref<64xf32, strided<[?], offset: ?>>) outs(%alloc : memref<12x56x56x64xf32>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
  }
  return 
}

Test2:

#map2 = affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>
func.func @main(%arg0: tensor<32xf32>) {
  // fill arg0.
  %alloc_2 = memref.alloc() {alignment = 128 : i64} : memref<12x2x56x56x32xf32>
  // broadcast arg0 into the alloc_2
  scf.for %arg3 = %c0 to %c12 step %c1 {
    scf.for %arg4 = %c0 to %c2 step %c1 {
      scf.for %arg5 = %c0 to %c56 step %c1 {
        %subview_4 = memref.subview %alloc_2[%arg3, %arg4, %arg5, 0, 0] [1, 1, 1, 56, 32] [1, 1, 1, 1, 1] : memref<12x2x56x56x32xf32> to memref<1x1x1x56x32xf32, strided<[200704, 100352, 1792, 32, 1], offset: ?>>
        linalg.generic {indexing_maps = [#map2], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"], library_call = "tpp.relu"} outs(%subview_4 : memref<1x1x1x56x32xf32, strided<[200704, 100352, 1792, 32, 1], offset: ?>>) {
          ^bb0(%out: f32):
            %0 = mathx.relu %out : f32
            linalg.yield %0 : f32
        }
      }
    }
  }
  return
}

To fill the tensor we can use something similar to @generate_2D_source (iree-org/iree@5a37968); of course we need the 1D version here (or you can make arg0 a 2D tensor).

Fix benchmarks

We only run matmul on CI, and it seems MLP is broken with a non-existent option from tpp-opt:

tpp-opt: Unknown command line argument '-block-matmul-layout=block-factors=2,2'.  Try: 'tpp-opt --help'
tpp-opt: Did you mean '--pack-matmul=block-factors=2,2'?

We need to fix the scripts and add all benchmarks to the CI.

Check dialect changes

Compile the checks into calls to an assert at runtime only, rather than lowering the whole operator into a runtime call. This avoids weakly typing the runtime utilities in order to support different datatypes.
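A minimal sketch of the datatype-agnostic runtime piece this would leave us with; the name and signature are hypothetical, and the element-wise comparison itself would be generated as IR, with only the failure path calling into the runtime:

#include <cstdio>
#include <cstdlib>

// Hypothetical runtime hook: the compiler emits the comparison in IR and
// only calls this with the resulting flag, so no per-datatype entry points
// are needed.
extern "C" void tpp_check_assert(bool ok, const char *msg) {
  if (ok)
    return;
  std::fprintf(stderr, "check failed: %s\n", msg);
  std::abort();
}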

Build the tpp compiler with sanitizers on CI

As exposed here, we have memory leaks in the code.

We need to either always build the compiler with sanitizers or create a separate build for sanitizers. Since the compiler is small and we don't use it for other things, we might just do the former.

@chelini @hfp

FP16 vs BF16 tests

Add a test that uses the check dialect to verify that tensors generated by lowering types with tpp_identity (fp16 to bf16) are almost equal.

Use Github Actions

We're looking into getting pre-merge checks from pull requests, but our CI runs on local machines. To avoid running arbitrary code locally, we don't allow PRs to trigger CI.

This can be solved with GitHub Actions, at least for some basic testing (build + check), which should be enough for pre-merge. Like LLVM, if we have a problem that hasn't been found by pre-merge tests, we create an artificial IR-to-IR test, which is quick and runs well on GitHub builders.

There's an initial implementation on branch github-actions. The LLVM and sandbox builds work fine, the integration between them doesn't.

Anyone with a few days to spare and a weird passion for infrastructure jobs can pick where I left off. :)

@hfp

Map linalg to BF16 TPP ops

Now that we have TPP ops working with BF16, we need to lower BF16 tensor types to VNNI packing.

This should be a matter of packing the tensors at the right time, moving the VNNI elements contiguously, and adding a dimension to the linalg.generic, making sure we can traverse them with the right maps (compatible with VNNI), and then recognising those maps to emit TPP VNNI calls.

@KavithaTipturMadhu, as discussed earlier.
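For intuition, here is the VNNI-2 layout in plain C++; uint16_t stands in for bf16, and this only illustrates the data movement, not the actual packing pass:

#include <cstdint>
#include <vector>

// Pack a KxN bf16 matrix (row-major) into VNNI-2 layout, shape [K/2][N][2],
// so the two values reduced together sit next to each other in memory.
// K is assumed to be even in this sketch.
std::vector<uint16_t> packVNNI2(const std::vector<uint16_t> &src, int64_t K,
                                int64_t N) {
  std::vector<uint16_t> dst(src.size());
  for (int64_t k = 0; k < K; k += 2)
    for (int64_t n = 0; n < N; ++n)
      for (int64_t v = 0; v < 2; ++v)
        dst[(k / 2) * N * 2 + n * 2 + v] = src[(k + v) * N + n];
  return dst;
}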

Add multi-core support for tile calls

After blocking and tiling, we end up with a nested loop of TPP calls to GEMM/BRGEMM. Those can run in parallel, but currently, we don't do that.

A simple idea is to insert OpenMP ops to parallelise the execution of the outer loops once we know what the tiles look like. Another option is to use scf.parallel or scf.foreach_thread and make sure that propagates correctly to the back-end.
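For illustration only, this is the parallel structure we are after, written as plain C++/OpenMP with a hypothetical tile kernel; in the compiler it would be expressed with scf.parallel or omp ops instead:

#include <cstdint>

// Stub standing in for a JIT-ed TPP/XSMM tile kernel (xsmm.invoke in the IR).
void tile_brgemm(const float *a, const float *b, float *c, int64_t tile) {
  (void)a; (void)b; (void)c; (void)tile;
}

// After blocking and tiling, the outer tile loops can run on multiple cores
// while each tile call stays sequential.
void mlp_layer(const float *a, const float *b, float *c, int64_t numTilesM,
               int64_t numTilesN) {
#pragma omp parallel for collapse(2)
  for (int64_t m = 0; m < numTilesM; ++m)
    for (int64_t n = 0; n < numTilesN; ++n)
      tile_brgemm(a, b, c, m * numTilesN + n);
}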

Later, when we have some end-to-end execution with IREE or OpenXLA, we can use their runtime to run in parallel.

Upstream the `perf` dialect

Once #100 is complete, we should try to upstream the dialect. Either as a separate dialect or as part of another one; it doesn't matter. But it seems like a simple and valuable piece of infrastructure that other people might want.

TPP equations support

Add support for TPP equations, examples here:
https://github.com/libxsmm/libxsmm/tree/main/samples/equation

The main point of equations is that you can dispatch multiple instructions at once, and the JIT compiler will optimize (fuse, vectorize) based on the sequence of ops, not individually. This can bring significant speedups.

Today we do fusion (e.g., GEMM+ReLU), but that's a special case. We want to move to a generic case, where we can fuse anything that the implementation knows how to handle. We'd need information from the library on what it can fuse, though.

As an example, in pseudo-IR, we currently do:

// Original IR
%a, %b ...
%c = linalg.identity()
linalg.generic(%a, %b, %c) { %m = arith.mul(%an, %bk); arith.add(%m, %cn) }
linalg.generic(%c) { arith.maxf(%i, 0) }

// TPP IR
%c = tensor.empty()
scf.for (%m) {
  scf.for(%k) {
    scf.for(%n) {
      %c = tpp.brgemm_with_init_and_relu(%a, %b, %c)
} } }

// XSMM IR
%c = tensor.empty()
%call = xsmm.dispatch(brgemm_with_init_and_relu)
scf.for (%m) {
  scf.for(%k) {
    scf.for(%n) {
      %c = xsmm.invoke(%call, %m, %k, %n)
} } }

But that dispatch is a special version that only works for that fusion.

With equations, we'd have a generic mechanism for many other patterns:

// Original IR
%a, %b ...
%c = linalg.identity()
linalg.generic(%a, %b, %c) { %m = arith.mul(%an, %bk); %cn = arith.sub(%m, %cn); %cn = arith.add(%m, 42); }

// TPP IR
%c = tensor.empty()
scf.for (%m) {
  scf.for(%k) {
    scf.for(%n) {
      tpp.group {
        %x = tpp.mul(...)
        %y = tpp.sub(...)
        %c = tpp.add(...)
      }
} } }

// XSMM IR - Today
%c = tensor.empty()
%mcall = xsmm.dispatch(mul)
%scall = xsmm.dispatch(sub)
%acall = xsmm.dispatch(add)
scf.for (%m) {
  scf.for(%k) {
    scf.for(%n) {
      %x = xsmm.invoke(%mcall, ...)
      %y = xsmm.invoke(%scall, ...)
      %c = xsmm.invoke(%acall, ...)
} } }

// XSMM IR - With Equations
%c = tensor.empty()
%call = xsmm.dispatch(mul, sub, add)
scf.for (%m) {
  scf.for(%k) {
    scf.for(%n) {
      %c = xsmm.invoke(%call, %m, %k, %n)
} } }

The shape of the dialect is still undefined, as we gather more ideas on how to do that, but the basic idea is to group the fused ops and issue a single dispatch, while controlling which fused op refers to what original op, in sequence.

LinalgX: relayout

The relayout op should be upstreamed. But we want to have pack and unpack operations, not in Linalg but in the tensor and then the memref dialect. We are not supposed to use maps anymore, but a combination of attributes (tiling factors along the dimensions + a transpose describing how you want to relayout the dimensions).

Warnings with Clang

Using clang, I get these warnings when building:

tpp-rt/CheckRunnerUtils.cpp:43:12: warning: using integer absolute value function 'abs' when argument is of floating point type [-Wabsolute-value]
    assert(abs(addr_a[i] - addr_b[i]) <= C && "Result mismatch");
           ^
tpp-rt/CheckRunnerUtils.cpp:43:12: note: use function 'std::abs' instead
    assert(abs(addr_a[i] - addr_b[i]) <= C && "Result mismatch");
           ^~~
           std::abs
/usr/include/assert.h:90:27: note: expanded from macro 'assert'
     (static_cast <bool> (expr)                                         \
                          ^
lib/TPP/ConvertTppToXsmm.cpp:146:14: warning: implicit conversion turns string literal into bool: 'const char[34]' to 'bool' [-Wstring-conversion]
             "Element type neither bf16 nor f32");
             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/assert.h:90:27: note: expanded from macro 'assert'
     (static_cast <bool> (expr)                                         \
                          ^~~~

There are some warnings on xsmm-src but we don't care about those.
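For reference, a small C++ sketch of the two fixes the warnings suggest; the surrounding code is simplified, not the actual CheckRunnerUtils / ConvertTppToXsmm code:

#include <cassert>
#include <cmath>

void checkResult(const float *addr_a, const float *addr_b, float C, int i) {
  // -Wabsolute-value: use std::abs (or std::fabs) for floating-point values,
  // not the integer abs().
  assert(std::abs(addr_a[i] - addr_b[i]) <= C && "Result mismatch");
}

void checkType(bool isBf16, bool isF32) {
  // -Wstring-conversion: tie the message to the condition with &&, so the
  // string literal is never evaluated as the condition itself.
  assert((isBf16 || isF32) && "Element type neither bf16 nor f32");
}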

Add run/benchmark options to `tpp-opt` / `tpp-runner`

The tpp-opt tool works with MLIR and could very well JIT-compile, run (replacing the call to mlir-cpu-runner), and benchmark. Alternatively, we can create a new tpp-cpu-runner from mlir-cpu-runner, like tpp-opt was created from mlir-opt. Whatever is easier.

Either way, we'll need to:

  • Link with the right libraries (libxsmm, libtpp_c_runtime, openmp, etc)
  • Create a wrapper function in MLIR to prepare the input tensors
  • Warm-up libxsmm calls, time calls as close as possible, avoid tensor preparation
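As a sketch of the last point (warm up first, then time only the kernel calls), assuming a hypothetical kernel entry point produced by the JIT:

#include <chrono>
#include <cstdio>

// Stub standing in for the JIT-compiled MLIR kernel entry point.
void kernel() {}

void benchmark(int warmupIters, int iters) {
  // Warm-up: triggers libxsmm dispatch/JIT so it is not measured below.
  for (int i = 0; i < warmupIters; ++i)
    kernel();

  // Time only the kernel calls; tensor preparation happens before this loop.
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iters; ++i)
    kernel();
  auto end = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double, std::milli> total = end - start;
  std::printf("mean latency: %f ms\n", total.count() / iters);
}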

Future stuff:

  • Use the check dialect to test MLIR execution
  • Use the perf dialect to time inside kernels

Install gtest on cluster?

CI says: gtest not found, unittests will not be available.

We don't really use GTest, as our unit tests are just the standalone smoke tests. But could we make more use of it?

@chelini @KavithaTipturMadhu any intent on using GTest or should we just drop the requirement (and delete the unittest directory)?

@hfp

Add Multi-Head-Attention test/benchmark

To test our transforms in complex code, we should use the most common target nowadays: the attention layer.

The following diagram shows the attention layer (and its larger cousin), extracted from the third link below:
[figure: multi-head attention]

Here are some links to understand what it means:
https://paperswithcode.com/method/multi-head-attention
https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853
https://medium.com/@smitasasindran/12-attention-mechanisms-multihead-attention-958041a35553

You can generate the whole thing with Keras or PyTorch, then export to MLIR.

Move `mlp-all.mlir` from `TPP` to `Benchmarks`

We already have an MLP file in test/Benchmarks but that's a single layer and the one in TPP has four.

This issue is about moving the four-layer one into the benchmark folder and improving the MLP C++ benchmark to cope with multiple layers.

Should be done after #175 is finished, so we only do it once.

Upstream `pack` and `unpack`

We only have two ops in linalgx: pack and unpack. These ops have been in IREE for a while, but we need to make sure they make it to linalg in MLIR, or we'll be stuck with this forever.

We need to create a few examples in pure linalg code and just push those ops into linalg itself as an upstream patch on Phabricator. This will give more people, outside of IREE, visibility on what we're trying to do and help us merge this soon.

I'm hoping @nicolasvasilache can help us here. We're not aiming at perfect, we're just aiming at minimally acceptable, so that we can iterate upstream rather than carrying this with us all the time.

Even if the semantics aren't completely clear, we can just add a bunch of asserts to make sure only the things we can prove are done; everything else fails validation. In time, we can add more functionality upstream once we're clear on what is needed and how to implement it.

Waiting until we have a complete picture won't work, because we need to implement things to get a better picture, and right now we can't implement everything that everyone may one day want, because we don't know what that is, nor do we have the need for it.

Remove the `stdx` dialect

Currently, we have only one construct in that dialect, closure (plus yield), but we're not using it, so we should just delete the dialect and make sure that anything that was relying on it (hoisting weights, for example) can still be done without it.

Curious warning

/home/rengolin/devel/intel/tpp-sandbox/lib/TPP/IteratorCollapsing.cpp:368:17: warning: variable 'collapsedSize' set but not used [-Wunused-but-set-variable]
        int64_t collapsedSize = shape[pos];
                ^
1 warning generated.

Seems harmless enough, this is the code:

    while (pos < origRank) {
      currentOperandReassociation.push_back(getAffineDimExpr(pos, context));
      if (isCollapsedDim(pos)) {
        int64_t collapsedSize = shape[pos];
        while (pos + 1 < origRank && isCollapsedDim(pos + 1)) {
          ++pos;
          collapsedSize *= shape[pos];
          currentOperandReassociation.push_back(getAffineDimExpr(pos, context));
        }
      } else {
        // No reassociation, reassociation is valid.
        isValidGroupReass = true;
      }

Apart from that write, it's never read from. I'm wondering if that's just a left-over, or if there's something else missing elsewhere, too?

@chelini

VNNI BF16 Layout matmul example with a new datatype (vnni_bf16) corresponding to tuple (<bf16,bf16>)

#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
module {
 func.func @matmultpp(%A: tensor<2x8xvnni_bf16>,
          %B: tensor<8x4xbf16>, %C: tensor<2x4xvnni_bf16>) -> tensor<2x4xvnni_bf16> attributes {llvm.emit_c_interface} {
    %D = linalg.generic {indexing_maps = [#map0, #map1, #map2],
                         iterator_types = ["parallel", "parallel", "reduction"]}
    ins(%A, %B: tensor<2x8xvnni_bf16>, tensor<8x4xbf16>) outs(%C: tensor<2x4xvnni_bf16>) {
      ^bb0(%a: vnni_bf16, %b: bf16, %c: vnni_bf16):
        %0 = arith.vnni_mulf %a, %b : vnni_bf16
        %1 = arith.vnni_addf %c, %0 : vnni_bf16
        linalg.yield %1 : vnni_bf16
    } -> tensor<2x4xvnni_bf16>
    return %D : tensor<2x4xvnni_bf16>
  }

  func.func @entry() {
    %c0 = arith.constant 0 : index

    // Initialize various matrices.
    %da = arith.constant dense<[
        [ 1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1 ],
        [ 1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2, 8.2 ],
        [ 1.3, 2.3, 3.3, 4.3, 5.3, 6.3, 7.3, 8.3 ],
        [ 1.4, 2.4, 3.4, 4.4, 5.4, 6.4, 7.4, 8.4 ]
    ]> : tensor<4x8xbf16>
    %da_cast = linalgx.reorder %da: tensor<4x8xbf16> to tensor<2x8xvnni_bf16>
    %db = arith.constant dense<[
        [ 10.1, 11.1, 12.1, 13.1 ],
        [ 10.2, 11.2, 12.2, 13.2 ],
        [ 10.3, 11.3, 12.3, 13.3 ],
        [ 10.4, 11.4, 12.4, 13.4 ],
        [ 10.5, 11.5, 12.5, 13.5 ],
        [ 10.6, 11.6, 12.6, 13.6 ],
        [ 10.7, 11.7, 12.7, 13.7 ],
        [ 10.8, 11.8, 12.8, 13.8 ]
    ]> : tensor<8x4xbf16>

    // Call kernel.
    %C = arith.constant dense<0.0> : tensor<2x4xvnni_bf16>
    %0 = call @matmultpp(%da_cast, %db, %C)
       : (tensor<2x8xvnni_bf16>, tensor<8x4xbf16>, tensor<4x4xbf16>) -> tensor<4x4xbf16>

    return
  }
}

`mlp_kernel.mlir` in benchmarks doesn't work without `-pre-bufferize`

We're removing -pre-bufferize from tests, but the benchmark mlp_kernel.mlir, which is now moving to the test infra, is badly transformed (and produces wrong results) without it.

When I initialize the source tensors with all 1s and expect the results to be all 9s, I still get all 1s (as if the tensor hadn't changed).

I haven't looked at the problem yet, but my guess is some alloc / copy that isn't being transformed correctly to memrefs.

Once #140 is merged, the reproduction is just to remove the -pre-bufferize option from the test in test/Benchmarks and run check-tpp-opt.

Alternatively, remove the option from benchmarks/mlp/run-mlp.sh and run that benchmark.

Parallelization of SCF loops

Parallelize the SCF for loops into SCF parallel loops, either using Nicolas' parallelization technique or by adding static analysis to figure out the dependences across SCF for loops and convert them appropriately into scf.parallel ops.

Upstream the `check` dialect

Once we're happy with the check dialect (#88) locally, we should try to upstream it. It's possible that other people want something similar, and it would be good if we didn't have to carry it ourselves in this repo.
