thu-dsp-lab / llvm-project Goto Github PK

LLVM OpenCL C compiler suite for ventus GPGPU

License: Other

C++ 39.41% Shell 0.03% CMake 0.32% Assembly 9.83% C 12.51% Python 1.07% LLVM 35.95% Dockerfile 0.01% HTML 0.20% CSS 0.01% JavaScript 0.02% Emacs Lisp 0.01% Batchfile 0.01% Objective-C 0.46% Objective-C++ 0.10% HLSL 0.01% Cuda 0.07% Pawn 0.01% SWIG 0.01% Rust 0.01%

llvm-project's Introduction

This is the Ventus GPGPU port of LLVM Compiler Infrastructure

Ventus GPGPU is based on RISCV RV32IMAZfinxZve32f ISA with fully redefined concept of V-extension.

The Ventus GPGPU OpenCL compiler based on LLVM is developed by Terapines Technology (Wuhan) Co., Ltd

承影GPGPU OpenCL编译器由Terapines(兆松科技)负责开发

For more architecture detail, please refer to Ventus GPGPU Arch

Getting Started

1: Programs related repositories

Download all the repositories firstly and place them in the same path.

llvm-ventus : git clone https://github.com/THU-DSP-LAB/llvm-project.git
pocl : git clone https://github.com/THU-DSP-LAB/pocl.git
ocl-icd : git clone https://github.com/OCL-dev/ocl-icd.git
isa-simulator(spike) : git clone https://github.com/THU-DSP-LAB/ventus-gpgpu-isa-simulator.git
driver : git clone https://github.com/THU-DSP-LAB/ventus-driver.git

ATTENTION: Remember to check branch for every repository, cause the project are under development, if you get any build errors, feel free to give an issue or just contact authors

2: Build all the programs

Our program is based on LLVM, so the need packages to build ventus are almost the same as what are needed to build LLVM, you can refer to official website for detailed llvm building guidance, we just list most important needed packages here.

ccache
cmake
ninja
clang

If you see any packages missing information, just install them

The following packages are needed for other repositories:

device-tree-compiler
bsdmainutils

Run ./build-ventus.sh to automatically build all the programs, but we need to run firstly

For developers who want to build Debug version for llvm, export BUILD_TYPE=Debug, since it's set default to be 'Release'
export POCL_DIR=<path-to-pocl-dir>, default folder path will be set to be <llvm-ventus-parentFolder>/pocl
export OCL_ICD_DIR=<path-to-ocl-icd-dir>, default folder path will be set to be <llvm-ventus-parentFolder>/ocl-icd

You can dive into build-ventus.sh file to see the detailed information about build process

3: Bridge icd loader

Run export VENTUS_INSTALL_PREFIX=<path_to_install> to set VENTUS_INSTALL_PREFIX environment variable(system environment variable recommended)

Run export LD_LIBRARY_PATH=${VENTUS_INSTALL_PREFIX}/lib to tell OpenCL application to use your own built libOpenCL.so, also to correctly locate LLVM shared libraries

Run export OCL_ICD_VENDORS=${VENTUS_INSTALL_PREFIX}/lib/libpocl.so to tell ocl icd loader where the icd driver is.

Finally, run export POCL_DEVICES="ventus" to tell pocl driver which device is available(should we set ventus as default device?).

You will see Ventus GPGPU device is found if your setup is correct.

NOTE: OpenCL host side program should be linked with icd loader -lOpenCL.

$ <pocl-install-dir>/bin/poclcc -l

LIST OF DEVICES:
0:
  Vendor:   THU
    Name:   Ventus GPGPU device
 Version:   2.2 HSTR: THU-ventus-gpgpu

Also, you can try to set POCL_DEBUG=all and run example under <pocl-build-dir> to see the full OpenCL software stack execution pipeline. For example(Work in progress).

4: Compiler using example

we can now use our built compiler to generate an ELF file, and using spike to complete the isa simulation.

Cause the address space requirement in spike, we use a customized linker script for our compiler

Take vecadd.cl below as an example :

__kernel void vectorAdd(__global float* A, __global float* B) {
  unsigned tid = get_global_id(0);
  A[tid] += B[tid];
}

4.1: Generate ELF file

4.1.1 Compile directly

Remember to build libclc too because we need the libclc library

Use command line under the root directory of llvm-ventus

./install/bin/clang -cl-std=CL2.0 -target riscv32 -mcpu=ventus-gpgpu vecadd.cl  ./install/lib/crt0.o -L./install/lib -lworkitem -I./libclc/generic/include -nodefaultlibs ./libclc/riscv32/lib/workitem/get_global_id.cl -O1 -cl-std=CL2.0 -Wl,-T,utils/ldscripts/ventus/elf32lriscv.ld -o vecadd.riscv

4.1.2 Compile step-by-step

Compile OpenCL code to LLVM IR assembly (.ll file):

./install/bin/clang -S -cl-std=CL2.0 -target riscv32 -mcpu=ventus-gpgpu vecadd.cl -emit-llvm -o vecadd.ll

Compile LLVM IR to RISC-V assembly or object file:

./install/bin/llc -mtriple=riscv32 -mcpu=ventus-gpgpu vecadd.ll -o vecadd.s

./install/bin/llc -mtriple=riscv32 -mcpu=ventus-gpgpu --filetype=obj vecadd.ll -o vecadd.o

Link essential library: Linking crt0 and libclc All the libclc workitem functions' implementation is included in riscv32clc.o

./install/bin/ld.lld -o vecadd.riscv -T utils/ldscripts/ventus/elf32lriscv.ld vecadd.o ./install/lib/crt0.o ./install/lib/riscv32clc.o -L./install/lib -lworkitem --gc-sections

4.1.3 Compile assembly code to object file (`.s` to `.o`)

Take custome instructions custome.s as an example :

vftta.vv v0, v0, v1
vfexp v0, v1
vadd12.vi v0, v1, 8

./install/bin/clang -c -target riscv32 -mcpu=ventus-gpgpu custom.s -o custom.o

4.2: Dump file

./install/bin/llvm-objdump -d --mattr=+v,+zfinx vecadd.riscv >& vecadd.txt

you will see output like below, 0x80000000 is the space address required by spike for _start function, this is the reason why we use a customized linker script

vecadd.riscv:	file format elf32-littleriscv

Disassembly of section .text:

80000000 <_start>:
80000000: 97 21 00 00  	auipc	gp, 2
80000004: 93 81 01 80  	addi	gp, gp, -2048
80000008: 93 0e 00 02  	li	t4, 32
8000000c: d7 fe 0e 0d  	vsetvli	t4, t4, e32, m1, ta, ma
80000010: b7 2e 00 00  	lui	t4, 2
80000014: f3 ae 0e 30  	csrrs	t4, mstatus, t4
80000018: 93 0e 00 00  	li	t4, 0
8000001c: 73 21 60 80  	csrr	sp, 2054
80000020: 73 22 70 80  	csrr	tp, 2055

80000024 <.Lpcrel_hi1>:
80000024: 17 15 00 00  	auipc	a0, 1
80000028: 13 05 85 fe  	addi	a0, a0, -24

....
....
....

or you can check encoding of custom instructions

./install/bin/llvm-objdump -d --mattr=+v,+zfinx custom.o >& custom.txt

custom.o:       file format elf32-littleriscv

Disassembly of section .text:

00000000 <.text>:
       0: 0b c0 00 0e   vftta.vv        v0, v0, v1
       4: 0b 60 10 0a   vfexp   v0, v1
       8: 0b 80 80 00   vadd12.vi       v0, v1, 8

4.3: Running in spike

We need to run the isa simulator to verify our compiler

Use spike from THU and follow the README.md

4.4: Driver using example

Accordingly, after all the building process, you can change directory to <llvm-ventus-parentFolder>/pocl/build/examples/vecadd directory, then export variables as what Bridge icd loader does, finally just execute the file vecadd

5: Github actions

the workflow file is .github/workflows/ventus-build.yml, including below jobs

Build llvm
Build ocl-icd
Build libclc
Build isa-simulator
Build sumulator-driver
Build pocl
Isa simulation test
GPU-rodinia testsuite
Pocl testing

6: Docker image

If the user needs to build the toolchain of the Ventus project in an environment other than Ubuntu, such as the CentOS system, we provide the Dockerfile for building the CentOS image. The file is under '.github/workflows/containers/dockerfiles'.

Note: When using build-ventus.sh to build the instantiated centos container, the following modifications are required, which are different from the above "2: Build all the programs":

--- a/build-ventus.sh
+++ b/build-ventus.sh
@@ -119,6 +119,8 @@ build_llvm() {
     -DLLVM_CCACHE_BUILD=ON \
     -DLLVM_OPTIMIZED_TABLEGEN=ON \
     -DLLVM_PARALLEL_LINK_JOBS=12 \
+    -DCMAKE_C_COMPILER=clang \
+    -DCMAKE_CXX_COMPILER=clang++ \
     -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
     -DLLVM_ENABLE_PROJECTS="clang;lld;libclc" \
     -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86;RISCV" \
@@ -232,7 +234,7 @@ export_elements() {
   export SPIKE_TARGET_DIR=${VENTUS_INSTALL_PREFIX}
   export VENTUS_INSTALL_PREFIX=${VENTUS_INSTALL_PREFIX}
   export POCL_DEVICES="ventus"
-  export OCL_ICD_VENDORS=${VENTUS_INSTALL_PREFIX}/lib/libpocl.so
+  export OCL_ICD_VENDORS=${VENTUS_INSTALL_PREFIX}/lib64/libpocl.so
 }

llvm-project's People

Contributors

Stargazers

Watchers

Forkers

minisparrow yangkex code-fool gmh5225 jules-kong mfkiwl wangqinfan zhoujingya mccyx ziliangzl reram-labs summersurface cooperyuan

llvm-project's Issues

barrier支持带memory scope参数

目前硬件那边需要支持barrier函数带memory_scope参数。

[CTS] global_work_offsets和get_linear_ids错误

llvm中global id没有考虑offset，且计算global linear id中有bug

Join指令插入的顺序

之前是默认插入到PostIDomBB的最前面，但是目前vmv的设计好像有点问题，见#16
1 : 当vmv指令出现在分支时。拷贝操作应该是当前线程mask为1时才进行
2 ：join操作应该是在BB的最后一个vmv指令之后

访存指令匹配的错误

在tablegen中写访存指令的匹配，有一定的局限性，还需要在源代码中修改以解决访存指令错误的问题

寄存器和内存资源的统计

目前已经差不多完成了统计工作，后续发现问题再修改。

[CTS] enqueued_local_size的问题

目前llvm提供的libclc函数里面缺少enqueued_local_size的workitem函数实现

[CTS] basic/if failed

run test_basic if failed
simple change case to test
A. set 32 threads , run test_basic if 32
B.
run failed
input data has 0x1 ,but output data is 0x7fffffff

[gpu-rodiania] illegal fmv.w.x instruction

Using the ventus-llvm commit :b9fa6ff

run the gpu-rodinia project - gaussian test, you will get illegal instruction as below

d3 82 03 f0  	fmv.w.x	t0, t2

but as hardware colleges commented, this istruction is llegal, so when we need to bitcast f32(GPRF32) to i32(GPR), we can all use vmv.v.x instruction

多参数处理的问题

在ventus中，目前只有printf函数的实现是用了多参数。本身代码是用C语言写的，见printf分支代码，以下面的代码为例

int *Buffer = (int *)0xa0024000;
int pos = -1;
int printf(const __attribute__((address_space(1)))  char *format, ...) {
  __builtin_va_list arg_list;
  __builtin_va_start(arg_list, format);
  int pos = -1;
  // 往这个地址写东西： print_buffer_addr
  while (*format != '\0') {
    pos++;
    if (*format == '%') {
      format++;
      if (*format == 'd') {
        int val = __builtin_va_arg(arg_list, int);
        Buffer[pos] = val;
      }
      format++;
    } else {
      // printf("%c", *format);
      Buffer[pos] = *format;
      format++;
    }
  }
  __builtin_va_end(arg_list);
}

标准的riscv32代码如下:

printf:                                 # @printf
# %bb.0:
        addi    sp, sp, -64
        sw      ra, 28(sp)                      # 4-byte Folded Spill
        sw      s0, 24(sp)                      # 4-byte Folded Spill
        addi    s0, sp, 32
        sw      a7, 28(s0)
        sw      a6, 24(s0)
        sw      a5, 20(s0)
        sw      a4, 16(s0)
        sw      a3, 12(s0)
        sw      a2, 8(s0)
        sw      a1, 4(s0)
        sw      a0, -16(s0)
        addi    a0, s0, 4
        sw      a0, -20(s0)
        li      a0, -1
        sw      a0, -24(s0)
        j       .LBB0_1
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        lw      a0, -16(s0)
        lbu     a0, 0(a0)
        beqz    a0, .LBB0_8
        j       .LBB0_2
.LBB0_2:                                #   in Loop: Header=BB0_1 Depth=1
        lw      a0, -24(s0)
        addi    a0, a0, 1
        sw      a0, -24(s0)
        lw      a0, -16(s0)
        lbu     a0, 0(a0)
        li      a1, 37
        bne     a0, a1, .LBB0_6
        j       .LBB0_3
.LBB0_3:                                #   in Loop: Header=BB0_1 Depth=1
        lw      a0, -16(s0)
        addi    a0, a0, 1
        sw      a0, -16(s0)
        lw      a0, -16(s0)
        lbu     a0, 0(a0)
        li      a1, 100
        bne     a0, a1, .LBB0_5
        j       .LBB0_4
.LBB0_4:                                #   in Loop: Header=BB0_1 Depth=1
        lw      a0, -20(s0)
        addi    a1, a0, 4
        sw      a1, -20(s0)
        lw      a0, 0(a0)
        sw      a0, -28(s0)
        lw      a0, -28(s0)
        lui     a1, %hi(Buffer)
        lw      a1, %lo(Buffer)(a1)
        lw      a2, -24(s0)
        slli    a2, a2, 2
        add     a1, a1, a2
        sw      a0, 0(a1)
        j       .LBB0_5
.LBB0_5:                                #   in Loop: Header=BB0_1 Depth=1
        lw      a0, -16(s0)
        addi    a0, a0, 1
        sw      a0, -16(s0)
        j       .LBB0_7
.LBB0_6:                                #   in Loop: Header=BB0_1 Depth=1
        lw      a0, -16(s0)
        lbu     a0, 0(a0)
        lui     a1, %hi(Buffer)
        lw      a1, %lo(Buffer)(a1)
        lw      a2, -24(s0)
        slli    a2, a2, 2
        add     a1, a1, a2
        sw      a0, 0(a1)
        lw      a0, -16(s0)
        addi    a0, a0, 1
        sw      a0, -16(s0)
        j       .LBB0_7
.LBB0_7:                                #   in Loop: Header=BB0_1 Depth=1
        j       .LBB0_1
.LBB0_8:
        lw      a0, -12(s0)
        lw      ra, 28(sp)                      # 4-byte Folded Reload
        lw      s0, 24(sp)                      # 4-byte Folded Reload
        addi    sp, sp, 64
        ret
                                        # -- End function
Buffer:
        .word   2684502016

pos:
        .word   4294967295                      # 0xffffffff

                                        # DW_AT_external
                                        # DW_AT_external
                                        # DW_AT_prototyped
                                        # DW_AT_external

这里的sw a0, -20(s0)是lower vastart操作，就是把可变参数的首地址存在栈上。而这个地址是跟栈帧(s0)相关的，在ventus的架构中，VLW/VSW的操作是由硬件来完成地址偏移计算的，所以按照标准的RISCV32标准来生成代码，如此类似的指令执行起来是有问题的，见这个PR: 649f4da

[CTS] Basic测试下的global_work_offsets测试

llvm libclc compiler-rt precision issue

目前ventus libclc里面的函数实现借鉴了llvm仓库的两个子文件夹

libclc(builtin以及workitem函数)
compiler-rt(因为目前承影架构不支持64位，所以采用了软浮点的方式支持了64位)

[libclc] add_sat针对char, uchar这两种数据类型计算错误

这个测试是跑CTS，basic, integer_ops子测试下的add_sat测试发新的问题

函数调用的时候generic_addressing_space自动推导的问题

OpenCL 2.0支持generic address space，所以在定义一个函数的时候可以不指定指针的地址空间，那么在调用该函数的时候可以传入任何地址空间的指针，所以函数内部不知道怎么翻译有关该指针的访存指令。可以看下面这个例子：

// RUN: clang -cl-std=CL2.0 -target riscv32 -mcpu=ventus-gpgpu -S %s
int test2(private int *a) {
    return *a + 1;
}

int test3(int *a) {
    return *a + 1;
}

int test1() {
    private int a[2];
    return test2(a) + test3(a);
}

这里test3和test2翻成了不同的访存指令。下面是汇编：

test2:
	addi	sp, sp, 4
	sw	ra, -4(sp)
	vlw.v	v0, 0(v0)
	vadd.vi	v0, v0, 1
	lw	ra, -4(sp)
	addi	sp, sp, -4
	ret
.Lfunc_end0:
	.size	test2, .Lfunc_end0-test2

	.globl	test3
	.p2align	2
	.type	test3,@function
test3:
	addi	sp, sp, 4
	sw	ra, -4(sp)
	vlw12.v	v0, 0(v0)
	vadd.vi	v0, v0, 1
	lw	ra, -4(sp)
	addi	sp, sp, -4
	ret
.Lfunc_end1:
	.size	test3, .Lfunc_end1-test3

	.globl	test1
	.p2align	2
	.type	test1,@function
test1:
	addi	sp, sp, 4
	sw	ra, -4(sp)
	lw	ra, -4(sp)
	addi	sp, sp, -4
	ret

[CTS] Basic测试下的In2fp测试

rodinia hybridsort测例atomicadd指令的问题

rodinia hybridsort测例中的kernel histogram1024调用了opencl的atomic_add函数，执行后结果输出全0。反汇编出的指令代码为amoadd.w.aqrl v0, v1, (v0)

在去掉vsub12.v指令后kernel_call_kernel_function fail

在去掉vsub12.v指令后kernel_call_kernel_function fail，有vsub12.v时case pass。

VGPR callee saved registers save and restore

函数调用的时候需要保存callee saved register的信息

No need to distinguish GPRF32 and GPR

ADJUSTSTACDKUP/ADJUSTSTACDKDOWN消除的问题

源码

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void test_fn(__global double16 *x, __global double16 *y, __global double16 *dst)
{

    dst[0] = min(x[0], y[0]);
}

[BUG] Remove all the illegal vmv instructions

Including belows:

Illegal bitcast from VGPR to SGPR
Illegal bitcast from vGPR to GPRF32

栈的空间调整的问题

在ventus的架构中，其实是有两个栈空间的，目前我们没有完全分开这两种栈空间的计算，或者说有些计算错误目前也未知

Discussion about test

目前的测试其实包括了很多模块，但是主要还是针对pocl和spike的功能测试，当前的测试比较分散，没有加入CTS，只有pocl和rodinia的部分测试，个人觉得不是很完善，应该把CTS通过的部分也加入进去，以下想法:

1 : 创建一个子仓库，里面存所有待测试的可执行文件，文件层级按目前的仓库来划分。类似

pocl
	vecadd
	matadd
gpurodia
	nn
	gaussian
CTS
	testconformance
		basic
				if
				kernel_call_kernel
		compiler
		...

每次有通过的新用例可以往这个子仓库push,然后写一个脚本来跑测试，以后每次PR之前，都做一个这个测试，保证更改之后之前的测试都能过，这是一个大致的指导思路

当然总体思路是要保证能测试到目前所有通过的用例，然后后面每次修改都要保证通过前面所有的测试，而且前面每一次测试的dump和log文件最好能够保留，这样后面PR用例失败的时候，有新旧log比对

[CTS] Missing definition of the renamed function async_work_group_copy

子用例：

async_copy_global_to_local
async_copy_local_to_global

报错信息：

测试char3时，链接libclc缺少async_work_group_copy 重命名函数的定义

The most common problem when running test

1 : load/store instructions

*  wrong function return address 
*  wrong load/store instructions
	* load from private stack, but use vlw12 ins
	* load from lds, but use vlw ins

* Wrong function return address	
	* 这种情况最有可能出现在llvm commit之后导致libclc库的函数实现出现问题
	*  crt0.S文件里面也有一小部分关于ra寄存器的spill/restore

回退vsub12.v指令提交后回归local_arg_def case出错

在回退提交0e5eef6 ，vlw12.v会从错误的内存地址上load数据，具体的log如下：
log_arg_def.dump.txt
local_arg_def.log

非法的指令生成(软件生成硬件不支持的指令)

目前硬件不支持FLW/FSW指令，所以编译器不应该产生zfinx扩展不支持的指令

[ventus] Build an important temporary development branch

[CTS] parameter types简单参数的对齐

fails to compile vector_dot_product testcase

my environment is ubuntu 20.04, and I compile example vecadd.cl successfully, but for this testcase it fails.
command is:
/opt/llvm-project/install/bin/clang -cl-std=CL2.0 -target riscv32 -mcpu=ventus-gpgpu -O1 -S example1.cl -o example1.s
source file example1.cl is:

#define USE_VECTOR_DATATYPES

__kernel void dot_product (__global const float4 *a,
             __global const float4 *b, __global float *c)
{
  int gid = get_global_id(0);

#ifndef USE_VECTOR_DATATYPES
  /* This version is to smoke test the autovectorization.
     Tries to create parallel regions with nice memory
     access pattern etc. so it gets autovectorizer. */
  /* This parallel region does not vectorize with the
     loop vectorizer because it accesses vector datatypes.
     Perhaps with SLP/BB vectorizer.*/

  float ax = a[gid].x;
  float ay = a[gid].y;
  float az = a[gid].z;
  float aw = a[gid].w;

  float bx = b[gid].x,
      by = b[gid].y,
      bz = b[gid].z,
      bw = b[gid].w;

  barrier(CLK_LOCAL_MEM_FENCE);

  /* This parallel region should vectorize. */
  c[gid] = ax * bx;
  c[gid] += ay * by;
  c[gid] += az * bz;
  c[gid] += aw * bw;

#else
  float4 prod = a[gid] * b[gid];
  c[gid] = prod.x + prod.y + prod.z + prod.w;
#endif

}

the report is:

clang: /opt/llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp:879: bool llvm::RISCVDAGToDAGISel::SelectAddrRegReg(llvm::SDValue, llvm::SDValue&, llvm::SDValue&): Assertion `0 && "TODO"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /opt/llvm-project/install/bin/clang -cl-std=CL2.0 -target riscv32 -mcpu=ventus-gpgpu -O1 -S example1.cl -o example1.s
1.      <eof> parser at end of file
2.      Code generation
3.      Running pass 'Function Pass Manager' on module 'example1.cl'.
4.      Running pass 'RISCV DAG->DAG Pattern Instruction Selection' on function '@dot_product'
 #0 0x00007f81ee6e3298 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /opt/llvm-project/llvm/lib/Support/Unix/Signals.inc:567:22
 #1 0x00007f81ee6e3353 PrintStackTraceSignalHandler(void*) /opt/llvm-project/llvm/lib/Support/Unix/Signals.inc:641:1
 #2 0x00007f81ee6e0f5e llvm::sys::RunSignalHandlers() /opt/llvm-project/llvm/lib/Support/Signals.cpp:104:20
 #3 0x00007f81ee6e2ab9 llvm::sys::CleanupOnSignal(unsigned long) /opt/llvm-project/llvm/lib/Support/Unix/Signals.inc:366:31
 #4 0x00007f81ee52a87b (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) /opt/llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:77:5
 #5 0x00007f81ee52adf6 CrashRecoverySignalHandler(int) /opt/llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:398:1
 #6 0x00007f81edea3090 (/lib/x86_64-linux-gnu/libc.so.6+0x43090)
 #7 0x00007f81edea300b raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b)
 #8 0x00007f81ede82859 abort (/lib/x86_64-linux-gnu/libc.so.6+0x22859)
 #9 0x00007f81ede82729 (/lib/x86_64-linux-gnu/libc.so.6+0x22729)
#10 0x00007f81ede93fd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6)
#11 0x00007f81f9313dfa llvm::RISCVDAGToDAGISel::SelectAddrRegReg(llvm::SDValue, llvm::SDValue&, llvm::SDValue&) /opt/llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp:886:21
#12 0x00007f81f931e2c3 llvm::RISCVDAGToDAGISel::CheckComplexPattern(llvm::SDNode*, llvm::SDNode*, llvm::SDValue, unsigned int, llvm::SmallVectorImpl<std::pair<llvm::SDValue, llvm::SDNode*>>&) /opt/llvm-project/build/lib/Target/RISCV/RISCVGenDAGISel.inc:17100:78
#13 0x00007f81eca68299 llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int) /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:3119:31
#14 0x00007f81f931974e llvm::RISCVDAGToDAGISel::SelectCode(llvm::SDNode*) /opt/llvm-project/build/lib/Target/RISCV/RISCVGenDAGISel.inc:16112:1
#15 0x00007f81f9312556 llvm::RISCVDAGToDAGISel::Select(llvm::SDNode*) /opt/llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.cpp:614:13
#16 0x00007f81eca5dac9 llvm::SelectionDAGISel::DoInstructionSelection() /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:1105:32
#17 0x00007f81eca5cd34 llvm::SelectionDAGISel::CodeGenAndEmitDAG() /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:935:61
#18 0x00007f81eca5b3f7 llvm::SelectionDAGISel::SelectBasicBlock(llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void>, false, true>, bool&) /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:689:1
#19 0x00007f81eca6031a llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:1604:33
#20 0x00007f81eca59e9b llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) /opt/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:469:7
#21 0x00007f81f93195a0 llvm::RISCVDAGToDAGISel::runOnMachineFunction(llvm::MachineFunction&) /opt/llvm-project/llvm/lib/Target/RISCV/RISCVISelDAGToDAG.h:40:3
#22 0x00007f81f650ed0a llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /opt/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:91:33
#23 0x00007f81ef98eb40 llvm::FPPassManager::runOnFunction(llvm::Function&) /opt/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1430:20
#24 0x00007f81ef98ee09 llvm::FPPassManager::runOnModule(llvm::Module&) /opt/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1476:13
#25 0x00007f81ef98f27b (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /opt/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1545:20
#26 0x00007f81ef98a19c llvm::legacy::PassManagerImpl::run(llvm::Module&) /opt/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:535:13
#27 0x00007f81ef98fb51 llvm::legacy::PassManager::run(llvm::Module&) /opt/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1673:1
#28 0x00007f81f7674458 (anonymous namespace)::EmitAssemblyHelper::RunCodegenPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile>>&) /opt/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1093:51
#29 0x00007f81f7674660 (anonymous namespace)::EmitAssemblyHelper::EmitAssembly(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) /opt/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1118:17
#30 0x00007f81f76756e6 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>) /opt/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1274:25
#31 0x00007f81f7e5116d clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /opt/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:381:24
#32 0x00007f81e77dd7d7 clang::ParseAST(clang::Sema&, bool, bool) /opt/llvm-project/clang/lib/Parse/ParseAST.cpp:203:14
#33 0x00007f81f4376a6f clang::ASTFrontendAction::ExecuteAction() /opt/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1162:11
#34 0x00007f81f7e4ae58 clang::CodeGenAction::ExecuteAction() /opt/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:1170:5
#35 0x00007f81f437632e clang::FrontendAction::Execute() /opt/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1059:38
#36 0x00007f81f4288109 clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /opt/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:1045:42
#37 0x00007f81f9164966 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /opt/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:264:38
#38 0x0000561f27804aa1 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /opt/llvm-project/clang/tools/driver/cc1_main.cpp:251:40
#39 0x0000561f277f2bca ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) /opt/llvm-project/clang/tools/driver/driver.cpp:319:20
#40 0x00007f81f3bb5f69 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()::operator()() const /opt/llvm-project/clang/lib/Driver/Job.cpp:428:32
#41 0x00007f81f3bb6575 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()>(long) /opt/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:46:40
#42 0x00007f81ee516c18 llvm::function_ref<void ()>::operator()() const /opt/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:62
#43 0x00007f81ee52b008 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) /opt/llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:434:10
#44 0x00007f81f3bb6187 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const /opt/llvm-project/clang/lib/Driver/Job.cpp:428:7
#45 0x00007f81f3b39bf8 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const /opt/llvm-project/clang/lib/Driver/Compilation.cpp:200:22
#46 0x00007f81f3b39f86 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const /opt/llvm-project/clang/lib/Driver/Compilation.cpp:254:62
#47 0x00007f81f3b52acb clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) /opt/llvm-project/clang/lib/Driver/Driver.cpp:1816:28
#48 0x0000561f277f4196 clang_main(int, char**) /opt/llvm-project/clang/tools/driver/driver.cpp:520:39
#49 0x0000561f27826c5e main /opt/llvm-project/build/tools/clang/tools/driver/clang-driver.cpp:11:63
#50 0x00007f81ede84083 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24083)
#51 0x0000561f277f125e _start (/opt/llvm-project/install/bin/clang+0x4f25e)
clang-16: error: clang frontend command failed with exit code 134 (use -v to see invocation)
clang version 16.0.0 (https://github.com/THU-DSP-LAB/llvm-project.git 60a504f481826a3e1fa84aa120786110e8714944)
Target: riscv32
Thread model: posix
InstalledDir: /opt/llvm-project/install/bin
clang-16: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang-16: note: diagnostic msg: /tmp/example1-4fb1fd.cl
clang-16: note: diagnostic msg: /tmp/example1-4fb1fd.sh
clang-16: note: diagnostic msg:

********************

[CTS] Basic下的kernel_limit_constants测试

__kernel void test( __global int *intOut, __global float *floatOut ) 
{
  intOut[0] = isinf( 1.0 ) ? 1 : 0;
  intOut[1] = isnormal( MAXFLOAT ) ? 1 : 0;
  intOut[2] = isnan( MAXFLOAT ) ? 1 : 0;
  intOut[3] = sizeof( MAXFLOAT );
  intOut[4] = ( MAXFLOAT == FLT_MAX ) ? 1 : 0;
  intOut[6] = ( MAXFLOAT == MAXFLOAT ) ? 1 : 0;
  intOut[7] = ( MAXFLOAT == 0x1.fffffep127f ) ? 1 : 0;
  floatOut[0] = MAXFLOAT;
}

目前isinf(MAX_FLOAT)这个函数取得的结果有点问题

800000a8 <loop_constant_kernel>:
800000a8: 13 01 01 01  	addi	sp, sp, 16
800000ac: 23 20 11 00  	sw	ra, 0(sp)
800000b0: 83 22 85 00  	lw	t0, 8(a0)
800000b4: 23 2c 51 fe  	sw	t0, -8(sp)
800000b8: 23 2e a1 fe  	sw	a0, -4(sp)
800000bc: 83 22 05 00  	lw	t0, 0(a0)
800000c0: 23 2a 51 fe  	sw	t0, -12(sp)
800000c4: 57 40 00 5e  	vmv.v.x	v0, zero
800000c8: ef 00 c0 05  	jal	0x80000124 <_Z13get_global_idj>
800000cc: 83 24 81 ff  	lw	s1, -8(sp)
800000d0: b7 32 00 80  	lui	t0, 524291
800000d4: 63 56 90 02  	blez	s1, 0x80000100 <.LBB0_3>
800000d8: 03 23 c1 ff  	lw	t1, -4(sp)
800000dc: 03 23 43 00  	lw	t1, 4(t1)
800000e0: 83 a2 82 00  	lw	t0, 8(t0)
800000e4: 93 03 10 00  	li	t2, 1

800000e8 <.LBB0_2>:
800000e8: 03 24 03 00  	lw	s0, 0(t1)
800000ec: d3 f2 82 00  	fadd.s	t0, t0, s0
800000f0: b3 84 74 40  	sub	s1, s1, t2
800000f4: 13 03 c3 00  	addi	t1, t1, 12
800000f8: e3 98 04 fe  	bnez	s1, 0x800000e8 <.LBB0_2>
800000fc: 6f 00 80 00  	j	0x80000104 <.LBB0_4>

80000100 <.LBB0_3>:
80000100: 83 a2 82 00  	lw	t0, 8(t0)

80000104 <.LBB0_4>:
80000104: 57 30 01 96  	vsll.vi	v0, v0, 2
80000108: 03 23 41 ff  	lw	t1, -12(sp)
8000010c: 57 40 03 02  	vadd.vx	v0, v0, t1
80000110: d7 c0 02 5e  	vmv.v.x	v1, t0
80000114: 7b 60 10 00  	vsw12.v	v1, 0(v0)
80000118: 83 20 01 00  	lw	ra, 0(sp)
8000011c: 13 01 01 ff  	addi	sp, sp, -16
80000120: 67 80 00 00  	ret

这个是测试失败的编译器生成的汇编信息

vecadd.riscv:   file format elf32-littleriscv

Disassembly of section .rodata.ventus.resource:

00000000 <.rodata.ventus.resource>:
       0: 00 00         <unknown>
       2: 00 00         <unknown>
       4: 60 00         <unknown>
       6: c8 01         <unknown>

参数*KnodesD返回给pocl的是4 byte，直接指令生成时按8byte取，造成越界访问。
参数*ansD参数为40(a0), 指令中没有对该参数的读取及访问，造成结果未写出。
在vmsleu.vv指令的逻辑错误可能存在问题，该指令后并未有对offsetD[bid]赋值的指令。造成数据逻辑错误。

支持regexti指令的生成

目前承影架构中所有带立即数的向量指令，都只有5bit，regexti指令是扩展十一位立即数
regexti的立即数字段和下一条指令的立即数字段拼起来形成一个11位立即数，然后对这个数进行有符号扩展

O1参数优化下branch指令生成错误

对于非kernel函数，在优化参数O1下会将局部变量直接存到标量寄存器使用，而不是先存到tp栈上再取出使用，这会导致生成的分支指令为标量指令，导致有些情况下的结果错误，可以看下面这个例子：
源码：

int loop(int x) {
    for (int i = 0; i != 64; i=i+2) {
        if (x - i) return i;
    }
    return 0;
}

-O0：

loop:
        addi    sp, sp, 4
        addi    tp, tp, 12
        regext  zero, zero, 1
        vmv.v.x v32, tp
        sw      ra, -4(sp)
        regext  zero, zero, 8
        vsw.v   v0, -8(v32)
        li      t0, 0
        vmv.v.x v0, t0
        regext  zero, zero, 8
        vsw.v   v0, -12(v32)
        j       .LBB0_1
.LBB0_1:
        regext  zero, zero, 8
        vlw.v   v0, -12(v32)
        li      t0, 64
        vmv.v.x v1, t0
.Lpcrel_hi0:
        auipc   t1, %pcrel_hi(.LBB0_7)
        setrpc  zero, t1, %pcrel_lo(.Lpcrel_hi0)
        vbeq    v0, v1, .LBB0_6
        j       .LBB0_2
.LBB0_2:
        regext  zero, zero, 8
        vlw.v   v0, -8(v32)
        regext  zero, zero, 8
        vlw.v   v1, -12(v32)
        vsub.vv v0, v0, v1
        li      t0, 0
        vmv.v.x v1, t0
.Lpcrel_hi1:
        auipc   t1, %pcrel_hi(.LBB0_7)
        setrpc  zero, t1, %pcrel_lo(.Lpcrel_hi1)
        vbeq    v0, v1, .LBB0_4
        j       .LBB0_3
.LBB0_3:
        regext  zero, zero, 8
        vlw.v   v0, -12(v32)
        regext  zero, zero, 8
        vsw.v   v0, -4(v32)
        j       .LBB0_7
.LBB0_4:
        j       .LBB0_5
.LBB0_5:
        regext  zero, zero, 8
        vlw.v   v0, -12(v32)
        vadd.vi v0, v0, 2
        regext  zero, zero, 8
        vsw.v   v0, -12(v32)
        j       .LBB0_1
.LBB0_6:
        li      t0, 0
        vmv.v.x v0, t0
        regext  zero, zero, 8
        vsw.v   v0, -4(v32)
        j       .LBB0_7
.LBB0_7:
        join    zero, zero, 0
        regext  zero, zero, 8
        vlw.v   v0, -4(v32)
        lw      ra, -4(sp)
        addi    sp, sp, -4
        addi    tp, tp, -12
        regext  zero, zero, 1
        vmv.v.x v32, tp
        ret

-O1：

loop:
        li      t0, 0
        li      t2, 64
.LBB0_1:
        vmv.v.x v1, t0
.Lpcrel_hi0:
        auipc   t1, %pcrel_hi(.LBB0_5)
        setrpc  zero, t1, %pcrel_lo(.Lpcrel_hi0)
        vbne    v0, v1, .LBB0_4
        addi    t0, t0, 2
        bne     t0, t2, .LBB0_1
        vmv.v.x v0, zero
        j       .LBB0_5
.LBB0_4:
        vmv.v.x v0, t0
        j       .LBB0_5
.LBB0_5:
        join    zero, zero, 0
        ret

可以发现在O1优化下返回值有问题。

函数返回值过大的问题

当前函数返回值最多支持用16个寄存器（V0-V15）来接收，所以当返回值大小超过4x16个字节时（比如double16类型）就会出现问题。
参考下面这个用例：

double16 func(double16 x, double16 y)
{
    return x + y;
}

子用例：

progvar_prog_scope_init
progvar_prog_scope_uninit
progvar_prog_scope_misc

报错信息：

[CTS] Basic测试下的parameter_types测试

kernel void test_kernel(
char c, uchar uc, short s, ushort us, int i, uint ui, float f,
global float *result)
{

result[0] =  (c);
result[1] =  (uc);
result[2] =  (s);
result[3] =  (us);
result[4] =  (i);
result[5] =  (ui);
result[6] = f;
}