
mlu-ops's Introduction

  • For a better development experience, we provide a container image with the complete Cambricon software stack so you can skip environment setup.
  • To obtain the container image, file an ISSUE and leave your contact information.

Overview

MLU-OPS™ provides sample code for developing high-performance operators with the C interface on Cambricon Machine Learning Units (MLUs). The samples are intended as references that developers can use to build custom operators and implement the computations their models need.

MLU-OPS™ provides the following features:

Dependencies

  • Operating system:
    • Ubuntu 20.04, CentOS 7.6, CentOS 8.5, and Kylin 10 on x86_64
    • Ubuntu 18.04 is no longer supported after MLU-OPS™ v1.0.0; Ubuntu 22.04 will be supported in a later release.
  • Cambricon MLU SDK:
    • Building and running depend on CNToolkit v3.12.3 or later and CNNL v1.26.1 or later
  • Cambricon MLU driver:
    • Running depends on driver v5.10.25 or later
  • External libraries:
    • libxml2-dev, libprotobuf-dev, protobuf-compiler, llvm-6.0-dev, libeigen3-dev (>= 3.4)
  • Python environment:
    • Requires Python 3 (3.8.0 by default, 3.6.0 minimum)

Environment setup

  • Get the MLU-OPS™ source

Using Ubuntu 20.04 as an example:

git clone https://github.com/Cambricon/mlu-ops.git
cd mlu-ops
git submodule update --init --recursive
  • Set up CNToolkit and CNNL

    wget https://sdk.cambricon.com/static/Basis/MLU370_X86_ubuntu20.04/cntoolkit_x.x.x-x.ubuntu20.04_amd64.deb
    wget https://sdk.cambricon.com/static/Basis/MLU370_X86_ubuntu20.04/cnnl_x.x.x-x.ubuntu20.04_amd64.deb
    sudo apt-get install ./cntoolkit_x.x.x-x.ubuntu20.04_amd64.deb
    sudo apt-get update
    sudo apt-get install cncc cnas cnbin cndrv cndev cnrt cnrtc cngdb cnperf
    sudo apt-get install ./cnnl_x.x.x-x.ubuntu20.04_amd64.deb
  • Set up Python 3.8.0

    wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz
    tar -xvf Python-3.8.0.tgz
    cd Python-3.8.0
    ./configure
    make -j24 && make install
    
  • Set up the linked libraries

    sudo apt-get update
    sudo apt-get install protobuf-compiler libxml2-dev libprotobuf-dev llvm-6.0-dev libeigen3-dev

Documentation on BANG language basics and development tools

See the latest Developer Documentation.

Directory structure

Directory/File   Description
cmake            Make files related to the build.
core             Shared implementations: common data-type operations, runtime management, logging.
docker           Docker packaging scripts; provides the CI build environment.
docs             Documentation for operator development, testing, and precision acceptance.
kernels          Operator implementations, including unary and binary operator templates for other operators to call.
test             Code for testing the operators.
mlu_op.h         Common data-type definitions and the C interfaces exposed by the operators in the kernels directory.

Building, development, and testing

A BANG C operator development tutorial is provided, covering introductory, intermediate, and advanced topics to help developers get up to speed quickly. See the BANG C Operator Development Guide.

Sample code is provided for developing high-performance operators on Cambricon MLUs and wrapping them in C interfaces. For details on building, developing, and testing MLU-OPS™, see the MLU-OPS™ Operator Build, Development, and Test Guide.

See the documents in the docs directory for more.


mlu-ops's Issues

[CNPlugin] PluginOp test error

[Cambricon hardware model]: MLU270

[Operating system]: Ubuntu 16

[Driver version]: v4.9.2

[Error output]:

CNML: 7.10.2 ba20487

CNRT: 4.10.1 a884a9a

2022-06-30 09:10:37.015295: [cnmlError] cnml_base_op_ptr is nullptr

2022-06-30 09:10:37.015309: [cnmlError] cnml_base_op_ptr is nullptr

[Link to failing code]:

// Create op_ptr
cnmlBaseOp_t op;
cnmlCreatePluginConv3dOp(&op, param, cnml_input_tensor, cnml_filter_tensor,
                         cnml_bias_tensor, cnml_output_tensor);
// Set op layout
cnmlSetOperationComputingLayout(op, CNML_NDHWC);
// Compile op
cnmlCompileBaseOp(op, CNML_MLU270, 4);

Full code:

int main() {
  // Yolov3DetectionOutputOp performs the computation described in the yolo- proposed by
  // like cnrtMalloc.
  cnmlInit(0);
  unsigned dev_num;
  CNRT_CHECK(cnrtGetDeviceCount(&dev_num));
  if (dev_num == 0)
    return CNRT_RET_ERR_NODEV;
  cnrtDev_t dev;
  CNRT_CHECK(cnrtGetDeviceHandle(&dev, 0));
  CNRT_CHECK(cnrtSetCurrentDevice(dev));

  // Prepare params & data for yolov3 detection op
  Conv3dOpParam p1;
  p1.input_n = 1;
  p1.input_d = 16;
  p1.input_h = 300;
  p1.input_w = 256;
  p1.input_c = 16;
  p1.output_d = 16;
  p1.output_h = 300;
  p1.output_w = 256;
  p1.output_c = 128;
  p1.kd = 3;
  p1.kh = 3;
  p1.kw = 3;
  p1.stride_d = 1;
  p1.stride_h = 1;
  p1.stride_w = 1;
  p1.dilation_d = 1;
  p1.dilation_h = 1;
  p1.dilation_w = 1;
  p1.pad_d_back = 0;
  p1.pad_d_front = 0;
  p1.pad_h_back = 0;
  p1.pad_h_front = 0;
  p1.pad_w_back = 0;
  p1.pad_w_front = 0;

  cnmlCoreVersion_t core_version = CNML_MLU270;
  cnmlDataType_t data_type = CNML_DATA_INT8;
  cnrtDataType_t cast_type = CNRT_INT8;
  int data_width = 1;

  // Create params for yolov3 detection op.
  cnmlPluginConv3dOpParam_t param;
  cnmlCreatePluginConv3dOpParam(
      &param, p1.input_n, p1.input_d, p1.input_h, p1.input_w, p1.input_c,
      p1.output_d, p1.output_h, p1.output_w, p1.output_c, p1.kd, p1.kh, p1.kw,
      p1.stride_d, p1.stride_h, p1.stride_w, p1.dilation_d, p1.dilation_h,
      p1.dilation_w, p1.pad_d_front, p1.pad_d_back, p1.pad_h_front,
      p1.pad_h_back, p1.pad_w_front, p1.pad_w_back, 1, false, 0xFFFF, 1.0,
      0xFFFF, 1.0, data_type, core_version);

  cnmlTensor_t *cnml_input_tensor = (cnmlTensor_t *)malloc(sizeof(cnmlTensor_t));
  cnmlTensor_t *cnml_filter_tensor = (cnmlTensor_t *)malloc(sizeof(cnmlTensor_t));
  cnmlTensor_t *cnml_bias_tensor = (cnmlTensor_t *)malloc(sizeof(cnmlTensor_t));
  cnmlTensor_t *cnml_output_tensor = (cnmlTensor_t *)malloc(sizeof(cnmlTensor_t));
  cnmlCreateTensor_V2(&cnml_input_tensor[0], CNML_TENSOR);
  cnmlCreateTensor_V2(&cnml_filter_tensor[0], CNML_TENSOR);
  cnmlCreateTensor_V2(&cnml_bias_tensor[0], CNML_TENSOR);
  cnmlCreateTensor_V2(&cnml_output_tensor[0], CNML_TENSOR);

  // Create op_ptr
  cnmlBaseOp_t op;
  cnmlCreatePluginConv3dOp(&op, param, cnml_input_tensor, cnml_filter_tensor,
                           cnml_bias_tensor, cnml_output_tensor);
  // Set op layout
  cnmlSetOperationComputingLayout(op, CNML_NDHWC);
  // Compile op
  cnmlCompileBaseOp(op, CNML_MLU270, 4);
}

How can performance be optimized when multiple cores do IO at the same time?

To study the speedup from using more cores, I fixed each core's workload at 2^24 * 3 float values. The runtimes (ms) for different core counts are in the screenshot below.
[screenshot of the timing table omitted]
The three columns are 100 adds, 1 add, and pure IO.
In the pure-IO case, per-core performance drops sharply as the core count grows, while the 100-add case barely degrades. My guess is that the cores' IO competes for GDRAM bandwidth, hurting per-core performance; for the compute-heavy 100-add workload, IO is a small share of the total time, so there is little bandwidth contention.
On MLU290 with all 64 cores busy, IO throughput tops out around 600 GB/s, only about 60% efficiency. I would like to get closer to full bandwidth; are there any optimization techniques for this?
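
For scale, a quick back-of-the-envelope check of the traffic implied by the numbers above (a plain-Python sketch; the 64-core count and the ~600 GB/s figure are taken from the question):

    # Each core moves 2**24 * 3 float32 values; with 64 cores that is the
    # total traffic below, and at the observed ~600 GB/s one pass takes
    # roughly the printed time.
    per_core_bytes = 2**24 * 3 * 4            # 192 MiB per core
    total_bytes = 64 * per_core_bytes         # 12 GiB across 64 cores
    seconds = total_bytes / 600e9             # at ~600 GB/s
    print(f"{total_bytes / 2**30:.0f} GiB, ~{seconds * 1e3:.0f} ms per pass")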

Question about the code logic generated by bangpy auto-pipelining

[screenshot omitted]
The red box at the bottom of the screenshot above shows the BANG C code printed with print(f.get_source()). Inside the for loop the statement order is memcpy_async, memcpy_async, __bang_add, memcpy_async.
As I understand it, for the pipeline to overlap compute with store+load, the order should instead be memcpy_async, memcpy_async, memcpy_async, __bang_add.
The unary template in BANG C writes the async store+load statements first and the compute logic after them, as shown below (a plain-Python schematic of this ordering follows the screenshot):
[screenshot omitted]
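
For reference, a pure-Python schematic of the three-stage ordering being discussed (load, compute, and store are placeholder callbacks; this mirrors the double-buffering pattern used in a later issue on this page):

    def pipeline(n, load, compute, store):
        """In steady state, iteration i overlaps store(i), compute(i+1), load(i+2)."""
        load(0)
        compute(0)
        load(1)
        for i in range(n - 2):
            store(i)          # write back block i...
            compute(i + 1)    # ...while computing block i+1...
            load(i + 2)       # ...and fetching block i+2
        store(n - 2)
        compute(n - 1)
        store(n - 1)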

Question about copying data from SRAM to GDRAM

Problem description:
When copying data from SRAM to GDRAM, printing the destination afterwards shows the data was never copied.

Expected behavior:

with self.tcp.block("data_copy"):
    self.tcp.memcpy(buffer_out_s[begin:end], buffer_io_n)
    with self.tcp.if_scope(j == 2):
        st.assign(start - 2 * data_each_time)
        # with self.tcp.if_scope(tcp.all(task_id == 0, i == 2)):
        #     self.tcp.print(buffer_out_s[begin:begin + data_each_time])
        self.tcp.memcpy(buffer_out[st:stop], buffer_out_s[begin - 2 * data_each_time:end])
        self.tcp.sync_cluster()
        # with self.tcp.if_scope(tcp.all(task_id == 0, i == 2)):
        #     self.tcp.print(buffer_out[st:st + data_each_time])

buffer_out is expected to hold the same values as buffer_out_s.

Actual behavior:
buffer_out is initialized to all zeros: data_out_dev = bangpy.Array(np.zeros(data_out.shape, dtype.as_numpy_dtype), dev)
After the copy the output is still all zeros, i.e. the data was never copied in.

Environment:
Code branch: branch link
Bug reproduction steps: go to lines 184-193 of bangpy-ops/ops/hard_sigmoid/hard_sigmoid.py, adjust the print statements accordingly, then run python3 mytest.py
bangpy version: 1.3.1
cncc: v3.6.1

That is roughly the situation; I am not sure what I have misunderstood. Thanks!

Problem creating an NRAM Buffer in BangPy

I have run into this before, and it was confirmed then to be a problem inside the shape handling.
Now I am hitting it again, do not understand why, and would like a solution.

    with self.bp.if_scope(dim == 0):
        dim_h = 1
        dim_m = self.dim_0
        dim_l = self.dim_1 * self.dim_2 * self.dim_3
        buffer_out = self.bp.Buffer(
            shape=(dim_h, 1, dim_l), name="OUTPUT", dtype=self.dtype, scope="global"
        )

This compiles. But creating the NRAM buffer right after fails:

    buffer_out_n = self.bp.Buffer(
        shape=(dim_h, 1, dim_l), name="OUTN", dtype=self.dtype, scope="nram",
    )

The error is as follows:
Build all operators...
Traceback (most recent call last):
File "/home/pan/mlu-ops/bangpy-ops/utils/build_and_test_all_operators.py", line 145, in
main()
File "/home/pan/mlu-ops/bangpy-ops/utils/build_and_test_all_operators.py", line 127, in main
build_all_op()
File "/home/pan/mlu-ops/bangpy-ops/utils/build_and_test_all_operators.py", line 65, in build_all_op
obj(None, None)
File "/usr/local/lib/python3.8/site-packages/bangpy/tcp/tcp.py", line 529, in wrapper
mod = f(type_, target_)
File "/home/pan/mlu-ops/bangpy-ops/ops/cosine_similarity/cosine_similarity.py", line 487, in build_cosine_similarity
f = Cosine_similarity(dtype, target, task_num, stage).compute_body()
File "/home/pan/mlu-ops/bangpy-ops/ops/cosine_similarity/cosine_similarity.py", line 106, in compute_body
buffer_out_n = self.bp.Buffer(
File "/usr/local/lib/python3.8/site-packages/bangpy/tcp/tcp.py", line 178, in Buffer
Check.nocheck("Cannot handle this data type", TypeError)
File "/usr/local/lib/python3.8/site-packages/bangpy/tools/check.py", line 47, in nocheck
print_error_msg(error_message, exception_type)
File "/usr/local/lib/python3.8/site-packages/bangpy/tools/traceback.py", line 131, in print_error_msg
raise exception_type(debug_info)
ValueError: [ET000] Cannot handle this data type
62 print("======================")
63 print("Build all operators...")
64 for obj in build_entrys:
65 --> obj(None, None)
Last time the explanation was a control-flow issue: no memcpy following the if_scope. This time every branch is followed by a memcpy, yet it still fails. I would like to know the cause and how to fix it.

About strided memcpy

Due to data-alignment requirements, we sometimes need to treat data originally shaped (N, 31) as (N, 32), so we use a copy like this:

input_g_buffer = tcp.Buffer(shape = (data_N, 31))
tcp.memcpy(input_n_buffer.reshape((N, 32))[:N, :31], input_g_buffer[base : base + N])

In this case the copy is non-contiguous and IO efficiency drops noticeably. But when I use the same sliced style to compute on already-aligned data:

input_g_buffer = tcp.Buffer(shape = (data_N, 32))
tcp.memcpy(input_n_buffer.reshape((N, 32))[:N, :32], input_g_buffer[base : base + N])

efficiency also drops significantly. Without the sliced form, IO efficiency returns to normal.
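
For reference, a NumPy rendering of the (N, 31) -> (N, 32) padding copy described above (a sketch only; N is an example value, and this pins down the intended semantics rather than the on-device behavior):

    import numpy as np

    N = 8
    src = np.random.rand(N, 31).astype(np.float32)
    dst = np.zeros((N, 32), dtype=np.float32)
    # Each row lands at a different offset in dst than in src, so the
    # copy cannot be a single contiguous transfer.
    dst[:, :31] = src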

poly_nms operator initialization question

constexpr uint32_t BIT_FLOAT_NEG_1 = 0x80000000;

Computing the IOU area in the poly_nms operator involves this constant: constexpr uint32_t BIT_FLOAT_NEG_1 = 0x80000000;
I want to run the algorithm on MLU100 with these changes: float => half, uint32_t => uint16_t.
When the detection boxes are concave quadrilaterals the results are correct; when they are convex polygons the results are wrong, because BIT_FLOAT_NEG_1 is initialized incorrectly.
How should BIT_FLOAT_NEG_1 be initialized on an MLU100 device?
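
For reference, a quick NumPy check of the bit patterns involved (an aside, not the poly_nms implementation): 0x80000000 is the IEEE-754 sign-bit mask for float32, i.e. the bit pattern of -0.0, and its half-precision counterpart is 0x8000. Whether that is the correct value for BIT_FLOAT_NEG_1 depends on how the kernel uses the constant.

    import numpy as np

    # Bit pattern of -0.0 (the sign-bit mask) in each precision.
    print(hex(np.float32(-0.0).view(np.uint32)))   # 0x80000000
    print(hex(np.float16(-0.0).view(np.uint16)))   # 0x8000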

Build error: "CNctxConfigParam" was not declared

After cloning the repository and building with build.sh, compilation fails with undeclared-identifier errors for several CNctx-prefixed types.
test/mlu_op_gtest/src/executor.cpp:1411:3: error: ‘CNctxConfigParam’ was not declared in this scope CNctxConfigParam ctx_conf_param, check_param;
core/context.cpp:105:3: error: ‘CNctxConfigParam’ was not declared in this scope CNctxConfigParam ctx_conf_param;

Also, please list the project's required dependencies in the README: libgtest-dev libxml2-dev libprotobuf-dev protobuf-compiler

Add data types and error codes in the .h file

The following data types and error codes are missing from the .h file.

Missing data types:
mluOpTensorDescriptor_t
mluOpHandle_t

Missing error codes:
MLUOP_STATUS_SUCCESS
MLUOP_STATUS_BAD_PARAM
MLUOP_STATUS_NOT_SUPPORTED

Could you explain Bufferslice usage and how it combines with memcpy in more detail?

[screenshot omitted]
The Guide's coverage of Bufferslice is not very concrete; I do not know how to define a slice with the parameters n, h, w, c.

Specifically, given a one-dimensional Buffer in GDRAM, I want to take K consecutive elements every L elements (starting at element 0: take K, skip L, take K again, and so on), M elements in total. Can Bufferslice store those M elements contiguously into an NRAM buffer of length M with a single memcpy? If so, how should the parameters (n, h, w, c, valid_dim) be set?
With plain memcpy, a single operation is capped at K elements, which is very inefficient, so I would like to know whether Bufferslice can solve this. Thanks.
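
A NumPy sketch of the access pattern being asked about, for reference (K, L, M are example values; this only pins down the intended gather semantics, not the Bufferslice parameters):

    import numpy as np

    K, L, M = 4, 8, 16
    src = np.arange(1024, dtype=np.float32)
    blocks = M // K
    # Block b starts at b * (K + L); take K consecutive elements from each.
    idx = (np.arange(blocks)[:, None] * (K + L) + np.arange(K)).ravel()
    dst = src[idx]                # the M gathered elements, stored contiguously
    assert dst.shape == (M,)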

Warning "IMM for u32 out of range", and an error that appears only with large data

[screenshot of the warning omitted]
When the data is large, does this warning affect correct execution? Does it mean an immediate exceeds the range representable by an unsigned int32?

When the data is large enough to trigger the warning, consider the code below:
[screenshot of the code omitted]
Without the part in the red box, the program runs, but 0.01% of the results are wrong (with slightly smaller data, e.g. 0.5x, everything passes).
With that part included (the last loop iteration, i = 255), it fails immediately with the following error:
[screenshot of the error omitted]
Is this related to the out-of-range immediate?

TVMError when a variable's data length exceeds 1G

When the data length passed to the input parameter exceeds 2**30, this error is raised:
TVMError: Bind have an unmet assertion: (2147483648 == int32(arg0.shape[0])), on argument arg0.shape[0]

Is this a limit on the input data?
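
One observation about the message itself (not a statement of BangPy's intended limits): the failing value 2147483648 = 2**31 is one past the int32 maximum, which is consistent with the shape being bound as an int32 in the assertion.

    import numpy as np

    # 2**31 does not fit in the int32 the assertion casts the shape to.
    assert 2**31 == 2147483648
    assert np.iinfo(np.int32).max == 2**31 - 1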

bangpy 2.1 error: Unknown identifier v_2

While refactoring an operator with bangpy 2.1 I ran into the problem below:
[screenshot omitted]
The first statement after the arch check raises the error; if I comment out the check, the error disappears.
Also, testing with float32 does not error, but float16 does.
It seems that any if statement that can be resolved at compile time triggers this error.
The code: Archive.tar.gz

memcpy errors although the checked index values all look fine

cross1.zip
The code is attached above. With all the memcpys in the pipeline of if branch 2 enabled, running the test case fails as follows:
[screenshot omitted]
Commenting out part of the code and checking whether the buffer indices are valid (see cross_越界.py):
[screenshot omitted]
it still fails, and none of the expected messages are printed.

Commenting out all the memcpys and checking the buffer indices (see cross_无Error.py):
[screenshot omitted]
the test passes, and none of the expected messages are printed.

Moreover, only one particular large test case fails; all slightly smaller cases pass (see test_cross.py).
Could you help locate the problem?

About bangpy I/O efficiency

I find that with one-way data transfer (a large amount copied from GDRAM to NRAM, with negligible data copied out), IO efficiency is noticeably lower than when comparable amounts are copied in and out.
On MLU290, running nothing but copy loops, the one-way case reaches only 490 GB/s, while the two-way case reaches 600-700 GB/s or more.
Is there a way to make fuller use of the IO bandwidth in the one-way case?

bangpy: can time_evaluator ignore occasional abnormal samples?

Problem description

When timing with time_evaluator, if repeat is set fairly large, occasional abnormally large times show up, and in practice these outliers cause large swings in the averaged result. The operator contains pure IO only, so this should be unrelated to compute logic. Could the time_evaluator interface support discarding abnormal samples?

Screenshots

[screenshot omitted]

Environment

[screenshot omitted]
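
As a host-side workaround, the outliers can be trimmed before averaging. A sketch, assuming the result object returned by the evaluator follows TVM's ProfileResult convention and exposes the per-repeat timings via .results (an assumption; only .mean appears in these issues, and the argument names below are placeholders):

    import numpy as np

    evaluator = f.time_evaluator(number=1, repeat=100, min_repeat_ms=0)
    raw = np.array(evaluator(in_dev, out_dev).results)   # per-repeat seconds
    # Drop the slowest/fastest 5% of samples to suppress the occasional
    # huge measurement before taking the mean.
    lo, hi = np.percentile(raw, [5, 95])
    trimmed_mean = raw[(raw >= lo) & (raw <= hi)].mean()
    print("trimmed mean: %fms" % (trimmed_mean * 1e3))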

Questions about the in-place operation documentation

Two questions about the documentation:

  1. The BANG C documentation marks whether each operator supports in-place operation, but I could not find anything about in-place operation in the BangPy documentation.
  2. When both input and output are tensors, BangPy's maximum operator calls __bang_maximum, which the BANG C manual says does not support in-place operation, yet in practice it appears to work in place.

[Bug]/[CoreDump]: a sample code-bug issue submission

Problem description:
Briefly describe the problem.

Expected behavior:
The expected behavior or output, ideally with a code snippet attached.

Actual behavior:
The error or core dump actually produced, ideally with relevant screenshots attached.

Environment:
Code branch: push the failing code to a branch and attach a link to it.
Bug reproduction steps: describe how to reproduce the failure, including the commands to run.
Bangpy version: check with pip show bangpy, ideally attaching a screenshot of the command output.
Cntoolkit toolchain version: cncc --version, ideally attaching a screenshot of the command output.

How do BangPy 1.4.0 Buffer slices combine with memcpy?

Compiling the following statements inside a pipeline fails:
[screenshot omitted]
[screenshot omitted]
Is there a way, inside a pipeline, to flatten a slice of a GDRAM buffer and copy it to a one-dimensional NRAM Buffer?
Since the GDRAM dimensions are unknown before compilation, there is no way to create an NRAM buffer with the same shape as the GDRAM slice.

[bug report] Bangpy tcp script assign produces wrong results

Environment: bangpy 2.0.3
While refactoring an operator with tcp script, I found that with the statement tcp.assign(ex1, 15), if ex1 is a float16 NRAM buffer, the assigned value is wrong (tcp.print(ex1[0]) prints 0 or a number close to 0); replacing 15 with other numbers (e.g. 15.0, 1, 1.0) fails the same way. Switching to tcp.assign(ex1, tcp.cast(15, self.dtype)) gives correct results.

bangpy output is incomplete

Sometimes bangpy's printed output is incomplete, which makes debugging painful. Is there a workaround?
[screenshot omitted]

[Bug Report] SizeVar detection problem

When I run the following code, which performs no computation or memcpy beyond initialization,

"""cosine embedding loss for bangpy tcp"""

import bangpy
from bangpy import tcp
from bangpy.tcp.runtime import TaskType
from bangpy.platform.bang_config import TARGET

DTYPES = [bangpy.float16, bangpy.float32]
TARGET_LIST = ["mlu290"]
KERNEL_NAME = "cosine_embedding_loss"

class CosineEmbeddingLoss(object):
    """Operator description:
    TODO: add detailed operator description
    Sum using sumpool; the sumpool result is fixed to a 128-byte x 128-byte
    square block, and the sumpool kernel size is chosen according to the NRAM size.
    When NRAM cannot hold one full row of data, iterate over the second dimension;
    when it can, iterate over the first dimension.
    """

    def __init__(self, dtype, stage, target, task_type):
        # print(bangpy.version)
        self.dtype = dtype
        self.target = target
        self.stage = stage
        self.task_type = task_type
        self.tcp = tcp.TCP(target)
        self.tcp.launch_cluster(self.task_type.value)
        self.tcp.launch_task(self.task_type.value * 4, 1, 1)
        self.task_num = self.task_type.value * 4
        self.pipeline = 0

        self.data_num_v = self.tcp.SizeVar("data_num_vv")
        self.length = self.tcp.SizeVar("length_v")
        
        self.input_y = self.tcp.Buffer(shape = (self.data_num_v, ), name = "input_y", dtype = dtype, scope = "global")
        self.input_x1 = self.tcp.Buffer(shape = (self.data_num_v, self.length), name = "input_x1", dtype = dtype, scope = "global")
        self.input_x2 = self.tcp.Buffer(shape = (self.data_num_v, self.length), name = "input_x2", dtype = dtype, scope = "global")
        
        self.margin = self.tcp.Var(name = "margin", dtype = bangpy.float32)
        self.output = self.tcp.Buffer(shape = (self.data_num_v, ), name = "output", dtype = dtype, scope = "global")

        self.length_s = self.tcp.Scalar(bangpy.int32, "length_s")
        self.length_s.assign(self.length)
        self.data_num = self.tcp.Scalar(bangpy.int32, "data_num")
        self.data_num.assign(self.data_num_v)

        self.align_size = 128 // self.dtype.bytes
        self.max_buffer_size = 32 * 128 * 15 // self.dtype.bytes # TODO: determine the buffer size upper bound
        self.max_kernel_size = self.tcp.Scalar(value = 32, dtype = bangpy.int32, name = "max_kernel_size")
        self.max_reduced_buffer_size = self.max_buffer_size // 32 # something is wrong here


    def compute_body(self):
        #...
        return self.tcp.BuildBANG(
            inputs = [self.input_x1, self.input_x2, self.input_y, self.margin],
            outputs = [self.output],
            kernel_name = KERNEL_NAME
            )


if __name__ == "__main__":
    import numpy as np
    import random

    def compute_simple_test(x1, x2, y, margin):
        upper = np.sum(np.multiply(x1, x2), axis = 1)
        lower1 = np.sum(np.multiply(x1, x1), axis = 1)
        lower2 = np.sum(np.multiply(x2, x2), axis = 1)
        # print(upper)
        # return (upper / ((lower1 * lower2) ** 0.5)).reshape((-1, ))
        result = (upper / ((lower1 * lower2) ** 0.5)).reshape((-1, ))
        # return result
        return ((y + 1) * (1 - result) + (1 - y) * np.maximum(0, result - margin)) / 2

    def compute_diff1(base, eval):
        return np.sum(np.abs(base - eval)) / np.sum(np.abs(base))
    def compute_diff2(base, eval):
        result = np.sum(np.abs(base - eval) * np.abs(base - eval)) / np.sum(base * base)
        return result ** 0.5
    def cal_diff(result, data_out):
        diff1 = np.sum(np.abs(np.subtract(result, data_out))) / np.sum(result)
        diff2 = np.sqrt(
            np.sum(np.power(np.subtract(data_out, result), 2,))
            / np.sum(np.power(result, 2))
        )
        # assert round(diff1 * 100, 5) < 3e-3 * 100
        # assert round(diff2 * 100, 5) < 3e-3 * 100
        print("DIFF1:", str(round(diff1 * 100, 5)) + "%")
        print("DIFF2:", str(round(diff2 * 100, 5)) + "%")

    dtype = bangpy.float32

    # width = (1024 * 128 * 48) // dtype.bytes
    # height = 1024
    width = 512
    height = 1080
    

    f = CosineEmbeddingLoss(dtype, 1, "mlu290", TaskType.UNION16).compute_body()
    print(f.get_source())

    print("data amount: %fGB" % (width * height * 2 * dtype.bytes / (2 ** 30)))
    data_input_x1 = np.random.rand(width * height).reshape((height, width))
    data_input_x2 = np.random.rand(width * height).reshape((height, width))
    data_input_y = np.random.randint(-1, 1, (height, ))
    margin = random.random()
    data_out = np.zeros((height, ))

    dev = bangpy.device(0)

    data_input_x1_dev = bangpy.Array(data_input_x1.astype(dtype.as_numpy_dtype), dev)
    data_input_x2_dev = bangpy.Array(data_input_x2.astype(dtype.as_numpy_dtype), dev)
    data_input_y_dev = bangpy.Array(data_input_y.astype(dtype.as_numpy_dtype), dev)
    data_out_dev = bangpy.Array(np.zeros(data_out.shape, dtype.as_numpy_dtype), dev)

    data_out = compute_simple_test(data_input_x1, data_input_x2, data_input_y, margin)
    f(data_input_x1_dev, data_input_x2_dev, data_input_y_dev, margin, data_out_dev)
    # bangpy.assert_allclose(data_out_dev.numpy(), data_out.astype(dtype.as_numpy_dtype))
    dev_out = data_out_dev.numpy()
    print(data_out)
    print(dev_out)
    print("diff1 = : " + str(compute_diff1(data_out, dev_out)))
    print("diff2 = : " + str(compute_diff2(data_out, dev_out)))
    cal_diff(dev_out, data_out)


    evaluator = f.time_evaluator(dev, 1, 10)
    time = (evaluator(data_input_x1_dev, data_input_x2_dev, data_input_y_dev, margin, data_out_dev).mean * 1e3)
    print("time cost: %fms" % (time))
    print("IO speed: %fGB/s" % ((width * height * 2 * dtype.bytes + height * dtype.bytes) / time * 1e3 / (2 ** 30)))

it nevertheless errors out with

Check failed: (it != var_idmap_.end()) is false: Find undefined Variable data_num_vv

The variable may have been optimized away, so no corresponding kernel parameter was generated.

tcp.target() cannot be called in the constructor

@eg.module
class PairwiseDistance(object):
    """Operator description:
    Add the data in the two buffers.
    """
    def __init__(self, buffer_size: ty.int32, dtype: ty.string) -> None:
        self.dtype = dtype
        self.single_buffer_size = buffer_size
        self.bp = tcp.target()   # this line is the problem
        # tcp.print("ce shi")

Writing it this way raises:
Exception: module 'bangpy.tcp' has no attribute 'target'

Problem with bangpy's buffer alignment check

[screenshot omitted]
buffer_out_n and the related variables here are float32 NRAM buffers, so buffer_out_n[:32] should be 128 bytes; why does the error message say 64 bytes?

BangPy simulated NRAM double buffering fails

I tried to simulate NRAM double buffering: each loop iteration stores block i, computes block i+1, and loads block i+2.
But I get the error TVMError: CNDrv Error : CN_QUEUE_ERROR_INVALID.
The generated BANG C code is essentially the same as what bangpy's own pipeline generates. While investigating, I found that deleting the (n-1)-th load that remains after the loop, along with the code after it, lets the program run, but then the data is not transferred in or out correctly.
Could you help me find the cause? The generated BANG C code looks fine to me.

from bangpy import tcp
from bangpy.tcp.runtime import TaskType
from bangpy.platform.bang_config import TARGET
import bangpy as bp
import numpy as np

target = "mlu290"
tcps = tcp.TCP(target)

# Core configuration
task_type = TaskType.UNION16
tcps.launch_cluster(task_type.value)
task_num = task_type.value * 4
tcps.launch_task(task_num, 1, 1)

# Data type and element count
dtype = bp.float32
dtype_sz = dtype.bytes
length = 2**20
shape = (length, )

# Hardware-related configuration
compute_row = 32
nram_size = TARGET(target).nram_size

single_buffer_size = 64 * 128 // dtype.bytes * compute_row

task_id = tcps.taskId

# Amount of data processed per iteration
data_calculated_each_time = single_buffer_size // dtype_sz
data_calculated_each_task  = length // task_num
with tcps.if_scope(task_id == task_num - 1):
    data_calculated_each_task = length // task_num + length % task_num

# How many loop iterations each task needs
loop_num = tcps.Scalar(bp.int32, name="loop_num")
loop_num.assign(data_calculated_each_task // data_calculated_each_time)

# Data buffers in global memory
input = tcps.Buffer(shape=shape, name="input", dtype=dtype, scope="global")
target = tcps.Buffer(shape=shape, name="target", dtype=dtype, scope="global")
out = tcps.Buffer(shape=shape, name="output", dtype=dtype, scope="global")
reduction = tcps.SizeVar(name="reduction")


# Variables for block-wise processing
start = tcps.Scalar(bp.int32, name="start")
start.assign(task_id * (length // task_num))
stop = tcps.Scalar(bp.int32, name="stop")

buffer_input = tcps.Buffer(shape=(2*data_calculated_each_time,), name="input_N", dtype=dtype, scope="nram")
buffer_target = tcps.Buffer(shape=(2*data_calculated_each_time,), name="target_N", dtype=dtype, scope="nram")
buffer_out = tcps.Buffer(shape=(2*data_calculated_each_time,), name="OUTPUT_N", dtype=dtype, scope="nram")

def compute(ou, inp, tar, i) :
    tcps.log(inp[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], inp[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], high_precision = False)
    tcps.log(ou[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], tar[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], high_precision = False)    
    tcps.subtract(ou[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], ou[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], inp[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time])
    tcps.multiply(ou[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], tar[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time], ou[(i%2)*data_calculated_each_time : (i%2+1)*data_calculated_each_time])

## iteration 0: gdram -> nram transfer
tcps.memcpy_async(buffer_input[:data_calculated_each_time], input[start : start + data_calculated_each_time])
tcps.memcpy_async(buffer_target[:data_calculated_each_time], target[start : start + data_calculated_each_time])
tcps.sync_ipu()
## iteration 0: compute
compute(buffer_out, buffer_input, buffer_target, 0)

## iteration 1: gdram -> nram transfer
tcps.memcpy_async(buffer_input[data_calculated_each_time : data_calculated_each_time*2], input[start+ data_calculated_each_time : start + data_calculated_each_time*2])
tcps.memcpy_async(buffer_target[data_calculated_each_time : data_calculated_each_time*2], target[start+ data_calculated_each_time : start + data_calculated_each_time*2])


# Pipeline each buffer-sized computation within each task
with tcps.for_range(begin = 0, end = loop_num-2, stage = 0)  as i :
    tcps.sync_ipu()

## iteration i: nram -> gdram transfer
    tcps.memcpy_async(out[start+data_calculated_each_time*i : start+data_calculated_each_time*(i+1)], buffer_out[(i%2) * data_calculated_each_time: (i%2+1) * data_calculated_each_time])

## iteration i+1: compute
    compute(buffer_out, buffer_input, buffer_target, i+1)

## iteration i+2: gdram -> nram transfer
    tcps.memcpy_async(buffer_input[((i+2)%2)*data_calculated_each_time : ((i+2)%2 + 1)*data_calculated_each_time], input[start + (i+2)*data_calculated_each_time : start + (i+3)*data_calculated_each_time])
    tcps.memcpy_async(buffer_target[((i+2)%2)*data_calculated_each_time : ((i+2)%2 + 1)*data_calculated_each_time], target[start + (i+2)*data_calculated_each_time : start + (i+3)*data_calculated_each_time])

tcps.sync_ipu()

## iteration n-1: nram -> gdram transfer
tcps.memcpy_async(out[start+data_calculated_each_time*(loop_num-2) : start+data_calculated_each_time * (loop_num-1)], buffer_out[((loop_num-2)%2) * data_calculated_each_time: ((loop_num-2)%2+1) * data_calculated_each_time])

## iteration n: compute
compute(buffer_out, buffer_input, buffer_target, loop_num-1)

## iteration n: nram -> gdram transfer
tcps.memcpy_async(out[start+data_calculated_each_time*(loop_num-1) : start+data_calculated_each_time * (loop_num)], buffer_out[((loop_num-1)%2) * data_calculated_each_time: ((loop_num-1)%2+1) * data_calculated_each_time])


# Handle each task's tail data
with tcps.if_scope(data_calculated_each_task % data_calculated_each_time != 0):
    start.assign(start + data_calculated_each_time * loop_num)
    stop.assign(start + data_calculated_each_task % data_calculated_each_time)
    #data copy
    tcps.memcpy(buffer_input[:stop-start], input[start:stop])
    tcps.memcpy(buffer_target[:stop-start], target[start:stop])

    tcps.log(buffer_input, buffer_input, high_precision = False)
    tcps.log(buffer_out, buffer_target, high_precision = False)
    tcps.subtract(buffer_out, buffer_out, buffer_input)
    tcps.multiply(buffer_out, buffer_target, buffer_out)

    tcps.memcpy(out[start:stop], buffer_out[:stop-start])  

f = tcps.BuildBANG(inputs=[input, target, reduction], outputs=[out], kernel_name="kidivloss")
print(f.get_source())

######################################################################################

data_input =  np.random.uniform(low=0, high=1, size=(length,))
data_target = np.random.uniform(low=0, high=1, size=(length,)).astype(dtype.as_numpy_dtype)
# data_input = (np.array([0.2]*length)).reshape(shape)
# data_target = (np.array([0.25]*length)).reshape(shape)

# print('data_input : ', data_input)
# print('data_target : ', data_target)

data_out = np.multiply(data_target,np.subtract(np.log(data_target), np.log(data_input)))
print(data_out)


print("-------------result----------------------------------------")
dev = bp.device(0)

data_input_dev = bp.Array(data_input.astype(dtype.as_numpy_dtype), dev)
data_target_dev = bp.Array(data_target.astype(dtype.as_numpy_dtype), dev)
data_out_dev = bp.Array(np.zeros(data_out.shape, dtype.as_numpy_dtype), dev)

reduction = 0
f(data_input_dev, data_target_dev, reduction, data_out_dev)

print(data_out_dev)

evaluator = f.time_evaluator(number=100, repeat=2, min_repeat_ms=0)
print(evaluator)
t = evaluator(data_input_dev, data_target_dev,reduction, data_out_dev).mean * 1e3

print(
    "tutorial : %f ms"
    % t
)


io_efficiency = (length * 3 * dtype_sz) / (2**30) / (t / 1000)
print("io_efficiency : ", io_efficiency)
