I'm raising this issue to propose bfloat16 support in dlpack.
Bfloat16 is a popular 16-bit floating-point format for machine learning, supported by a range of hardware, e.g. TPUs. Compared to fp16, bfloat16 has a greater dynamic range, so it's useful for values like gradients that can fall outside the dynamic range of fp16. Compared to fp32, bfloat16 halves the size of data in memory and allows larger models to fit in the same amount of memory. Given these advantages, supporting bfloat16 has become a trend across frameworks: TensorFlow already supports the bfloat16 data type, and we are now adding bfloat16 support in MXNet.
Dlpack is an open in-memory tensor structure for sharing tensors among deep learning frameworks. Supporting bfloat16 would make dlpack more flexible and better integrated for data sharing between frameworks.
Current status of dlpack and bfloat16 support in frameworks:
1. Pytorch:
PyTorch has two interfaces for converting data from/to the dlpack format: tsor = torch.utils.dlpack.from_dlpack(dl) converts a dlpack-defined tensor to a PyTorch-defined tensor, and dl = torch.utils.dlpack.to_dlpack(tsor) converts a PyTorch-defined tensor to a dlpack-defined tensor. When using the to_dlpack function, getDLDataType checks which data types are enabled for data sharing through dlpack:
DLDataType getDLDataType(const Tensor& t) {
  DLDataType dtype;
  dtype.lanes = 1;
  dtype.bits = t.element_size() * 8;
  switch (t.scalar_type()) {
    case ScalarType::Byte:
      dtype.code = DLDataTypeCode::kDLUInt;
      break;
    case ScalarType::Char:
      dtype.code = DLDataTypeCode::kDLInt;
      break;
    case ScalarType::Double:
      dtype.code = DLDataTypeCode::kDLFloat;
      break;
    case ScalarType::Float:
      dtype.code = DLDataTypeCode::kDLFloat;
      break;
    case …
    case ScalarType::BFloat16:
      throw std::logic_error("BFloat16 is not supported by dlpack");
      break;
Since dlpack does not support bfloat16 yet, getDLDataType throws an error when it encounters the bfloat16 data type. Once dlpack supports bfloat16, this code can easily be changed.
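For illustration, here is a minimal sketch of how such a framework-side mapping could look once dlpack defines a kDLBfloat code. The FrameworkType enum and the ToDLDataType name are hypothetical and only stand in for a framework's own scalar-type enum (e.g. PyTorch's ScalarType); the point is just that the bfloat16 branch fills in the new code instead of throwing:

#include <stdexcept>
#include <dlpack/dlpack.h>

// Sketch only: FrameworkType and ToDLDataType are hypothetical names, and
// kDLBfloat is assumed to have been added to dlpack's DLDataTypeCode.
enum class FrameworkType { Float32, Float16, BFloat16 };

DLDataType ToDLDataType(FrameworkType t) {
  DLDataType dtype;
  dtype.lanes = 1;
  switch (t) {
    case FrameworkType::Float32:
      dtype.code = kDLFloat;
      dtype.bits = 32;
      break;
    case FrameworkType::Float16:
      dtype.code = kDLFloat;   // IEEE fp16 keeps the existing float code
      dtype.bits = 16;
      break;
    case FrameworkType::BFloat16:
      dtype.code = kDLBfloat;  // new code proposed below
      dtype.bits = 16;
      break;
    default:
      throw std::logic_error("data type is not supported by dlpack");
  }
  return dtype;
}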
2. MXNet:
Similar to PyTorch, MXNet has arr = mx.nd.from_dlpack(dl), dl = mx.nd.to_dlpack_for_read(arr) and dl = mx.nd.to_dlpack_for_write(arr) for dlpack/MXNet data sharing. Likewise, DTypeTransform checks the data types:
static DLDataType DTypeTransform(int type_flag) {
  switch (type_flag) {
    case mshadow::kFloat32: return DLDataType{kDLFloat, 32, 1};
    case mshadow::kFloat64: return DLDataType{kDLFloat, 64, 1};
    case mshadow::kFloat16: return DLDataType{kDLFloat, 16, 1};
    case mshadow::kBfloat16: return DLDataType{kDLBfloat, 16, 1}; // add this line to support bfloat16
    case ......
  }
}
Adding bfloat16 support in this function lets us use this data type for operator inputs, parameters, and outputs.
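The export path above is only half of the story; the from_dlpack import path needs a matching reverse mapping from DLDataType back to an MXNet type flag. Here is a minimal sketch, assuming kDLBfloat exists in dlpack and mshadow::kBfloat16 exists in MXNet; the function name DLDataTypeToMXNet is illustrative, not MXNet's exact code:

#include <dlpack/dlpack.h>
#include <mshadow/base.h>
#include <dmlc/logging.h>

// Sketch only: illustrative reverse mapping for the from_dlpack path.
static int DLDataTypeToMXNet(DLDataType dtype) {
  CHECK_EQ(dtype.lanes, 1) << "vector types are not supported";
  switch (dtype.code) {
    case kDLFloat:
      if (dtype.bits == 16) return mshadow::kFloat16;
      if (dtype.bits == 32) return mshadow::kFloat32;
      if (dtype.bits == 64) return mshadow::kFloat64;
      break;
    case kDLBfloat:
      if (dtype.bits == 16) return mshadow::kBfloat16;  // the new branch
      break;
  }
  LOG(FATAL) << "unsupported DLDataType";
  return -1;
}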
3. Tensorflow:
TensorFlow doesn't support dlpack yet, but there's a discussion about it (issue). TensorFlow already supports bfloat16.
As discussed above, bfloat16 is well supported in various frameworks, and dlpack is becoming more and more popular, so it would be really great if dlpack had bfloat16 data type support.
Proposal for supporting bfloat16 in dlpack:
Here is a draft proposal for supporting bfloat16 in dlpack. The modification to dlpack is very simple: add one single line to DLDataTypeCode:
typedef enum {
  kDLInt = 0U,
  kDLUInt = 1U,
  kDLFloat = 2U,
  kDLBfloat = 3U, // add this line to support bfloat16
} DLDataTypeCode;
And it's done.
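For completeness, under this proposal a bfloat16 tensor would be described as DLDataType{kDLBfloat, 16, 1}, and a consumer can recognize it with a simple check; a small sketch assuming the enum value above is merged:

#include <dlpack/dlpack.h>

// Sketch only: assumes kDLBfloat has been added to DLDataTypeCode as proposed.
inline bool IsBfloat16(const DLTensor* t) {
  return t->dtype.code == kDLBfloat &&
         t->dtype.bits == 16 &&
         t->dtype.lanes == 1;
}

Keeping bits = 16 and lanes = 1 mirrors how fp16 is already described, so no other change to the DLDataType struct is needed.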
Do you have any ideas? Thank you @soumith @piiswrong @Yangqing @naibaf7 @bhack @edgarriba @tqchen @prigoyal @zdevito @pengzhao-intel @ZhennanQin