elixir-nx / nx
Multi-dimensional arrays (tensors) and numerical definitions for Elixir
Most NumPy aggregate functions have a keepdims option that keeps the reduced dimensions as size 1 so the rank of the tensor stays the same. This is easily implemented with some shape changes in Nx and an additional reshape in EXLA.
It's very common, especially in neural network ops. Numerous examples in here: https://github.com/google/flax/blob/master/flax/core/nn/normalization.py
Implementing something similar in defn would, I believe, involve a transform to calculate the reshape, which would be slightly annoying to have to do every time.
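To make the shape calculation concrete, here is a minimal sketch in plain Elixir of what such a transform might compute. The module and function names are hypothetical and not part of Nx; it only derives the target shape for the final reshape, under the assumption that negative axes count from the end as in NumPy.

```elixir
defmodule KeepDimsSketch do
  # Hypothetical helper, not part of Nx: given the input shape and the axes
  # that were reduced, return the shape with those axes kept as size 1.
  def keep_dims_shape(shape, axes) when is_tuple(shape) and is_list(axes) do
    rank = tuple_size(shape)

    # Normalize negative axes (-1 is the last axis), as NumPy does.
    axes = Enum.map(axes, fn axis -> if axis < 0, do: rank + axis, else: axis end)

    shape
    |> Tuple.to_list()
    |> Enum.with_index()
    |> Enum.map(fn {dim, index} -> if index in axes, do: 1, else: dim end)
    |> List.to_tuple()
  end
end

# Reducing a {2, 3, 4} tensor over axis 1 yields shape {2, 4};
# reshaping that result to the shape computed here restores the rank.
KeepDimsSketch.keep_dims_shape({2, 3, 4}, [1])
```

With a helper like this, a keepdims-style option would amount to one extra reshape of the reduced result.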
Today Nx operations fail if they find a NaN and/or Infinity (although defn behaviour will be compiler independent). Do we need to implement handling of NaN and infinity within Nx? What are the use cases?
Grouped Convolutions are common enough that we should support them. Here's some discussion and papers on grouped convolutions: https://paperswithcode.com/method/grouped-convolution
XLA supports grouped convolutions with its feature_group_count option (right now we always set it to 1). We will just need to add support for groups in the binary implementation of Nx.conv. This involves an additional iteration in the binary implementation as well as some additional shape checks.
XLA also supports batch_group_count for grouping parts of the batch. I see less value in adding this option as I haven't been able to find any applications that are all that common.
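As a rough illustration of the "additional shape checks" and the extra iteration over groups, here is a sketch in plain Elixir. Everything in it (module name, argument layout, error atoms) is an assumption for illustration only; it follows the usual grouped-convolution rule that both channel counts are divisible by the number of groups and that the kernel sees only the input channels of its own group.

```elixir
defmodule GroupedConvSketch do
  # Hypothetical shape check for a grouped convolution (spatial dims omitted).
  # input_channels: number of channels in the input
  # {output_channels, kernel_input_channels}: channel dims of the kernel
  # groups: number of feature groups (feature_group_count in XLA terms)
  def validate(input_channels, {output_channels, kernel_input_channels}, groups)
      when is_integer(groups) and groups > 0 do
    cond do
      rem(input_channels, groups) != 0 ->
        {:error, :input_channels_not_divisible}

      rem(output_channels, groups) != 0 ->
        {:error, :output_channels_not_divisible}

      kernel_input_channels != div(input_channels, groups) ->
        {:error, :kernel_channels_mismatch}

      true ->
        {:ok, channel_ranges(input_channels, groups)}
    end
  end

  # Each group convolves its own contiguous slice of input channels;
  # a binary implementation would iterate over these ranges.
  defp channel_ranges(input_channels, groups) do
    per_group = div(input_channels, groups)
    for group <- 0..(groups - 1), do: (group * per_group)..((group + 1) * per_group - 1)
  end
end

GroupedConvSketch.validate(4, {6, 2}, 2)
```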
As an example, this does not compile:
defn uniform(opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
  Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
end
But this does:
defn uniform(_t, opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {}, scale: 1.0e-2)
  Nx.random_uniform(shape, type: opts[:type]) * opts[:scale]
end
Even though _t is unused.
xarray can provide a lot of guidance here: http://xarray.pydata.org/en/stable/
It would also be interesting to see tensor backends that target libraries/formats such as pandas or Apache Arrow, and to see the changes we would need to make to the model to support those.
I just built without docker, all tests passing except the Convolution causes a segmentation fault. I have a feeling it's a version issue, because it had a warning along these lines:
Failed to determine best cudnn convolution algorithm. Falling back to default. Performance may be sub optimal.
For reference:
I have been reading a bit about XLA's Infeed/Outfeed ops and how they can be used to send data between the device and the host during a computation, which presents an opportunity for what I think would be significant performance increases.
JAX currently holds the MLPerf records training on a large TPU cluster. Looking at their ResNet implementation you can see how their training loop makes use of XLA Ops (loops and infeeds) to speed up training. You can use Infeeds to perform multiple steps per batch without having to rerun the computation. Infeeds accept input shapes and tokens which are used for enforcing an order between operations across replicas/partitions. Adding this feature would allow us to do something similar for whatever NN library we decide to implement, and would also give users the flexibility to speed up their own custom training loops.
In the same sense that you can pass data to a device during a computation, you can receive data from a still-running computation using outfeeds. An outfeed accepts a shape as well, and then an outfeed receiver handles data received from the Outfeed. The Python XLA client implements an outfeed receiver in C++, but reading the implementation notes, it seems like Elixir is a perfect fit for handling everything the Python Outfeed Receiver is trying to do in C++.
Reading about infeeds/outfeeds in the context of TPUs, it seems that they are almost ALWAYS infeed/outfeed bound, so taking advantage of TPUs in "coreless" mode is really important for performance. A TPU running in coreless mode is basically just using the TPU's CPU. The TPU's CPU has 300GB of memory which can be used for preprocessing/transformations in the data pipeline. It seems the most efficient way to train with TPU would be to:
defn compiled functions when needed or just plain Elixir for IO stuff. An additional advantage we have is that it should be very straightforward to do these transformations in parallel to pass to multiple TPU cores. A single TPU Pod has 2048 TPU cores, so they are massively parallel, but Elixir is the perfect language for handling this.
One of the big questions is how we implement something like this so it's backend agnostic. Having an Nx.infeed or an Nx.outfeed wouldn't really make sense. I think these probably tie in best with Nx.device.
Proposal: we will change the :data field of Nx.Tensor to be a {device_module, term}. The default is {Nx.BinaryDevice, binary}.
All functions in Nx will expect the device to be the binary device. If the device is elsewhere (think GPU), then it needs to be either read or transferred.
# Transfers data to the given device.
#
# Nx.device_transfer(tensor)
# Nx.device_transfer(tensor, Exla.NxDevice, device: {:cuda, 1})
#
# If a device is not given, Nx.BinaryDevice is used, which means
# the data is read into an Elixir binary. If the device is already
# Nx.BinaryDevice, it returns the tensor as is.
#
# Once transfer is done, the data is deallocated from the given
# tensor.
#
# If the device has already been deallocated, it raises.
Nx.device_transfer(tensor, device \\ Nx.BinaryDevice, device_opts \\ [])
# Read data that is on the device.
#
# It returns a tensor where the device is Nx.BinaryDevice.
# The data is not deallocated from the current device. If the
# device is already Nx.BinaryDevice, it returns the tensor as is.
#
# If the device has already been deallocated, it raises.
Nx.device_read(tensor)
# Deallocates data from device. Returns either
# :ok or :already_deallocated.
Nx.device_deallocate(tensor)
defmodule Nx.Device do
  @type state :: term
  @callback transfer({mod, term}, type, dims, opts) :: state
  @callback read(term) :: bitstring
  @callback deallocate(term) :: :ok | :already_deallocated
end
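To illustrate the read/deallocate semantics described in the comments above, here is a toy sketch in plain Elixir. It is not the proposed Nx.Device behaviour: the module names are hypothetical, transfer is simplified to a hypothetical allocate/1, and an Agent process stands in for memory that lives "elsewhere".

```elixir
defmodule DeviceSketch.Device do
  # Simplified, hypothetical variant of the behaviour in the proposal.
  @callback allocate(binary) :: term
  @callback read(term) :: binary
  @callback deallocate(term) :: :ok | :already_deallocated
end

defmodule DeviceSketch.AgentDevice do
  @behaviour DeviceSketch.Device

  # Copies the binary into a separate process, mimicking device memory.
  @impl true
  def allocate(binary) when is_binary(binary) do
    {:ok, pid} = Agent.start(fn -> binary end)
    pid
  end

  # Reading does not deallocate; it raises if the data is already gone.
  @impl true
  def read(pid) do
    if Process.alive?(pid) do
      Agent.get(pid, fn binary -> binary end)
    else
      raise ArgumentError, "data has already been deallocated"
    end
  end

  # Returns :ok the first time and :already_deallocated afterwards.
  @impl true
  def deallocate(pid) do
    if Process.alive?(pid) do
      :ok = Agent.stop(pid)
      :ok
    else
      :already_deallocated
    end
  end
end
```

A tensor's :data field would then hold something like {DeviceSketch.AgentDevice, pid}, and the Nx.device_* functions would dispatch on the first element of that tuple.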
If you wrap the contents of this test in a for:
https://github.com/elixir-nx/exla/blob/main/test/exla/nx_device_test.exs#L4
Like this:
test "transfers data from nx<->device" do
for _ <- 1..10000 do
t = Nx.tensor([1, 2, 3, 4])
Then it fails with 100% certainty for me. I think we are doing one of the following:
When sending data to the device, we are using zero-copy (which we shouldn't, we should only do zero-copy on run, since we know the binary can't be GCed meanwhile)
When reading the data for a CPU device, we are pointing to a place in memory instead of allocating an Erlang binary (unlikely?)
Leaving this here for tracking/discussion on what should and shouldn't be included. Looking at XLA/JAX/Numpy, these are some common ones we are missing:
There are a few others as well that we should look into, but these seemed to be the most common.
With #134 we now support interior padding, but unfortunately the gradient is broken for Nx.pad. In order to fix it, we will need to introduce Nx.slice
Validate that the function given to reduce is not really a closure (i.e. it does not access the parent's expressions)
Make sure functions become expressions when handling Expr for grads (instead of invoked on the compiler)
Checking out the project for the first time, I encountered an error during the first build attempt running mix test:
➜ exla git:(main) mix test
... <successes omitted> ...
Repository rule local_python_configure defined at:
/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl:275:26: in <toplevel>
Analyzing: target //tensorflow/compiler/xla/exla:libexla.so (13 packages loaded, 14 targets configured)
ERROR: An error occurred during the fetch of repository 'local_execution_config_python':
Traceback (most recent call last):
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
_get_numpy_include(<2 more arguments>)
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
execute(repository_ctx, <3 more arguments>)
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Repository go_sdk instantiated at:
no stack (--record_rule_instantiation_callstack not enabled)
Repository rule _go_download_sdk defined at:
/home/elbow-jason/.cache/bazel/_bazel_elbow-jason/bbfe9bcff2dc48e9f808ab24728cb493/external/io_bazel_rules_go/go/private/sdk.bzl:52:20: in <toplevel>
ERROR: Analysis of target '//tensorflow/compiler/xla/exla:libexla.so' failed; build aborted: Traceback (most recent call last):
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 213
_get_numpy_include(<2 more arguments>)
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/py/python_configure.bzl", line 187, in _get_numpy_include
execute(repository_ctx, <3 more arguments>)
File "/home/elbow-jason/.cache/exla/tf-6af836f407f546cf2f9ab3b5fcb7a8285bda5c96/third_party/remote_config/common.bzl", line 217, in execute
fail(<1 more arguments>)
Problem getting numpy include path.
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'
Is numpy installed?
INFO: Elapsed time: 21.495s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
FAILED: Build did NOT complete successfully (13 packages loaded, 14 targets configured)
make: *** [Makefile:25: all] Error 1
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
Not sure what system info is needed for diagnosis. Some basic info:
➜ exla git:(main) uname -a
Linux elbow-at-home 5.8.0-38-generic #43~20.04.1-Ubuntu SMP Tue Jan 12 16:39:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
➜ exla git:(main) gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
➜ exla git:(main) make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
➜ exla git:(main) asdf current erlang
erlang 23.1.5 /home/elbow-jason/.tool-versions
➜ exla git:(main) asdf current elixir
elixir 1.11.2-otp-23 /home/elbow-jason/.tool-versions
➜ exla git:(main) asdf current python
python 3.6.12 /home/elbow-jason/.tool-versions
Currently missing gradients for these functions:
I would like to collect your opinion on introducing a defn variant that uses the equal sign, so instead of this:
defn tanh(_shape, y, _x), do: 1.0 - y * y
we can write this:
defn tanh(_shape, y, _x) = 1.0 - y * y
The pro is that it looks closer to mathematical definitions and is less verbose. The con is that it looks less like regular Elixir code, although that can also be a plus, since it helps find defn functions in a module with mixed definitions.
If we decide to go down this route, we have two options:
1. = syntax everywhere, multiline looks like this:
defn gradient({w1, b1, w2, b2}, batch_images, batch_labels) =
  (
    grad_b1 = grad_b1(batch_labels)
    grad_b2 = grad_b2(w2, batch_labels)
    grad_w1 = grad_w1(w2, batch_images, batch_labels)
    grad_w2 = grad_w2(w1, b1, batch_images, batch_labels)
    {grad_w1, grad_b1, grad_w2, grad_b2}
  )
2. Allow both do/end and =
What are your thoughts?
Proposal still has to be drafted.
We are not currently matching on the type on the C++ side of things.
Requires erlang/otp#2890
On both Nx and Exla.
At least custom_grad and zero_grad. Anything else @seanmor5?
Right now, we're working off a recent TF master revision. We'll want to pin to a more stable release instead, and then work out a process for periodically updating the TF version.
For example, should this be allowed?
Nx.add([1, 2, 3], [4, 5, 6])
In any case, it will still return tensors.
This test will not compile:
defn zeros(opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end
This test will compile but raise (at least on Nx.Defn backend):
defn zeros(_t, opts \\ []) do
  shape = transform(opts, &Keyword.fetch!(&1, :shape))
  opts = keyword!(opts, type: {:f, 32}, shape: {})
  Nx.as_type(Nx.broadcast(0, shape), opts[:type])
end

test "zeros with default args" do
  assert zeros(Nx.tensor([1, 2, 3]), shape: {}) == Nx.tensor(0.0, type: {:f, 32})
end
** (ArgumentError) defn must return an expression tensor or a tuple, got: #Nx.Tensor<
f32
0.0
>
This workaround will also raise:
defmodule Zeros do
  def zeros(shape, opts \\ []) do
    type = opts[:type] || {:f, 32}
    constant(0, [shape: shape, type: type])
  end

  defnp constant(t, opts) do
    opts = keyword!(opts, type: {:f, 32}, shape: {})
    Nx.as_type(Nx.broadcast(t, opts[:shape]), opts[:type])
  end
end
** (ArgumentError) defn functions expects either numbers or tensors as arguments. Got: [shape: [shape: {}], type: {:f, 32}]
We will want to tag long-running NIFs as dirty. Probably compile (dirty CPU NIF) and run (which should have two versions: the CPU one is a dirty CPU NIF, and the GPU one is a dirty IO NIF).
Maybe we can call those binary_to_device_mem, read_device_mem and deallocate_device_mem?
This is a discussion that showed up internally and apparently there is a similar movement in PyTorch: http://nlp.seas.harvard.edu/NamedTensor
Some questions are:
Should we provide some default names?
Should we support constant names? i.e. have :i, :j, :k, etc work regardless of the names in the tensor. This can allow people to create generic algorithms without hardcoding dimensions.
Should we allow specifying which dimensions are batch dimensions (so we get vmap but without making it a transform)? Similar to the privacy section in the document above but already generalized for batching.
We should introduce a to_heatmap conversion, similar to how it is done in Matrex.
The Nx.Util module seems to be the best candidate for having this function. Currently Nx.Util is a handful of conversion functions from tensors to flat lists and scalars. However, we could also argue that heatmap belongs to Nx because in practice it is two operations:
Normalizing the data to heatmap values based on its min and max
Converting the values above to an ANSI value we can print
The first part could be implemented in Nx, which means it could be called inside defn and performed more efficiently. At the same time, step 1 is done with two traversals of a binary, which is relatively fast anyway.
To achieve step 2, my suggestion is for to_heatmap to return an Nx.Heatmap, which wraps the tensor and implements the Inspect protocol so it prints the heatmap. The advantage of making it a struct is that we can also implement this protocol for LiveBook and so on.
The heatmap will apply to the 2 lowest dimensions. If there are higher dimensions, they will be wrapped in lists. For example, a 14x14 tensor would look like this:
#Nx.Heatmap<
s64[14][14]
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
>
However, if the dimensions are {2, 14, 14}, we would get:
#Nx.Heatmap<
s64[2][14][14]
[
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............,
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
..............
]
>
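The two steps of the heatmap conversion can be sketched in plain Elixir as follows. This is only an illustration over flat lists, not the proposed Nx.Heatmap: the module name is hypothetical, and mapping onto the 24-step grayscale ramp of the 256-color ANSI palette (codes 232 to 255) is an assumption about how the values might be rendered.

```elixir
defmodule HeatmapSketch do
  # Step 1: normalize the values to the 0.0..1.0 range based on min and max.
  def normalize(values) do
    {min, max} = Enum.min_max(values)
    range = max - min

    Enum.map(values, fn value ->
      # A constant input has no range; map everything to 0.0 in that case.
      if range == 0, do: 0.0, else: (value - min) / range
    end)
  end

  # Step 2: map a normalized value to a grayscale code of the 256-color palette.
  def ansi_code(normalized), do: 232 + round(normalized * 23)

  # Renders one row of normalized values as colored cells, then resets colors.
  def render_row(normalized_values) do
    cells =
      Enum.map(normalized_values, fn value ->
        "\e[48;5;#{ansi_code(value)}m  "
      end)

    IO.iodata_to_binary([cells, "\e[0m"])
  end
end

[0, 5, 10]
|> HeatmapSketch.normalize()
|> HeatmapSketch.render_row()
|> IO.puts()
```

An Inspect implementation for a heatmap struct could call something like render_row/1 for each row of the two lowest dimensions.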
For now, at least:
I have been reading more into sparse matrices and one of the issues in scikit-learn is that, when you have a sparse matrix, it is strongly discouraged to use np functions because the matrix would lose its sparse properties. Instead, you should use the sparse-specific versions.
While this is a fine recommendation (and generally suitable to Elixir), it introduces an issue in that defn functions only allow function calls to the Nx module. Given we most likely want to support sparse-specific compilers in defn in the future, this puts us in a rough spot:
Inside defn, we need to call Nx with sparse matrices
Outside defn, you should not call Nx with sparse matrices
I believe we can address this with (surprise, surprise!) one extra level of indirection. We should make the Nx module a meta-module that doesn't really implement the operations but just defines their API. For example, the exp implementation would rather look like this:
def exp(tensor) do
  tensor = tensor(tensor)
  tensor.data.__struct__.exp(tensor)
end
This has a couple benefits:
Nx module as is
While I would be generally worried about this approach since the indirection can affect performance, the cost of an extra function call is irrelevant when working with tensors, so we are fine.
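Here is a minimal, self-contained sketch of that kind of struct-based dispatch in plain Elixir. All names are hypothetical, tensors are plain maps with a :data field, and the operation is negate rather than exp so that the sparse representation stays sparse.

```elixir
defmodule DispatchSketch.ListBackend do
  # Dense storage: every element is kept in a list.
  defstruct [:values]

  def negate(%{data: %__MODULE__{values: values}} = tensor) do
    %{tensor | data: %__MODULE__{values: Enum.map(values, fn v -> -v end)}}
  end
end

defmodule DispatchSketch.SparseBackend do
  # Sparse storage: only non-zero entries are kept, as index => value.
  defstruct [:size, :entries]

  def negate(%{data: %__MODULE__{entries: entries} = data} = tensor) do
    negated = Map.new(entries, fn {index, value} -> {index, -value} end)
    %{tensor | data: %{data | entries: negated}}
  end
end

defmodule DispatchSketch do
  # The meta-module only defines the API and dispatches on the struct
  # stored in the :data field, so each backend keeps its own representation.
  def negate(%{data: %backend{}} = tensor), do: backend.negate(tensor)
end

dense = %{data: struct!(DispatchSketch.ListBackend, values: [1, 0, -2])}
DispatchSketch.negate(dense)
```

The caller only ever touches DispatchSketch, which is the property that would let defn keep restricting calls to a single module.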
This is related to #174 so perhaps it's more appropriate to move this into the discussion in that issue.
Right now, the dot product is only aware of contracting dimensions. One example of something we can't support right now is a bilinear transformation like in: https://pytorch.org/docs/stable/nn.functional.html#bilinear
Given input shapes {a, b}, {z, b, c}, {a, c}, for a bilinear transformation the output shape should be {a, z}, given by:
input1
|> Nx.dot(weight)
|> Nx.dot(input2)
However, even if we configure the contracting dimensions correctly, we'll end up with shape {a, z, a}. Instead, what we need is to treat the first dimension as a batch dimension so the final dot product between shapes {a, z, c} and {a, c} is really a batched dot product between shapes {z, c} and {c}, resulting in an output of {a, z}.
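The batched interpretation of that last dot product can be sketched with nested lists in plain Elixir. This is only an illustration of the semantics, not an Nx API: the module and function names are hypothetical.

```elixir
defmodule BatchedDotSketch do
  # Dot product of a {z, c} matrix (list of rows) with a {c} vector.
  def matrix_vector(matrix, vector) do
    Enum.map(matrix, fn row ->
      row
      |> Enum.zip(vector)
      |> Enum.map(fn {x, y} -> x * y end)
      |> Enum.sum()
    end)
  end

  # Treats the leading dimension of both arguments as a batch dimension:
  # shapes {a, z, c} and {a, c} produce {a, z}, not {a, z, a}.
  def batched_dot(lhs, rhs) do
    lhs
    |> Enum.zip(rhs)
    |> Enum.map(fn {matrix, vector} -> matrix_vector(matrix, vector) end)
  end
end

# lhs has shape {2, 2, 3} and rhs has shape {2, 3}.
lhs = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 0], [0, 1, 0]]]
rhs = [[1, 1, 1], [7, 8, 9]]
BatchedDotSketch.batched_dot(lhs, rhs)
```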
So far we have mean and sum, but we are missing:
reduce_max
reduce_min
reduce_prod
As well as variants for reduce_window:
reduce_window_sum
reduce_window_mean
reduce_window_max
reduce_window_min
reduce_window_prod
We could also probably tackle the cumulative ops. We can also consider different names for these functions. Maybe rather than reduce_max and reduce_min, just a max/1 and min/1.
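As a point of reference for the reduce_window variants, here is a one-dimensional sketch over plain lists. The names are hypothetical and it assumes no padding, so incomplete trailing windows are discarded.

```elixir
defmodule ReduceWindowSketch do
  # Slides a window of `window_size` over the list with the given stride
  # and reduces each complete window with `fun`.
  def reduce_window(list, window_size, stride, fun) do
    list
    |> Enum.chunk_every(window_size, stride, :discard)
    |> Enum.map(fn window -> Enum.reduce(window, fun) end)
  end

  # The named variants only differ in the reduction function.
  def reduce_window_max(list, window_size, stride),
    do: reduce_window(list, window_size, stride, &Kernel.max/2)

  def reduce_window_sum(list, window_size, stride),
    do: reduce_window(list, window_size, stride, &Kernel.+/2)
end

ReduceWindowSketch.reduce_window_max([1, 3, 2, 5, 4], 2, 1)
```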
See the tensorflow feature and linked paper: https://www.tensorflow.org/api_docs/python/tf/vectorized_map
max_float_type will rewrite all floats with equal or higher size to the given one.
@defn_compiler {EXLA, max_float_type: {:bf, 16}}
max_float_type(expr, {:bf, 16})
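The type rule itself is small. Below is a sketch of it on bare type tuples in plain Elixir; the module and function names are hypothetical, and it assumes float types are tagged :f or :bf as in the examples above.

```elixir
defmodule MaxFloatTypeSketch do
  # Floats whose size is equal to or higher than the maximum are rewritten
  # to the maximum type; every other type is left untouched.
  def rewrite_type({kind, size}, {_max_kind, max_size} = max_type)
      when kind in [:f, :bf] and size >= max_size do
    max_type
  end

  def rewrite_type(type, _max_type), do: type

  # A traversal over an expression would apply the rule to each node's type;
  # here a list of types stands in for the nodes.
  def rewrite_types(types, max_type) do
    Enum.map(types, fn type -> rewrite_type(type, max_type) end)
  end
end

MaxFloatTypeSketch.rewrite_types([{:f, 64}, {:f, 32}, {:s, 64}], {:bf, 16})
```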
If we go the transformer route, we could make the transformer module public, which means we can support both APIs (either via a transform or via an option). Implementing the traversal is easy and requires rewriting the types of the appropriate nodes. Only a handful of nodes require special attention:
Currently, we explicitly transfer buffers to the device before calling the run NIF from within Exla.Executable.run when the platform is a GPU, like this:
outside_cpu = client.platform == :cuda || client.platform == :rocm
keep_on_device_int = if keep_on_device || outside_cpu, do: 1, else: 0
device_id = device_assignment_to_device_id(executable, {replica, partition})

inputs =
  Enum.map(arguments, fn
    %Buffer{ref: {ref, _}, data: nil} ->
      ref

    buffer = %Buffer{data: data, shape: shape, ref: nil} ->
      if outside_cpu do
        %Buffer{ref: {ref, _}} = Buffer.place_on_device(buffer, client, device_id)
        ref
      else
        {data, shape.ref}
      end
  end)
Originally, this was done to decouple what could have made the run NIF IO bound or CPU bound. After determining the GPU case is always closer to IO bound, we no longer need this logic; however, removing it leads to OOM errors on the MNIST example.
I would like to add a higher-level looping construct to defn. In Elixir, we can use for+:reduce but I am afraid it will be too verbose and foreign for new users. Let's imagine we want to translate this Python code:
import numpy as np

def _smooth(x):
    out = np.empty_like(x)
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            # Average of the 3x3 neighbourhood around (i, j)
            out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i + 0, j + -1] + x[i + 0, j + 0] + x[i + 0, j + 1] +
                         x[i + 1, j + -1] + x[i + 1, j + 0] + x[i + 1, j + 1]) // 9
    return out
With for+:reduce, we would write it as:
def smooth(x) do
  # Elixir ranges are inclusive, hence - 2 to match Python's range(1, n - 1)
  for i <- 1..elem(x.shape, 0)-2, j <- 1..elem(x.shape, 1)-2, reduce: x do
    x ->
      put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                       x[i + 0, j + -1] + x[i + 0, j + 0] + x[i + 0, j + 1] +
                       x[i + 1, j + -1] + x[i + 1, j + 0] + x[i + 1, j + 1]) / 9
  end
end
I propose we introduce a loop construct, inspired by Futhark, that looks like this:
loop tuple_or_var [= expr], [pattern <- expr]+ do
end
Rewriting the above to this loop construct, we have:
def smooth(x) do
  # Elixir ranges are inclusive, hence - 2 to match Python's range(1, n - 1)
  loop x,
       i <- 1..elem(x.shape, 0)-2,
       j <- 1..elem(x.shape, 1)-2 do
    put_in x[i, j], (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                     x[i + 0, j + -1] + x[i + 0, j + 0] + x[i + 0, j + 1] +
                     x[i + 1, j + -1] + x[i + 1, j + 0] + x[i + 1, j + 1]) / 9
  end
end
There is one downside with this approach: the only form of loops we have in XLA are while loops, which are sequential. Other languages, such as Taichi, can optimize them to run in parallel. For this reason, we may want to introduce higher-level constructs for manipulating tensors. In particular, I believe we should introduce functions such as map, map_with_index, reduce and reduce_with_index. I have some thoughts on how we can implement said functions so they also work with batching out of the box, but I am waiting for some feedback on this issue before moving forward.
One option is to remove EXLA_TARGET. If that doesn't work, we should namespace the directories with the target.
Following the README:
➜ exla git:(main) mix deps.get
Resolving Hex dependencies...
Dependency resolution completed:
Unchanged:
benchee 1.0.1
deep_merge 1.0.0
elixir_make 0.6.1 RETIRED!
(invalid) Wrong permissions on the mix.exs file
All dependencies are up to date
While the EXLA compiler already performs operator fusion, we should also be able to do some operator fusion for the built-in compiler written in Elixir. We can fuse most unary operators and binary operators with constants to reduce the number of traversals. It still won't hold a candle to compiled modes, but it should be an overall win for performance.
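To show what fusing element-wise operators buys, here is a small sketch over plain lists. The module is hypothetical and works on lists rather than tensor binaries; the point is only that the fused version traverses the data once, regardless of how many operators are chained.

```elixir
defmodule FusionSketch do
  # Unfused: one full traversal of the data per operator.
  def run_unfused(data, funs) do
    Enum.reduce(funs, data, fn fun, acc -> Enum.map(acc, fun) end)
  end

  # Fused: compose the operators first, then traverse the data once.
  def run_fused(data, funs) do
    Enum.map(data, fuse(funs))
  end

  # Builds a single function that applies each operator in order.
  def fuse(funs) do
    fn element -> Enum.reduce(funs, element, fn fun, acc -> fun.(acc) end) end
  end
end

# A binary operator with a constant (+ 1) followed by another (* 2).
operators = [fn x -> x + 1 end, fn x -> x * 2 end]
FusionSketch.run_fused([1, 2, 3], operators)
```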
Today exla_client.cc will only zero-copy binaries under certain conditions, but run treats it as if zero-copy always happens.
My suggestion is to break this function in three: one for zero copy allocation and one for device allocation. Then in run, we will try to zero copy and if not, we call the device allocation function, storing the results of zero-copying or not in a separate vector.
Then, after running, we traverse the inputs again, choosing to either release or zero copy them based on the results.
Then we make Exla work with a superset of the given types, keeping its internal representation as much as possible.
This means we can implement quantization, type replacement, and similar with Expr instead of implementing it for each compiler.
Constructs:
Passes:
For example, suppose we need to load 2GB of data. Today we need to first load it into memory and then load it into the GPU. We would like to do that without having to load it all onto the CPU first.
The biggest question is what the API would be on the Elixir side. We could have a Stream-based one (immutable) or a process-based one (mutable). Perhaps both?