cunumeric's Issues

`np.sum` over an array axis giving a wrong result

Hi. I'm trying legate.numpy on Piz Daint. The following should give a symmetric matrix; however, with legate.numpy it gives a matrix of zeros:

import legate.numpy as np


x = np.random.random((60, 30))

def euclidean_broadcast(x, y):
    """r_ij = (x_ij - y_ij)^2"""

    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]
    return (diff * diff).sum(axis=2)


edm = euclidean_broadcast(x, x)
print(edm[:3, :3])

I'm running with

legate --launcher srun --nodes 1 --gpus 1 --fbmem 14000  edm.py -lg:numpy:test --eager-alloc-percentage 5

edm.py is the script with code above.

Everything looks fine up to the (diff * diff). It's the sum(axis=2) that gives the zeros. It happens when using --gpus; the CPU version works fine.

I'm using these commits:

legate.numpy 496c64d (2021-05-12)
legate.core  9e327b7 (2021-05-12)

Please, let me know if you need more information. Thanks in advance!

sharding functor related error when running gemm.py

I'm executing gemm.py on 8 nodes with this command line and see the following error

/g/g15/yadav2/legate.core/install/bin/legate /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 --omps 2 --ompthreads 18 --nodes 32 --numamem 30000 --eager-alloc-percentage 1 --cpus 1 --sysmem 10000 --launcher jsrun --cores-per-node 40 --verbose
Running: jsrun -n 32 -r 1 -a 1 -c 40 -g 0 -b none /g/g15/yadav2/legate.core/install/bin/legion_python /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 -ll:py 1 -lg:local 0 -ll:ocpu 2 -ll:othr 18 -ll:onuma 1 -ll:util 2 -ll:bgwork 2 -ll:csize 10000 -ll:nsize 30000 -ll:ncsize 0 -level openmp=5 -lg:eager_alloc_percentage 1
[7 - 20003ac2f8b0]    5.675422 {5}{runtime}: [error 605] LEGION ERROR: Illegal output shard 32 from sharding functor 1073741900. Shards for this index space launch must be between 0 and 32 (exclusive). (from file /g/g15/yadav2/legate.core/legion/runtime/legion/runtime.cc:15560)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_605

Using `step` in slicing gives unclear error

Problem

Something like a[::3] does not work and gives TypeError: '<' not supported between instances of 'NoneType' and 'int', which does not make clear what actually went wrong.

To reproduce

  1. test.py
    from legate import numpy
    a = numpy.random.random(100)
    print(a[::3])
  2. legate --cpus 1 ./test.py -lg:numpy:test

Output

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 408, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 200, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 3, in <module>
    print(a[::3])
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 382, in __getitem__
    shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 428, in get_item
    view, dim_map = self._get_view(key)
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/thunk.py", line 378, in _get_view
    view = (self._standardize_slice_key(key, 0),)
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/thunk.py", line 325, in _standardize_slice_key
    or (key.stop < 0 and -key.step > diff)
TypeError: '<' not supported between instances of 'NoneType' and 'int'

Note

Not urgent, because it does not fail silently; it still raises an error, just not a clear one.
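
As a side note, here is a minimal plain-Python sketch (independent of the legate code) of how slice.indices() normalizes the None fields of a slice before any comparison; a check along these lines would avoid comparing NoneType against int:

# slice.indices() turns None start/stop/step into concrete integers for a
# given dimension length, so comparisons like `key.stop < 0` are always valid.
key = slice(None, None, 3)   # what a[::3] produces
length = 100                 # size of the dimension being sliced
start, stop, step = key.indices(length)
print(start, stop, step)                    # -> 0 100 3
print(list(range(start, stop, step))[:5])   # first few selected indices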

More accurate array overlap test

The current RegionField overlap test (meant to check whether an intra-array copy is safe to be implemented with a single Legion copy/task) is too conservative to be useful.

https://github.com/nv-legate/legate.numpy/blob/5ec509907b9e49d98edcbc2fdc9ff2b55d2f5f33/legate/numpy/runtime.py#L496-L517

It ignores slice steps, and is very inaccurate when going from a 2d view to a 1d base array, e.g. for:

a = np.arange(25)
b = a.reshape((5, 5))

it will decide that b[3:5, 0:2] and b[3:5, 2:4] overlap, because it translates the rectangles to the base 1d space, and considers the bounding boxes on that space (a[15:22] and a[17:24] in this example).
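
To make the inaccuracy concrete, here is a small vanilla-NumPy illustration (not legate code) showing that the two sub-rectangles above are actually disjoint even though their flattened 1D bounding boxes overlap:

import numpy as np

# Flat (C-order) indices of the two 2D sub-rectangles from the example above.
rows = np.arange(3, 5)
lhs = (rows[:, None] * 5 + np.arange(0, 2)).ravel()  # b[3:5, 0:2] -> [15 16 20 21]
rhs = (rows[:, None] * 5 + np.arange(2, 4)).ravel()  # b[3:5, 2:4] -> [17 18 22 23]

print(lhs.min(), lhs.max() + 1)   # bounding box a[15:22]
print(rhs.min(), rhs.max() + 1)   # bounding box a[17:24]
print(set(lhs) & set(rhs))        # empty set: the actual index sets are disjoint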

Copying a slice to another slice in the same array fails silently

Bug report due to @piyueh

The following code, when run with -lg:numpy:test, prints [False], indicating that the slice has not been updated:

from legate import numpy
a = numpy.random.random((3, 3))
a[:, 0] = a[:, 2]
print(numpy.allclose(a[:, 0], a[:, 2]))

After some digging I found that we skip copies between sub-regions if they're backed by the same field, which is actually only safe if the slices are equivalent: https://github.com/nv-legate/legate.numpy/blob/2b460c5dfdd60b673e37e25231bf625fdf3ead0e/legate/numpy/deferred.py#L101-L105

If we simply skip this check then the copy ends up happening through a CopyTask, which works with subregions of the same base region.

However, the runtime errors out if the two slices overlap, e.g. if we do a[0,0:2] = a[0,1:3] (vanilla NumPy accepts this, and does the expected thing). We should at least check for overlaps in python and produce a reasonable error message.

We also want to add a case for this to the test suite.
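
For reference, a minimal sketch of what such a test could assert, using vanilla NumPy as the oracle (the concrete values are only for illustration):

import numpy as np

# Disjoint columns of the same array: a plain copy.
a = np.arange(9, dtype=float).reshape((3, 3))
a[:, 0] = a[:, 2]
assert np.allclose(a[:, 0], a[:, 2])

# Overlapping slices of the same array: NumPy still produces the expected result.
b = np.arange(5, dtype=float)
b[0:2] = b[1:3]
assert np.allclose(b[0:2], [1.0, 2.0])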

Undefined symbol: legion_auto_generate_id

I installed legate.numpy on top of legate.core on Ubuntu 18.04. After manually exporting LEGATE_MAX_DIMS and LEGATE_MAX_FIELDS as environment variables, I got the following error when importing legate.numpy. Any idea how to fix this?

Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import legate.numpy as np
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/__init__.py", line 21, in <module>
    from legate.numpy import linalg, random
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/linalg/__init__.py", line 19, in <module>
    from legate.numpy.linalg.linalg import *
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/linalg/linalg.py", line 18, in <module>
    from legate.numpy.array import ndarray, runtime
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/numpy/array.py", line 24, in <module>
    from legate.core import LegateArray
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/__init__.py", line 19, in <module>
    from legate.core.legate import (
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legate.py", line 32, in <module>
    from legate.core.legion import (
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legion.py", line 858, in <module>
    class IndexPartition(object):
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/legate/core/legion.py", line 868, in IndexPartition
    part_id=legion.legion_auto_generate_id(),
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 912, in __getattr__
    make_accessor(name)
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 908, in make_accessor
    accessors[name](name)
  File "/home/kwu/anaconda3/envs/legate/lib/python3.8/site-packages/cffi/api.py", line 838, in accessor_function
    value = backendlib.load_function(BType, name)
AttributeError: function/symbol 'legion_auto_generate_id' not found in library '<None>': python: undefined symbol: legion_auto_generate_id
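
One way to narrow this down is to check whether the Legion library that actually gets loaded exports the symbol. This is only a hypothetical debugging sketch; the library name/path is an assumption and may need to point at the liblegion.so inside your legate.core install prefix:

import ctypes

# hasattr() on a ctypes CDLL performs a symbol lookup and returns False if the
# symbol is missing from the loaded library.
lib = ctypes.CDLL("liblegion.so", mode=ctypes.RTLD_GLOBAL)
print(hasattr(lib, "legion_auto_generate_id"))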

Exception raised when using sys.exit(0)

Problem

When a script calls sys.exit(0) to indicate normal termination, Legate seems to catch the SystemExit raised by sys.exit(0) and treat it as an error.

To reproduce

  1. step 1: prepare two Python test1.py and test2.py with these contents:
    • test1.py
      import sys
      from legate import numpy
      sys.exit(0)
    • test2.py
      import sys
      from legate import numpy
  2. step 2: run both scripts with legate, for example:
    $ legate --cpus 1 test1.py
    and
    $ legate --cpus 1 test2.py

Expected and actual outputs

Both scripts are supposed to output nothing. However, test1.py prints this message:

[0 - 7f34317b87c0]    0.807057 {6}{python}: python exception occurred within task:

I guess Legate catches the SystemExit from sys.exit and treats it as a normal exception, i.e., an error.
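
For context, a small plain-Python sketch of why this can happen: SystemExit derives from BaseException rather than Exception, so a top-level handler that catches BaseException (or uses a bare except:) will see sys.exit(0) as an "exception" unless it re-raises or special-cases it.

import sys

print(issubclass(SystemExit, Exception))      # False
print(issubclass(SystemExit, BaseException))  # True

try:
    sys.exit(0)
except SystemExit as e:
    # A wrapper that wants normal termination semantics should re-raise here
    # (or exit with e.code) instead of reporting an error.
    print("caught SystemExit with code", e.code)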

Meaning of parameters in NumPyProjectionFunctorRadix2D

Hi, in the NumPyProjectionFunctorRadix2D (similar in NumPyProjectionFunctorRadix3D)
https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/src/proj.cc#L528

there are three parameters: template <int DIM, int RADIX, int OFFSET>.
The DIM is the dimension of a tensor. What about RADIX and OFFSET? I notice that in the register_projection_functors function, the RADIX is given as 4 and OFFSET ranges from 0 to 3.

register_functor<NumPyProjectionFunctorRadix3D<0, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_3D_X_4_0);

What about 4D or N-D arrays? Should I register the 4D tensor projection functors as follows:

register_functor<NumPyProjectionFunctorRadix4D<0, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
register_functor<NumPyProjectionFunctorRadix4D<0, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);

register_functor<NumPyProjectionFunctorRadix4D<1, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
register_functor<NumPyProjectionFunctorRadix4D<1, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);

register_functor<NumPyProjectionFunctorRadix4D<2, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
register_functor<NumPyProjectionFunctorRadix4D<2, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);

register_functor<NumPyProjectionFunctorRadix4D<3, 4, 0>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 1>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 2>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
register_functor<NumPyProjectionFunctorRadix4D<3, 4, 3>>(
runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);

?

legate numpy very slow compared to Python+Numpy

I've been testing a simple Laplace equation solver to compare Python+NumPy against legate.numpy, and legate.numpy is hugely slower than NumPy.

The code is taken from: https://barbagroup.github.io/essential_skills_RRC/laplace/1/ . The code I actually run is the following:

import numpy as np
import time


def L2_error(p, pn):
    return np.sqrt(np.sum((p - pn)**2)/np.sum(pn**2))
# end if


def laplace2d(p, l2_target):
    '''Iteratively solves the Laplace equation using the Jacobi method

    Parameters:
    ----------
    p: 2D array of float
        Initial potential distribution
    l2_target: float
        target for the difference between consecutive solutions

    Returns:
    -------
    p: 2D array of float
        Potential distribution after relaxation
    '''

    l2norm = 1.0
    icount = 0
    tot_time = 0.0
    pn = np.empty_like(p)
    while l2norm > l2_target:

        start = time.perf_counter()

        icount = icount + 1
        pn = p.copy()
        p[1:-1,1:-1] = .25 * (pn[1:-1,2:] + pn[1:-1, :-2] \
                              + pn[2:, 1:-1] + pn[:-2, 1:-1])

        ##Neumann B.C. along x = L
        p[1:-1, -1] = p[1:-1, -2]     # 1st order approx of a derivative 
        l2norm = L2_error(p, pn)
        end = time.perf_counter()

        tot_time = tot_time + (end-start)

    # end while

    print("l2norm = ",l2norm)
    print("icount = ",icount)
    print("Total Iteration Time = ",tot_time)
    print("   Time per iteration = ",tot_time/icount)

    return p
# end if



if __name__ == "__main__":

    nx = 401
    ny = 401

    # Initial conditions
    p = np.zeros((ny,nx)) ##create a XxY vector of 0's

    # Dirichlet boundary conditions
    x = np.linspace(0,1,nx)
    p[-1,:] = np.sin(1.5*np.pi*x/x[-1])
    del x


    start = time.time()
    p = laplace2d(p.copy(), 1e-8)
    stop = time.time()

    print("Elapsed time = ",(stop-start)," secs")
    print(" ")


# end if

When I run it on my laptop with Anaconda Python3 and Numpy I get the following:

$ python3 jacobi.py 
l2norm =  9.99986062249016e-09
icount =  153539
Total Iteration Time =  127.02529454990054
   Time per iteration =  0.0008273161512703648
Elapsed time =  127.14257955551147  secs

When I change the import line to legate.numpy, I usually stop the code after 15 minutes of wall time. I have let it run for up to 60 minutes and it never converges.

As a check, I've run the Numpy code with legate itself and it exactly matches the Numpy results.

I have been experimenting with replacing the l2norm computations with numpy specific functions (np.subtract, np.square, etc.) but I have achieved no increase in performance.

Does anyone have any recommendations?

Thanks!

Jeff

(edit by Manolis: added some formatting for the code sections)

use OpenBLAS develop branch

This is clearly an issue in OpenBLAS but it blocks my Legate Numpy install and is unexpected, based on my experience with OpenBLAS in other contexts.

jhammond@nuclear:~/LEGATE/np$ python3 ./install.py --install-dir $HOME/LEGATE --with-core $HOME/LEGATE 2>&1 | tee log
Verbose build is  off
Legate is installing OpenBLAS into a local directory...
Cloning into '/tmp/tmpm780ryjm'...
Note: switching to 'd2b11c47774b9216660e76e2fc67e87079f26fa1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

Switched to a new branch 'master'
getarch_2nd.c: In function 'main':
getarch_2nd.c:14:35: error: 'SGEMM_DEFAULT_UNROLL_M' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_UNROLL_M'?
   14 |     printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   SBGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:14:35: note: each undeclared identifier is reported only once for each function it appears in
getarch_2nd.c:15:35: error: 'SGEMM_DEFAULT_UNROLL_N' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_UNROLL_N'?
   15 |     printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   SBGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:16:35: error: 'DGEMM_DEFAULT_UNROLL_M' undeclared (first use in this function); did you mean 'XGEMM_DEFAULT_UNROLL_M'?
   16 |     printf("DGEMM_UNROLL_M=%d\n", DGEMM_DEFAULT_UNROLL_M);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:17:35: error: 'DGEMM_DEFAULT_UNROLL_N' undeclared (first use in this function); did you mean 'QGEMM_DEFAULT_UNROLL_N'?
   17 |     printf("DGEMM_UNROLL_N=%d\n", DGEMM_DEFAULT_UNROLL_N);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:21:35: error: 'CGEMM_DEFAULT_UNROLL_M' undeclared (first use in this function); did you mean 'XGEMM_DEFAULT_UNROLL_M'?
   21 |     printf("CGEMM_UNROLL_M=%d\n", CGEMM_DEFAULT_UNROLL_M);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:22:35: error: 'CGEMM_DEFAULT_UNROLL_N' undeclared (first use in this function); did you mean 'QGEMM_DEFAULT_UNROLL_N'?
   22 |     printf("CGEMM_UNROLL_N=%d\n", CGEMM_DEFAULT_UNROLL_N);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:23:35: error: 'ZGEMM_DEFAULT_UNROLL_M' undeclared (first use in this function); did you mean 'XGEMM_DEFAULT_UNROLL_M'?
   23 |     printf("ZGEMM_UNROLL_M=%d\n", ZGEMM_DEFAULT_UNROLL_M);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   XGEMM_DEFAULT_UNROLL_M
getarch_2nd.c:24:35: error: 'ZGEMM_DEFAULT_UNROLL_N' undeclared (first use in this function); did you mean 'QGEMM_DEFAULT_UNROLL_N'?
   24 |     printf("ZGEMM_UNROLL_N=%d\n", ZGEMM_DEFAULT_UNROLL_N);
      |                                   ^~~~~~~~~~~~~~~~~~~~~~
      |                                   QGEMM_DEFAULT_UNROLL_N
getarch_2nd.c:71:50: error: 'SGEMM_DEFAULT_Q' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_Q'?
   71 |     printf("#define SLOCAL_BUFFER_SIZE\t%ld\n", (SGEMM_DEFAULT_Q * SGEMM_DEFAULT_UNROLL_N * 4 * 1 *  sizeof(float)));
      |                                                  ^~~~~~~~~~~~~~~
      |                                                  SBGEMM_DEFAULT_Q
getarch_2nd.c:72:50: error: 'DGEMM_DEFAULT_Q' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_Q'?
   72 |     printf("#define DLOCAL_BUFFER_SIZE\t%ld\n", (DGEMM_DEFAULT_Q * DGEMM_DEFAULT_UNROLL_N * 2 * 1 *  sizeof(double)));
      |                                                  ^~~~~~~~~~~~~~~
      |                                                  SBGEMM_DEFAULT_Q
getarch_2nd.c:73:50: error: 'CGEMM_DEFAULT_Q' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_Q'?
   73 |     printf("#define CLOCAL_BUFFER_SIZE\t%ld\n", (CGEMM_DEFAULT_Q * CGEMM_DEFAULT_UNROLL_N * 4 * 2 *  sizeof(float)));
      |                                                  ^~~~~~~~~~~~~~~
      |                                                  SBGEMM_DEFAULT_Q
getarch_2nd.c:74:50: error: 'ZGEMM_DEFAULT_Q' undeclared (first use in this function); did you mean 'SBGEMM_DEFAULT_Q'?
   74 |     printf("#define ZLOCAL_BUFFER_SIZE\t%ld\n", (ZGEMM_DEFAULT_Q * ZGEMM_DEFAULT_UNROLL_N * 2 * 2 *  sizeof(double)));
      |                                                  ^~~~~~~~~~~~~~~
      |                                                  SBGEMM_DEFAULT_Q
make: *** [Makefile.prebuild:74: getarch_2nd] Error 1
Makefile:154: *** OpenBLAS: Detecting CPU failed. Please set TARGET explicitly, e.g. make TARGET=your_cpu_target. Please read README for the detail..  Stop.
Traceback (most recent call last):
  File "./install.py", line 543, in <module>
    driver()
  File "./install.py", line 539, in driver
    install_legate_numpy(unknown=unknown, **vars(args))
  File "./install.py", line 359, in install_legate_numpy
    install_openblas(openblas_dir, thread_count, verbose)
  File "./install.py", line 143, in install_openblas
    execute_command(
  File "./install.py", line 62, in execute_command
    subprocess.check_call(args, cwd=cwd, shell=shell)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make', '-j', '8', 'USE_THREAD=1', 'NO_STATIC=1', 'USE_OPENMP=1', 'NUM_PARALLEL=32', 'LIBNAMESUFFIX=legate']' returned non-zero exit status 2.

Question: `--no-openmp`, `has_openmp` and OpenBLAS

Whether OpenBLAS will be built with OpenMP or not is determined by the result of the function has_openmp(). The result of has_openmp() depends on the outcome of a g++ compilation. That is, even if I use --no-openmp to install Legate NumPy, the underlying OpenBLAS is still built with OpenMP (if my g++ supports it). Is this a designed behavior? Thanks.
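
For readers unfamiliar with this kind of check, here is a rough sketch of how a compile-based OpenMP probe typically works (an illustration only, not the actual has_openmp() from the install script): try to compile a tiny OpenMP program and report whether the compiler accepts it.

import os
import subprocess
import tempfile

SRC = "#include <omp.h>\nint main() { return omp_get_max_threads() > 0 ? 0 : 1; }\n"

def compiler_supports_openmp(cxx="g++"):
    # Compile a minimal OpenMP program with -fopenmp; success means the
    # compiler (and its runtime) support OpenMP.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "check.cpp")
        out = os.path.join(tmp, "check.out")
        with open(src, "w") as f:
            f.write(SRC)
        result = subprocess.run([cxx, "-fopenmp", src, "-o", out], capture_output=True)
        return result.returncode == 0

print(compiler_supports_openmp())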

AttributeError: type object 'ndarray' has no attribute 'convert_to_legate_array'

Location: line 1622 in array.py
https://github.com/nv-legate/legate.numpy/blob/3452c85f93c4a886e9f4bff5f2e87b20f98b30bf/legate/numpy/array.py#L1622

I believe this is a bug. It looks like a typo for convert_to_legate_ndarray (the method at lines 109 to 118 in the same file):
https://github.com/nv-legate/legate.numpy/blob/3452c85f93c4a886e9f4bff5f2e87b20f98b30bf/legate/numpy/array.py#L109-L118

If an example is needed to see how this is triggered, here's one:

  1. step 1: create test.py:

    from legate import numpy
    a = numpy.array([1, 2, 3, 0, 4, 5, 6], dtype=float)
    b = numpy.array([1, 0, 3, 0, 4, 5, 0], dtype=float)
    d = numpy.divide(a, b, out=numpy.zeros_like(b), where=(b != 0))
    print("d:", d)
  2. step 2: run with, e.g., $ legate --cpus 1 ./test.py

The output:

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 4, in <module>
    d = numpy.divide(a, b, out=numpy.zeros_like(b), where=(b != 0))
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 853, in divide
    return true_divide(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 1050, in true_divide
    return ndarray.perform_binary_op(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 2058, in perform_binary_op
    cls.get_where_thunk(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 1622, in get_where_thunk
    array = cls.convert_to_legate_array(where)
AttributeError: type object 'ndarray' has no attribute 'convert_to_legate_array'

After changing it to convert_to_legate_ndarray, the division works as expected.

Comprehensively exercise each function in the test suite

Each supported function should be fully exercised in the test suite, e.g.:

  • all datatypes supported for the operation
  • all supported options
  • using a where array, if supported by the operation
  • all broadcasting modes
  • all cases of implicit up-casting (e.g. adding an integer and a real array) -- we should test at least every combination of an integer, a floating point and a complex array
  • other cases of store transformations on inputs (e.g. slice, transpose)
  • all array dimensions, up to the max number of dimensions that Legate was compiled for
  • passing existing arrays as outputs

To keep things sane, each of the above parameters can be tested in isolation.

To cover an arbitrary number of dimensions it will be necessary to programmatically generate inputs, e.g. see https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/tests/binary_op_broadcast.py and https://github.com/nv-legate/legate.numpy/blob/067a541905bf3bfc8d3727c6e1fe97a4855729b9/tests/intra_array_copy.py.
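
A minimal sketch of such programmatic generation (the dtype list and maximum dimension count are placeholders, not the project's actual settings):

import itertools
import numpy as np

MAX_DIMS = 3
DTYPES = [np.int32, np.float32, np.complex64]

def generate_inputs():
    # One array per (ndim, dtype) combination, with a distinct extent per axis.
    for ndim, dtype in itertools.product(range(1, MAX_DIMS + 1), DTYPES):
        shape = tuple(range(2, 2 + ndim))          # (2,), (2, 3), (2, 3, 4), ...
        yield np.arange(np.prod(shape), dtype=dtype).reshape(shape)

for arr in generate_inputs():
    # Stand-in for a real per-operation check against the NumPy result.
    assert arr.sum() == np.arange(arr.size).sum()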

The NumPy test suite may be a good starting point, see #22.

2D array operations trigger `NotImplementedError: Legate needs support for more than 3 dimensions`

Problem

When doing a division that requires a 1D denominator to be broadcast to a 2D array, and when the shape is larger than a certain size, an exception is raised

Traceback (most recent call last):
  File "<removed>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<removed>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test1.py", line 14, in <module>
    c = (a[:, 1:] - a[:, :-1]) / (b[1:] - b[:-1])
  File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 776, in __truediv__
    return self.internal_truediv(
  File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 519, in internal_truediv
    return self.perform_binary_op(
  File "<removed>/lib/python3.8/site-packages/legate/numpy/array.py", line 2054, in perform_binary_op
    out._thunk.binary_op(
  File "<removed>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 4876, in binary_op
    ) = self.runtime.compute_broadcast_transform(
  File "<removed>/lib/python3.8/site-packages/legate/numpy/runtime.py", line 2605, in compute_broadcast_transform
    raise NotImplementedError(
NotImplementedError: Legate needs support for more than 3 dimensions

To reproduce

  1. step 1: create a test Python script, say, test.py. Its content is:
    from legate import numpy
    a = numpy.random.random((400, 2001))
    b = numpy.random.random(2001)
    c = (a[:, 1:] - a[:, :-1]) / (b[1:] - b[:-1]) 
  2. step 2: run test.py with legate. I'm using this one for my test:
    $ legate --cpus 1 ./test.py

Other notes

Using different shapes/sizes to generate a and b seems to also affect the errors. Smaller shapes/sizes do not give any error. For example, (4, 21) and (21,) for a and b respectively do not raise any errors.

Also, using the same shape but with different runtime flags may or may not return errors. For example, using (40, 201) for a and (201,) for b:

  • Using legate --cpus 0 --omps 1 --ompthreads 1 ./test.py works fine. No error.
  • Using legate --cpus 0 --omps 1 --ompthreads -ll:okindhack 1 ./test.py returns the error of NotImplementedError: Legate needs support for more than 3 dimensions.

My workaround

If I explicitly do the broadcasting before the division, everything is fine.
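
One way the workaround could look (assumption: numpy.broadcast_to, or an equivalent, is available in this legate.numpy version; the original report only states that broadcasting explicitly before the division avoids the error):

from legate import numpy

a = numpy.random.random((400, 2001))
b = numpy.random.random(2001)

# Materialize the denominator at the full 2D shape before dividing, so the
# division itself needs no implicit broadcast transform.
den = numpy.broadcast_to(b[1:] - b[:-1], (400, 2000))
c = (a[:, 1:] - a[:, :-1]) / den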

Tests fail with a Mapper error

When run in debug mode, some tests fail with the following error:

[0 - 7f8847346700]    1.473962 {5}{runtime}: [error 67] LEGION ERROR: Invalid mapper output from invocation of 'map_task' on mapper NumPy Mapper on Node 0. Mapper specified instance that does not meet region requirement 2 for task legate::numpy::BinaryUniversalFunction<legate::numpy::AddOperation<double> >::NormalTask (ID 133). The index space for the instance has insufficient space for the requested logical region. (from file /legate.core/legion/runtime/legion/legion_tasks.cc:3149)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_67

Adopt existing NumPy test suites

We should at least pass the same tests that NumPy uses, potentially replicated at multiple scales. Some bugs only become visible when the array size is substantial.

Using a scalar in `allclose` raises an `AttributeError`

Problem

Using a scalar in allclose raises AttributeError: PROJ_1D_1D_.

To reproduce

  1. step 1: create test.py
    from legate import numpy as lnp
    import numpy as realnp
    
    # vanilla numpy works
    a = realnp.full(10, 1e-1)
    print(realnp.allclose(a, 1e-1))
    
    # legate numpy not working
    la = lnp.full(10, 1e-1)
    print(lnp.allclose(la, 1e-1))
  2. step 2: run test.py with legate --cpus 1 ./test.py -lg:numpy:test

Output

The first part that uses vanilla NumPy prints True.

The second part that uses Legate NumPy raises:

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 408, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 200, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 10, in <module>
    print(lnp.allclose(la, 1e-1))
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 459, in allclose
    return ndarray.perform_binary_reduction(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 2068, in perform_binary_reduction
    dst._thunk.binary_reduction(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 5167, in binary_reduction
    ) = self.runtime.compute_broadcast_transform(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/runtime.py", line 2500, in compute_broadcast_transform
    self.first_proj_id + getattr(NumPyProjCode, proj_name),
  File "<prefix>/lib/python3.8/enum.py", line 384, in __getattr__
    raise AttributeError(name) from None
AttributeError: PROJ_1D_1D_

Expected behavior

Working like vanilla NumPy, or raising an exception with a clear message of what is not supported.
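
Possible workaround until this is fixed (a sketch only; it simply avoids passing a Python scalar to allclose by expanding it to an array of matching shape first):

from legate import numpy as lnp

la = lnp.full(10, 1e-1)
# Compare against an array instead of a bare scalar.
print(lnp.allclose(la, lnp.full(10, 1e-1)))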

`array.__deepcopy__` does not have the correct call signature

Problem

When doing a deep copy (using Python's native copy module), an error is raised due to a missing argument in the call signature of array.__deepcopy__:
https://github.com/nv-legate/legate.numpy/blob/2b460c5dfdd60b673e37e25231bf625fdf3ead0e/legate/numpy/array.py#L303-L306

To reproduce

  1. step 1: create test.py:
    import copy
    from legate import numpy
    a = numpy.arange(100)
    b = copy.deepcopy(a)
  2. step 2: run with legate --cpus 1 ./test.py -lg:numpy:test

Error traceback:

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 4, in <module>
    b = copy.deepcopy(a)
  File "<prefix>/lib/python3.8/copy.py", line 153, in deepcopy
    y = copier(memo)
TypeError: __deepcopy__() takes 1 positional argument but 2 were given

Notes

  1. Though it's not shown in the traceback, once I add a second argument memo to the definition of array.__deepcopy__, the error is gone (i.e., changing line 303 in array.py from def __deepcopy__(self): to def __deepcopy__(self, memo):). However, I don't know if the result of the copying is correct or not.

  2. According to the last paragraph of copy's documentation, an additional argument (additional to self) is required in __deepcopy__:

    In order for a class to define its own copy implementation, it can define special methods __copy__() and __deepcopy__(). ... ... The latter is called to implement the deep copy operation; it is passed one argument, the memo dictionary. ...

  3. Though ndarray has its own .copy() method, the __copy__ and __deepcopy__ protocols from native Python are still useful. For example, when using a class:

    class DummyMesh:
        def __init__(self, bg, ed, n):
            self.vertices = numpy.linspace(bg, ed, n+1)

    it's easier to copy an instance using copy.deepcopy, e.g., grid_a = DummyMesh(0., 1., 10); grid_b = copy.deepcopy(grid_a). In this situation, the __deepcopy__ of the ndarray is triggered. Otherwise, users have to write more lines of code just to make a deep copy of the instance grid_a.

  4. From the definitions of the shallow copy array.__copy__ (line 298) and the deep copy array.__deepcopy__, it seems they are both doing the same thing. Is this also the default behavior in vanilla NumPy? Just curious about this.
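
For reference, a minimal sketch of the signature the copy protocol expects (an illustrative stand-in class, not the actual legate ndarray):

import copy

class ArrayLike:
    """Toy container that exposes the copy protocol the way copy.deepcopy expects."""

    def __init__(self, data):
        self.data = list(data)

    def copy(self):
        return ArrayLike(self.data)

    def __copy__(self):
        return self.copy()

    def __deepcopy__(self, memo):
        # memo is required by the protocol even if the implementation ignores it.
        return self.copy()

a = ArrayLike([1, 2, 3])
b = copy.deepcopy(a)
assert b.data == a.data and b is not a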

Support 2d array copies that use advanced indexing

Problem

When using a boolean array to access elements of another array (with a non-trivial size), it raises an error:

TypeError: nonzero() missing 1 required positional argument: 'stacklevel'

To reproduce

  1. step 1: create test.py with the following code:
    from legate import numpy
    qw = numpy.random.random((100, 100))
    qw[qw < 0.3] = 1.0
  2. step 2: run test.py with legate, e.g.,
    $ legate --cpus 1 ./test.py

Output

Traceback (most recent call last):
  File "<blahblah>/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 3, in <module>
    qw[qw < 0.3] = 1.0
  File "<blahblah>/lib/python3.8/site-packages/legate/numpy/array.py", line 753, in __setitem__
    self._thunk.set_item(key, value_array._thunk, stacklevel=2)
  File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 487, in set_item
    index_array = self._create_indexing_array(
  File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 324, in _create_indexing_array
    tuple_of_arrays = key.nonzero()
TypeError: nonzero() missing 1 required positional argument: 'stacklevel'

Expected output

Either it works or raises a NotImplementedError so that users know it has not yet been implemented.

Notes

  • Just like other advanced indexing, when using a smaller size/shape for qw in this example, everything works fine.
  • Also, using GPUs, e.g., legate --gpus 1 ./test.py, works fine.
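
A possible interim workaround (a sketch only; it assumes numpy.where is supported for this shape in the legate.numpy build being used) is to express the masked assignment without advanced indexing:

from legate import numpy

qw = numpy.random.random((100, 100))
# Equivalent to qw[qw < 0.3] = 1.0, but without boolean-mask indexing.
qw = numpy.where(qw < 0.3, 1.0, qw)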

Garbage collection not working properly

Problem

During loops, memory usage keeps growing when it should stay constant. This looks like a garbage-collection issue.

To reproduce

Option 1: using examples/stencil.py (take more time to see the crash)

  1. step 1: go to legate.numpy/examples/stencil.py
  2. step 2: run legate --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 20 (I lowered the system memory to make the out-of-memory error happen faster.)

The crash happens at about the 14th benchmark iteration, so to get a profiling result, change --benchmark to 13. That is, legate --profile --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 13 -lg:numpy:test -lg:inorder.

Here is the profiling result: legate_prof.tar.gz

Option 2: using custom code (faster to see the crash)

  1. step 1: create test.py
    from legate import numpy
    
    a0 = numpy.random.random((1004, 1004))
    b0 = numpy.random.random((1004, 1004))
    c0 = numpy.random.random((1004, 1004))
    
    counter = 0
    while True:
        a = a0.copy()
        b = b0.copy()
        c = c0.copy()
    
        for i in range(2):
            a[2:-2, i] = a[2:-2, 2].copy()
            b[2:-2, i] = b[2:-2, 2].copy()
            c[2:-2, i] = c[2:-2, 2].copy()
    
        for i in range(-3):
            a[2:-2, i] = a[2:-2, -3].copy()
            b[2:-2, i] = b[2:-2, -3].copy()
            c[2:-2, i] = c[2:-2, -3].copy()
    
        for i in range(2):
            a[i, 2:-2] = a[2, 2:-2].copy()
            b[i, 2:-2] = b[2, 2:-2].copy()
            c[i, 2:-2] = c[2, 2:-2].copy()
    
        for i in range(-3):
            a[i, 2:-2] = a[-3, 2:-2,].copy()
            b[i, 2:-2] = b[-3, 2:-2,].copy()
            c[i, 2:-2] = c[-3, 2:-2,].copy()
    
        counter += 1
        print(counter)
  2. step 2: run the script with legate --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py.

The out-of-memory error happened at the 935th iteration. So to get a profiling output, add if counter % 934 == 0: break after print(counter), and then run legate --profile --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py -lg:numpy:test -lg:inorder.

Here's the profiling result: legate_prof.tar.gz

Return order of nonzero elements differs from NumPy

Consider the following program:

a = np.arange(25).reshape((5,5))
print(np.nonzero(a > 17))

When run with vanilla NumPy (or legate.numpy on 1 CPU) the non-zero indices are returned in this order:

x = [3 3 4 4 4 4 4]
y = [3 4 0 1 2 3 4]

If instead we run with legate.numpy on 4 CPUs (using NUMPY_TEST to force legate.numpy to do distributed execution) (command line: NUMPY_TEST=1 legate nz-order.py -lg:numpy:test --cpus 4) we get:

x = [4 4 4 3 3 4 4]
y = [0 1 2 3 4 3 4]

I.e. legate.numpy returns the indices grouped by tile (see nz-order.pdf for a visualization), instead of returning them according to the global row-major order, as is guaranteed in the NumPy API. This is a side effect of how distributed nonzero is implemented. Making this work like NumPy would require a sort after every nonzero call.

We could simply decide to live with this incompatibility, since I expect most code using nonzero will not explicitly depend on the order that nonzero elements are returned in. The most likely scenario I can think of where this incompatibility would be problematic is if the user code mixes the results of different nonzero calls in the same operation:

import legate.numpy as np

# too small to be partitioned; indices will be in C order
small = np.ones((2,2))
small_is = np.nonzero(small)

# large enough to be partitioned; indices will be grouped by tile.
large = np.zeros((10000,10000))
large[2500,2500] = 2.0
large[2500,7500] = 3.0
large[2501,2500] = 4.0
large[2501,7500] = 5.0
large_is = np.nonzero(large)

small[small_is] = large[large_is]
print(small)
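
If we do decide to match NumPy, the post-processing sort could look like this (shown with vanilla NumPy for clarity; the flat indices are unique, so the ordering is well defined):

import numpy as np

def nonzero_row_major(arr):
    coords = np.nonzero(arr)                        # tuple of per-axis index arrays
    flat = np.ravel_multi_index(coords, arr.shape)  # C-order linear index of each hit
    order = np.argsort(flat)
    return tuple(c[order] for c in coords)

a = np.arange(25).reshape((5, 5))
x, y = nonzero_row_major(a > 17)
print(x)  # [3 3 4 4 4 4 4]
print(y)  # [3 4 0 1 2 3 4]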

View partitioning ignores top-level array and other views

Currently legate.numpy partitions the region backing an array view (slice) without considering other views or the top-level array. This can lead to tile misalignment, e.g. on the stencil benchmark, where the top-level "grid" array has one extra cell on each side compared to the "center" view. If asked to split a 37839 x 37839 grid into 4 tiles across the X dimension, legate.numpy will create the following partitions:

        tile 0  tile 1      tile 2       tile 3
center: 1-9460  9461-18920  18921-28380  28381-37837
grid:   0-9459  9460-18919  18920-28379  28380-37838

Notice that the boundaries are not aligned. This behavior has the potential to cause extra traffic, as we switch between the two partitions while working on the different arrays.

We could try to capture this case by partitioning regions for views following the top-level region partition. We would likely want to guard this optimization with a heuristic, to only apply it where it would actually be beneficial (e.g. the proposed "derived" partitioning is not horribly imbalanced, in which case we would prefer the original "equal" partitioning strategy for the view).

Note that this proposal doesn't cover the case where the partitioning of view A should be aligned with view B, but neither is related to the top-level array's partition. A lazy evaluation engine could make an informed decision even in this scenario, by waiting to tile until it has seen a number of tiled partitions that need to be made, at which point it can tile them together using something like the unification algorithm used by Regent's auto-parallelizer.

Catching out-of-bounds errors when doing advanced indexing

In NumPy, basic slices can lie partially or completely outside the bounds of an array, and the out-of-bounds indices are simply ignored. However, requesting out-of-bounds indices is an error when using advanced indexing:

>>> np.arange(10)[ 12:14 ]
array([], dtype=int64)
>>> np.arange(10)[ [12,13] ]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 12 is out of bounds for axis 0 with size 10

Currently in Legate such cases of advanced indexing are implemented using gather/scatter copies, where out-of-bounds indices are ignored, so we are not emitting an error for this case.

We have requested support for a check at the Realm level (see StanfordLegion/legion#1084), but even when this is implemented we may want to avoid it, as copies will be much faster without it.
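
An alternative is a Python-level check on the index array before issuing the gather/scatter copy. A rough sketch (vanilla NumPy, for illustration; it mirrors NumPy's error message and bounds rules, including negative indices):

import numpy as np

def check_index_bounds(indices, axis_size, axis=0):
    indices = np.asarray(indices)
    bad = (indices < -axis_size) | (indices >= axis_size)
    if bad.any():
        first = indices[bad].flat[0]
        raise IndexError(
            f"index {first} is out of bounds for axis {axis} with size {axis_size}"
        )

check_index_bounds([2, 5], 10)          # fine
try:
    check_index_bounds([12, 13], 10)    # out of bounds, as in the example above
except IndexError as e:
    print(e)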

Support partial-machine tiling

Currently legate.numpy will either tile an array across all the nodes in the machine, or place the entire array on a single node. Both strategies may be suboptimal in certain cases, e.g. the jacobi benchmark, where the following matrix-vector multiplication happens repeatedly every timestep (R is a NxN matrix, all other variables are Nx1 vectors):

x = (b - np.dot(R, x)) / d

The optimal partitioning for the vectors is to split them across sqrt(N) nodes, one for each column tile of the matrix R.

Besides providing the mechanism for this, someone needs to advise legate.numpy what the optimal partitioning is for each operation's output array. Doing this requires looking ahead at how this array is later used, and thus requires lazy evaluation.

Build error - comparison to zero in scalar_unary_red_omp.cc - both g++ and clang++

Summary:
There are 2 instances of what appears to be a comparison of a complex/bool/float (generic 'auto' variable) to zero in unary/scalar_unary_red_omp.cc. g++ and clang++ both fail, saying there's no match for the != operator with the provided types.

unary/scalar_unary_red_omp.cc:130:77: error: no match for 'operator!=' (operand types are 'const std::complex<float>' and 'int')

I've built legate.core (without CUDA, see the 'aside' at the bottom of this post) from source, have a pre-installed OpenBLAS, and get this error when building legate.numpy on both OSX and Ubuntu.

Instances of zero comparison (unary/scalar_unary_red_omp.cc). Both trigger compiler errors
1:

  130 |         for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;

2:

137         for (size_t idx = 0; idx < volume; ++idx) {
138           auto point = pitches.unflatten(idx, rect.lo);
139           locals[tid] += in[point] != 0;

Command
On ubuntu (OSX command is similar):

python3 install.py --with-core /home/shivneural/legate/legate.core/target --with-openblas /usr/lib/x86_64-linux-gnu/openblas-pthread/

Environment:
I'm encountering this error on both OSX and Ubuntu (and have tried a few different compilers):

  1. OSX 10.15.7 Catalina
    Compilers tried:
    a. clang++ version 12.0.0.
    b. clang++ Apple LLVM version 7.0.2 (clang-700.1.81)
    c. g++-11 (Homebrew GCC 11.1.0) 11.1.0 (fails due to different errors).

  2. Ubuntu 20.04.2 LTS (Focal Fossa)
    Compilers tried:
    a. g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0.

Full error:

g++ -o unary/unary_red_omp.cc.o -c unary/unary_red_omp.cc   -fopenmp -I/home/shivneural/legate/legate.core/install/thrust  -I. -I/usr/lib/x86_64-linux-gnu/openblas-pthread/include -std=c++14 -Wfatal-errors -I/home/shivneural/legate/legate.core/target/include -O2 -fno-strict-aliasing  -DLEGATE_USE_CUDA -I/include -fPIC -DLEGATE_USE_OPENMP
unary/scalar_unary_red_omp.cc: In instantiation of 'void legate::numpy::ScalarUnaryRedImplBody<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO, CODE, DIM>::operator()(uint64_t&, legate::AccessorRO<typename legate::LegateTypeOf<CODE>::type, DIM>, Legion::Rect<N>&, const legate::numpy::Pitches<(DIM - 1)>&, bool) const [with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; uint64_t = long unsigned int; legate::AccessorRO<typename legate::LegateTypeOf<CODE>::type, DIM> = Legion::FieldAccessor<LEGION_READ_PRIV, std::complex<float>, 1, long long int, Realm::AffineAccessor<std::complex<float>, 1, long long int>, false>; typename legate::LegateTypeOf<CODE>::type = std::complex<float>; Legion::Rect<N> = Realm::Rect<1, long long int>]':
./unary/scalar_unary_red_template.inl:132:75:   required from 'legate::numpy::UntypedScalar legate::numpy::ScalarUnaryRedImpl<KIND, legate::numpy::UnaryRedCode::COUNT_NONZERO>::operator()(legate::numpy::ScalarUnaryRedArgs&) const [with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; legate::numpy::VariantKind KIND = legate::numpy::VariantKind::OMP]'
/home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:67:40:   required from 'constexpr decltype(auto) legate::inner_type_dispatch_fn<DIM>::operator()(legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor = legate::numpy::ScalarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; Fnargs = {legate::numpy::ScalarUnaryRedArgs&}; int DIM = 1; legate::LegateTypeCode = legate_core_type_code_t]'
/home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:141:41:   required from 'constexpr decltype(auto) legate::double_dispatch(int, legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor = legate::numpy::ScalarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; Fnargs = {legate::numpy::ScalarUnaryRedArgs&}; legate::LegateTypeCode = legate_core_type_code_t]'
./unary/scalar_unary_red_template.inl:167:27:   required from 'legate::numpy::UntypedScalar legate::numpy::scalar_unary_red_template(legate::TaskContext&) [with legate::numpy::VariantKind KIND = legate::numpy::VariantKind::OMP]'
unary/scalar_unary_red_omp.cc:151:61:   required from here
unary/scalar_unary_red_omp.cc:130:77: error: no match for 'operator!=' (operand types are 'const std::complex<float>' and 'int')
  130 |         for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;
      |                                                                  ~~~~~~~~~~~^~~~
compilation terminated due to -Wfatal-errors.
make: *** [/home/shivneural/legate/legate.core/target/share/legate/legate.mk:200: unary/scalar_unary_red_omp.cc.o] 
Error 1

Changing the 0 to a std::complex<float>(0.0f,0.0f) emits an error that the operand types are a bool and complex float.

Perhaps I'm configuring something incorrectly, in which case any guidance is appreciated.

(Aside: My Ubuntu machine is a GCP instance w/ a T4 GPU, running CUDA 10.1. When kicking off a legate.core with-CUDA build, it fails because it can't recognize the "__habs" half-precision function when building Legion:
legate/legate.core/legion/runtime/mathtypes/half.h(364): error: identifier "__habs" is undefined. It looks like the T4's Turing architecture isn't one of legate's supported platforms, but AFAIK Turing supports half precision.)
Thanks

Handle overlapping sub-arrays in copies

Currently every array copy statement is translated into a single Legion copy operation or a single task launch. If the LHS and RHS in the copy statement refer to the same array and the slices overlap, e.g. for a[0:2] = a[1:3], then we will get a runtime error due to aliasing of the region requirements in the emitted operation.

This operation works fine in vanilla NumPy. To make it work in legate.numpy we would need to copy the RHS into an intermediate array, and copy from that into the LHS.

For basic copy statements it is possible we can come up with a cheap check to accurately detect overlap, and thus decide if the intermediate array is necessary (see #39). For advanced copy statements, however, this check would be much more expensive (since the set of affected indices is data-dependent), thus we should always use an intermediate array.
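
In vanilla NumPy terms, the intermediate-array approach amounts to the following (a sketch of the semantics, not of the legate implementation):

import numpy as np

a = np.arange(5)
tmp = a[1:3].copy()   # materialize the RHS first, breaking the LHS/RHS aliasing
a[0:2] = tmp
print(a)              # [1 2 2 3 4]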

The combination of `reshape` and `tile` crashed

Problem

When using numpy.tile, if the input array was obtained from reshape, the program crashes at the Legion level, i.e., with no Python error traceback.

To reproduce

  1. step 1: create test.py
    from legate import numpy
    
    # this works
    print(numpy.tile(numpy.array([[2], [1]], dtype=numpy.float64), (1, 10)))
    
    # this does not work
    print(numpy.tile(numpy.arange(2, 0, -1, dtype=numpy.float64).reshape((2, 1)), (1, 10)))
  2. step 2: run legate --cpus 1 ./test.py -lg:numpy:test -lg:inorder

Output

The first print correctly prints

[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

However, the second print failed because the program crashed. The error:

[0 - 7f08045c77c0]    0.807842 {5}{runtime}: [error 164] LEGION ERROR: Dynamic type mismatch in 'get_index_space_domain' (from file <prefix>/legate.core/legion/runtime/legion/region_tree.inl:3213)
For more information see:
http://legion.stanford.edu/messages/error_code.html#error_code_164

Signal 6 received by node 0, process 398141 (thread 7f08045c77c0) - obtaining backtrace
Signal 6 received by process 398141 (thread 7f08045c77c0) at: stack trace: 14 frames
  [0] = /usr/lib/libpthread.so.0(+0x13960) [0x7f08045a4960]
  [1] = /usr/lib/libc.so.6(gsignal+0x145) [0x7f080410aef5]
  [2] = /usr/lib/libc.so.6(abort+0x116) [0x7f08040f4862]
  [3] = <prefix>/lib/liblegion.so(+0x7b9a36) [0x7f0805f09a36]
  [4] = <prefix>/lib/liblegion.so(Legion::Internal::IndexSpaceNodeT<1, long long>::get_index_space_domain(void*, unsigned int)+0x77) [0x7f0805fbca67]
  [5] = <prefix>/lib/liblegion.so(Legion::Internal::PhysicalRegionImpl::get_instance_info(legion_privilege_mode_t, unsigned int, unsigned long, void*, unsigned int, char const*, bool, bool, bool, int)+0x26e) [0x7f0805f1102e]
  [6] = <prefix>/lib/liblgnumpy.so(Legion::FieldAccessor<(legion_privilege_mode_t)1, double, 2, long long, Realm::AffineAccessor<double, 2, long long>, false> legate::LegateDeserializer::unpack_accessor_RO<double, 2>(Legion::PhysicalRegion const&, Realm::Rect<2, long long> const&)+0x244) [0x7f06e9cd0094]
  [7] = <prefix>/lib/liblgnumpy.so(legate::numpy::TileTask<double>::cpu_variant(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)+0x17f) [0x7f06eb028fdf]
  [8] = <prefix>/lib/liblgnumpy.so(void Legion::LegionTaskWrapper::legion_task_wrapper<&(void legate::LegateTask<legate::numpy::TileTask<double> >::legate_task_wrapper<&legate::numpy::TileTask<double>::cpu_variant>(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*))>(void const*, unsigned long, void const*, unsigned long, Realm::Processor)+0x50) [0x7f06eb0334f0]
  [9] = <prefix>/lib/librealm.so(+0x2ae179) [0x7f080488f179]
  [10] = <prefix>/lib/librealm.so(+0x2ae236) [0x7f080488f236]
  [11] = <prefix>/lib/librealm.so(+0x2b0b28) [0x7f0804891b28]
  [12] = <prefix>/lib/librealm.so(+0x29001a) [0x7f080487101a]
  [13] = /usr/lib/libc.so.6(+0x52540) [0x7f0804120540]

Expected

Either

[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

Or NotImplementedError.

Handle views in advanced indexing

Copies that involve advanced indexing are implemented with a scatter/gather copy. The current code ignores the transforms on the RegionFields, i.e. it doesn't take into account the translation from NumPy (local) index space (every view's indices start from 0) to Legion (global) index space (every subregion's indices start wherever that subregion is placed within the parent region).

In the general case the base, index and value arrays can all be views:

a = np.arange(10)
b = np.arange(10)
c = np.arange(9, -1, -1)
x = a[2:7][ c[5:10] ] # __get_item__
a[2:7][ c[5:10] ] = b[3:8] # __set_item__

The logic to handle the general case of __get_item__ might be:

  • use the RegionField backing the newly created result array (which has no transform) as dst in the gather/scatter copy
  • use root of base array as src
  • if neither the base nor the index array are views, use index array's RegionField as src_indirect
  • if only the index array is a view, materialize it as a new array, getting rid of any transforms (like doing c[5:10] * 1 above), use the final array as src_indirect
  • otherwise, apply the base array's transform (\x. x+2 in the example above) on each element of the index array (the transform operation will naturally get rid of the index's transform, if any), use the output array as src_indirect

And for __set_item__:

  • use value array's RegionField as src in the gather/scatter copy
  • if the value array is a view, create a new field to materialize its transform (stores the mapping 0:5 -> 3:8 for the example above), use that as src_indirect
  • use root of base array as dst
  • if neither the base nor the index array are views, use index array's RegionField as dst_indirect
  • if only the index array is a view, materialize it as a new array, getting rid of any transforms (like doing c[5:10] * 1 above), use the final array as dst_indirect
  • otherwise, apply the base array's transform (\x. x+2 in the example above) on each element of the index array (the transform operation will naturally get rid of the index's transform, if any), use the output array as dst_indirect

The work on StanfordLegion/legion#705 would allow us to avoid materializing at least some of these fields.
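
A vanilla-NumPy illustration of the index-space translation described above (the \x. x+2 transform for the a[2:7] view), which is what the gather copy has to reproduce in Legion's global index space:

import numpy as np

a = np.arange(10)
c = np.arange(9, -1, -1)

idx = c[5:10]                 # index array that is itself a view: [4 3 2 1 0]
via_view = a[2:7][idx]        # __getitem__ through the view
via_base = a[idx + 2]         # same gather expressed on the base array
assert np.array_equal(via_view, via_base)
print(via_view)               # [6 5 4 3 2]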

Edit: add some more detail, enumerate more cases, fix typos

Memory grows on top of the memory allocated through `--fbmem`

I encountered a memory issue that is different from issue #33: runtime memory usage grows beyond the value of --fbmem, while the memory usage in the profiling result remains constant. I'm not sure whether I misconfigured something or not.

An example is the cg.py from legate.numpy/example. Run cg.py on a A100 80GB-variant with

NUMPY_FIELD_REUSE_FREQ=1 \
    legate --gpus 1 --fbmem 80000 --eager-alloc-percentage 1 \
        ./cg.py --num 235 --benchmark 10

The program crashed at the 3rd benchmark run with this error message: Internal Legate CUBLAS failure with error code 13 in file dot.cu at line 587. It doesn't say anything about memory, but when I monitored the runtime memory usage through nvidia-smi, the memory grew over time, and the program crashed when the memory ran out.

The first thing I don't understand is that the memory grew on top of the memory allocated through --fbmem. The second thing is that the profiling result does not show any memory growth; it remains constant.

The CUDA (including cublas) version is 11.3.0.

(Lowering --fbmem to leave room for the memory growth also allows more benchmark runs. For example, --fbmem 75000 allows all 10 benchmark runs to finish.)

Uninitialized data warnings on advanced indexing copies

The following program:

from legate import numpy
a = numpy.arange(50)
indices = numpy.arange(10)
print(a[indices])

causes the runtime to emit this warning:

[0 - 7fdf65929700]    2.045884 {4}{runtime}: [warning 1071] LEGION WARNING: Region requirement 1 of operation Copy (UID 5) in parent task legion_python_main (UID 1) is using uninitialized data for field(s) 1048578 of logical region (2,1,2) (from file /gpfs/fs1/mpapadakis/legate.core/legion/runtime/legion/legion_ops.cc:1192)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1071

Need a way to pass build flag to OpenBLAS

My CPU is Intel's Comet Lake, which cannot be detected by OpenBLAS v0.3.10 automatically (see OpenMathLib/OpenBLAS#2769). The solution suggested by OpenBLAS is either providing a flag TARGET=... or using a later version. However, install.py has hard-coded the OpenBLAS version to v0.3.10 and does not provide any way to pass custom flags to the OpenBLAS build.

Currently, I'm building my own OpenBLAS. I'm just thinking it might be nicer to have a way to provide custom OpenBLAS flags to install.py? Or at least add some notes in the README to let users know they have to build/install their own OpenBLAS if they have newer CPUs?

Thanks!

Force computation completion

Is there a way to programmatically force Legate to wait for the completion of all pending operations? In the examples, the way to go is basically to read the output, i.e., assert not math.isnan(np.sum(output)). Is there a different way that doesn't incur the penalty of accessing all output elements?

linspace silently failed when using GPU without any runtime error

Problem

numpy.linspace does not work correctly and returns an all-zero array without throwing any runtime errors or warnings.

To reproduce

Option 1: using legate.numpy's own tests

  1. step 1: go to legate.numpy's source code folder
  2. step 2: run ./test.py --use cuda --gpus 1

Option 2:

  1. step 1: create test.py
    from legate import numpy
    a = numpy.linspace(0.0, 4.0, 501)
    print(a.mean(), a.sum())
    print(a)
  2. step 2: run the script with legate --gpus 1 ./test.py -lg:numpy:test

Results

When using legate.numpy's own test suite, the results show that the test for linspace failed.

When using the test script in option 2, the result shows an all-zero array.

Expected results

I noticed linspace is not listed in the Legate NumPy API reference, so I think linspace belongs to the group of functions that are not implemented yet? In that case, the function should raise NotImplementedError instead of silently returning an all-zero array.

Non-deterministic wrong result from tensordot

This program (derived from tests/tensordot.py):

import legate.numpy as lg
import numpy as np

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(4, 5, 3).astype(np.float16)

a = lg.random.rand(3, 5, 4).astype(np.float16)
b = lg.random.rand(5, 4, 3).astype(np.float16)
cn = np.tensordot(a, b)
print('cn', flush=True)
print(cn, flush=True)
c = lg.tensordot(a, b)
print('c', flush=True)
print(c, flush=True)

assert np.allclose(cn, c)

when run as follows:

LEGATE_TEST=1 legate 79.py -lg:numpy:test --cpus 4

fails about 20% of the time, with:

cn
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.914]]
c
[[4.07  4.83  5.01 ]
 [4.2   4.562 5.863]
 [4.344 4.52  3.916]]
[0 - 700005133000]    0.946367 {6}{python}: python exception occurred within task:
Traceback (most recent call last):
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 410, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 234, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "79.py", line 16, in <module>
    assert np.allclose(cn, c)
AssertionError

Cannot use numpy.int32 or numpy.int64 as indices to access single array element

Problem

When using numpy.int32 or numpy.int64 as indices to get an element from a 2D array, the code hits an unimplemented codepath and raises NotImplementedError. However, if the indices are converted to native int, everything works.

I'm not sure if this is just an unimplemented feature or if something's wrong, as this should be basic indexing, with no advanced indexing involved.

numpy.int64 seems to be the default type of the elements returned when looping over a NumPy integer array, so I feel this is a common use case.

To reproduce

  1. step 1: create two python scripts: test_1.py and test_2.py
    • In test_1.py
      from legate import numpy
      a = numpy.random.random((1000, 2000))
      idx = numpy.random.randint(0, 99, 10, int)
      idy = numpy.random.randint(0, 99, 10, int)
      
      for i, j in zip(idx, idy):
          print("index type: ({}, {}); ".format(type(i), type(j)), end="")
          print("a[i, j] = {}".format(a[i, j]))
    • In test_2.py
      from legate import numpy
      a = numpy.random.random((1000, 2000))
      idx = numpy.random.randint(0, 99, 10, int)
      idy = numpy.random.randint(0, 99, 10, int)
      
      for i, j in zip(idx, idy):
          i, j = int(i), int(j)
          print("index type: ({}, {}); ".format(type(i), type(j)), end="")
          print("a[i, j] = {}".format(a[i, j]))
  2. step 2: run both scripts
    $ legate --cpus 1 ./test_1.py
    and
    $ legate --cpus 1 ./test_2.py

Expected behavior

test_1.py and test_2.py should both output 10 lines of the form index type: (class XXX, class XXX); a[i, j] = YYYYYYYYYYY. XXX is either numpy.int64 or int, depending on whether it's test_1.py or test_2.py, while YYYYYYYYYYY is a random number.

Actual output

test_1.py reports an error (but successfully prints the first part of the message on the first line, i.e., index type: ...):

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test_1.py", line 8, in <module>
    print("a[i, j] = {}".format(a[i, j]))
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 381, in __getitem__
    shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 343, in get_item
    index_array = self._create_indexing_array(
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 332, in _create_indexing_array
    raise NotImplementedError("need support for concatenating arrays")
NotImplementedError: need support for concatenating arrays
index type: (<class 'numpy.int64'>, <class 'numpy.int64'>);

Dockerfiles removed

You have removed "outdated" dockerfiles. Are there plans to restore them?

Forcing code to run on GPU

I believe I have built legate.core and legate.numpy correctly. I can run "legate" and get a prompt.

I used the last example from https://github.com/barbagroup/CFDPython/blob/master/lessons/15_Step_12.ipynb as mentioned on the github page. I used the default grid of 41 x 41 with nit=250 to get a longer runtime. However, I can't seem to get the code to run on the GPU. I put the code into a file and run it as "legate cfd.py".

I'm running this on a laptop with a 4GB GeForce 1650 GPU (I built using the Volta architecture). When I use the option "--gpus 1" it tells me I don't have enough memory. Is the number after "--gpus" referring to the number of GPUs, or does it refer to device numbers? When I tried "--gpus 0", thinking it referred to the device number, the code runs, but at the same speed as the CPU, and "nvidia-smi" never shows the code running on the GPU.

BTW - I built legate-core using the following command:

./install.py --cuda --with-cuda /opt/nvidia/hpc_sdk/Linux_x86_64/21.3/ --arch volta --install-dir /usr/local/legate

I built legate.numpy using the following command:

python setup.py --with-core /usr/local/legate

BTW - is there a way to check that legate was built with GPU support, beyond just running a code and trying to force it to run on a GPU?

Thanks!

Jeff

arange fails when `start` + `step` is greater than or equal to `stop`

Problem

Something like legate.numpy.arange(1, 3, 5) throws an error: AttributeError: 'Future' object has no attribute 'compute_parallel_launch_space'. (Here, start is 1, stop is 3, and step is 5. See the signature from vanilla NumPy.)

To reproduce

  1. step 1: create test.py:
    from legate import numpy as legatenumpy
    import numpy as truenumpy
    t_start, t_end, dt = 0, 1, 2
    t_legate = legatenumpy.arange(t_start, t_end, dt)
    t_numpy = truenumpy.arange(t_start, t_end, dt)
    print(truenumpy.allclose(t_legate, t_numpy))
    print(t_legate)
    print(t_numpy)
  2. step 2: run with legate --cpus 1 ./test.py -lg:numpy:test

Expected result

Either

True
[0]
[0]

or NotImplementedError, or an error message indicating this specific use case of arange is not currently supported.

Actual result

Traceback (most recent call last):
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 6, in <module>
    t_legate = legatenumpy.arange(t_start, t_end, dt)
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 61, in arange
    result._thunk.arange(start, stop, step, stacklevel=(stacklevel + 1))
  File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 2338, in arange
    launch_space = dst.compute_parallel_launch_space()
AttributeError: 'Future' object has no attribute 'compute_parallel_launch_space'

Notes

  • Only throws the error when using -lg:numpy:test.
  • This use case of arange is probably an edge case in most applications.
  • In a production run (without -lg:numpy:test), I guess this error will probably never be triggered, because this use case creates a one-element array and does not go down the Legion codepath?

`uninitialized data for field` error

It seems that one of the recent commits (possibly e24dbdd) introduced the following error for some codes:

[0 - 7f743c098700]   10.165263 {5}{runtime}: [error 68] LEGION ERROR: Region requirement 1 of operation legate::numpy::NoncommutativeBinaryUniversalFunction<legate::numpy::SubtractOperation<double> >::NormalTask (UID 164) in parent task legion_python_main (UID 1) is using uninitialized data for field(s) 1048579 of logical region (16,1,1) with read-only privileges (from file /gpfs/fs1/mzalewski/repos/quickstart-collection/legate.core/legion/runtime/legion/legion_ops.cc:1170)

Document lg:numpy command-line options

Currently legate.numpy will remove any argument it recognizes from the command line when the Runtime singleton is constructed (i.e. when the legate.numpy module is loaded):

https://github.com/nv-legate/legate.numpy/blob/224de8f7d61d4ef3cecc6bf1b8cd476c8b3f88cf/legate/numpy/runtime.py#L1037-L1054

This matches the way legate.core and Legion/Realm work, with each layer removing the arguments it recognizes. Legate.core does this through the launcher script, for which we automatically get documentation from argparse. Legate.numpy's arguments, however, are only known to the Runtime constructor, and not documented anywhere else. -lg:numpy:test and -lg:numpy:shadow are developer options so I don't think they need to be publicly documented, but AFAIK -lg:numpy:summarize is a user-targeted option, so we likely want to document it somehow (ideally in the launcher script, if we detect that the legate.numpy module has been loaded).
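For illustration, a minimal sketch of the "each layer strips the flags it recognizes" pattern described above (the helper is illustrative only; the real parsing lives in the runtime.py lines linked above):

import sys

RECOGNIZED = {"-lg:numpy:test", "-lg:numpy:shadow", "-lg:numpy:summarize"}

def consume_numpy_flags(argv):
    # Split argv into the flags this layer handles and the ones it passes on.
    seen = {arg for arg in argv if arg in RECOGNIZED}
    remaining = [arg for arg in argv if arg not in RECOGNIZED]
    return seen, remaining

flags, remaining = consume_numpy_flags(sys.argv)
sys.argv[:] = remaining  # later layers only see the arguments we did not consume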

Test suite additional configurations

Cunumeric-specific additional configurations to run our existing tests (in addition to generic configurations listed in nv-legate/legate.core#27):

  • Run without partitioning (currently we always force partitioned execution with LEGATE_TEST=1). Recent versions of legate.core automatically choose to use the non-partitioned codepath if there is only one processor available.
  • Run in eager mode (currently we always force deferred mode with -cunumeric:test). We may want to have a separate flag to force eager mode.
  • Run in hybrid deferred/eager mode (mirroring what the user would encounter in practice). We may need to adjust our settings to ensure that transitions between eager and deferred arrays actually occur.
  • Run with shadow debugging enabled. This option has been removed.

Redirecting outputs to overlapping slices of inputs is buggy

I believe examples like the following would raise an interfering requirement error, in both master and branch-21.10:

legate.numpy.add(x[1:5], 1, out=x[2:6])

We need to fix the operators whose outputs can be redirected such that the intermediate results are materialized before getting assigned to the designated arrays. The in-place update code already handles this using the alias check on Legate Stores, so we can reuse that code.
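A minimal sketch of the intended semantics, shown with plain NumPy for illustration: compute into a temporary first, then assign into the (possibly overlapping) output slice.

import numpy as np

x = np.arange(10)
tmp = np.add(x[1:5], 1)  # materialize the intermediate result
x[2:6] = tmp             # safe even though x[1:5] and x[2:6] overlap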

Realm not completing gather copy on the GPU

Problem

Advanced indexing of a relatively huge (e.g., length 10K) 1D array returns UnboundLocalError: local variable 'shardfn' referenced before assignment, rather than NotImplementedError.

I understand that advanced indexing is mostly not yet implemented, and most related routines raise NotImplementedError to let users know about this situation. However, this particular use case raises a different error, which looks like a bug to me.

To reproduce

  1. step 1: prepare test.py:
    from legate import numpy
    a = numpy.arange(10000)
    print(a[(1, 2, 3), ]) 
  2. step 2: run with, for example
    $ legate --cpus 1 test.py

Output

Traceback (most recent call last):
  File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
    run_path(args[start], run_name='__main__')
  File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
    exec(code, module.__dict__, module.__dict__)
  File "./test.py", line 3, in <module>
    print(a[(1, 2, 3), ])
  File "<blahblah>/lib/python3.8/site-packages/legate/numpy/array.py", line 381, in __getitem__
    shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
  File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 414, in get_item
    copy = Copy(mapper=self.runtime.mapper_id, tag=shardfn)
UnboundLocalError: local variable 'shardfn' referenced before assignment

Expected results

Either [1, 2, 3] or NotImplementedError.

Notes

  • Interestingly, smaller arrays do not have this issue. For example, if a = numpy.arange(100), the code works fine.
  • Another way to make it work is to use GPUs instead of CPUs. For example, legate --gpus 1 test.py works fine. This is interesting, as the GPU implementation seems to be more stable than the CPU implementation?
