bears-r-us / arkouda Goto Github PK

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

License: Other

Chapel 37.54% Python 57.86% Jupyter Notebook 1.11% Shell 0.25% Makefile 0.38% Batchfile 0.01% Gnuplot 0.87% C 0.03% C++ 1.95%

chapel python eda hpc data data-science data-analysis distributed-computing

arkouda's People

Contributors

Stargazers

Watchers

Forkers

spartee bradcray ronawho oplambeck shamisp cdrickett slnguyen e-kayrakli lydia-duncan dlongnecke-cray awthomp rahulghangas hokiegeek2 ben-albrecht daviditen zhuangzzzi brendapraggastis bmcdonald3 rolandqli mppf vasslitvinov abhi5658054 nguyendo24 ssempax1 lnikon zhihuidu bader-research kaydoh glitch stress-tess prashanth314 ksgoodi narenkhatwani vishalbelsare chinmay-ghosh rajendra7406 louisjenkinscs pl4tt ethan-debandi99 mostroski milthorpe joshmarshall1 compiling-is-winning jeichert60 adbmd dcherian whoiscnu sdbachman jaketrookman sotiriskaragiannis jeremiah-corrado rtnine danipmax 21771 j-r-jones jabraham17 jakobtroidl theoddczar789 tgstevensonredrocket egelberg arezaii jemeljanov sysharma brandon-neth shreyaskhandekar ericlavin7 ego lkampoli ajpotts palina-pauliuchenka danilafe fatemehramezanii stoutes tejasingam drculhane keshavaspanda latesnow joshaghani90 itsquinnmoore bhurwitz33 wlav

arkouda's Issues

Better document how to run tests

test needs a README.md to show how to run tests

Add pdarray.min() and pdarray.max()

add function to return the data set names from HDF5 file

This could be python-side only functionality but in the future we prob need this on the chapel-side

Zero Raised to Negative Number

Say you have array a = [0,1,2] and b = [-2,-2,-2]
In Python, you perform a**=b
The resulting array is [inf, 0.5,0.25]

Python would seem to allow for this behavior by not throwing an error, but returning "inf". However, in our OperatorMsg.chpl we don't allow division by zero.

We can either leave it be, or add an || reduce to see if any element is 0, and || reduce to see if any element is negative. This may be wasteful/expensive iteration though.

Int Raised to Negative Number

binopvv Int ** Int currently checks to see if anything in r.a is <0

when "starstar" {
if || reduce (r.a<0){
//instead of error, could we paste the below code but of type float?
return "Error: Attempt to exponentiate base of type Int64 to negative exponent";
}
st.addEntry(rname, l.size, int);
var e = toSymEntry(st.lookup(rname), int);
e.a= l.ar.a;
}

Could we modify the current code to:
when "starstar" {
if || reduce (r.a<0){
st.addEntry(rname, l.size, real);
var e = toSymEntry(st.lookup(rname), real);
e.a= l.ar.a;
}
else{
st.addEntry(rname, l.size, int);
var e = toSymEntry(st.lookup(rname), int);
e.a= l.ar.a;
}
}

Add ak.argmax and ak.argmin

add argmax(maxloc) and argmin(minloc) reductions

link FFTW/MPI against arkouda_server

demonstrate chapel/MPI linkage using FFTW/MPI... as a model for how do do this with other MPI based libs. example code is in toys/fftw-mpi.chpl

Add more types like ak.uint64

add uint64 or decide (prob not) to use uint64 or int64 only
types like uint16, uint32, float32

Look into weakref for python to track all pdarrays and objects that reference server-side objects

Eliminate copies from ak.arange(), etc.

Current pattern (see, for example, MsgProcessing.chpl in proc arange):

var aD = makeDistDom(len);
var a = makeDistArray(len, int);
forall i in aD {
    a[i] = i;
}
st.addEntry(new shared SymEntry(a));

This creates the SymEntry by copying a, in order to avoid accessing the SymEntry within the forall, which hotspots on the class instance.
However, I think we can avoid the hotspotting with a task-private alias. Would the following work?

var e = new shared SymEntry(len, int);
forall i in e.aD {
    var a = e.a;
    a[i] = i;
}
st.addEntry(e);

Efficient strategy for ak.in1d() with large set

Current strategy for ak.in1d(A, B) is basically:

|| reduce [b in B] A == b;

This is the best strategy for small B, but doesn't scale to large B. The strategy numpy uses for large B is essentially (pseudocode):

uA, uAinv = find_segments(A, return_inverse=True)
uB = unique(B)
together = concat(uA, uB)
perm = argsort(together)
sorted = together[perm]
matches = (sorted[1:] == sorted[:-1]) + [False]
ret = empty(A.shape, dtype=bool)
ret[perm] = matches
return ret[uAinv]

This could be done on a per-locale basis.

Another way would be to convert B to a lock-free associative domain, and then do

[a in A] Bset.contains(a);

ID scans of bool cast to int

1D Scans in 1.19.0 of bool array cast to int need a copy made to parallelize

CHGL Integration

Currently, there is a prototype that follows very closely with how Arkouda implements their pdarray and their server. This proof of concept, I believe, demonstrates the kind of functionality CHGL would require.

https://github.com/pnnl/chgl/blob/master/src/CHGL-Client.py

https://github.com/pnnl/chgl/blob/master/src/CHGL-Server.chpl

I would like to direct further updates towards building alongside if not on top of Arkouda. @mhmerril

add checking for valid arguments in the arkouda HDF5 based I/O

add at a minimum a check for file existence and field existence in the HDF5 file.

Need python3 tests to check arkouda against numpy/pandas

Enhancements to consider:

adding more sanity checks to the check.py
creating a new randomized test for some feature
others kinds... use your imagination ;-)

Add setitem(slice) scatter indexing a[slice] = val or pdarray

integer indexing of pdarrays should accept numpy scalars

This should be made to work:

assert array[np.int64(42)] == array[42]

HDF5 I/O crashed on mike's mac

HDF5 file I/O works fine when one file is specified but
when two or more files are specified the program crashes on mike's mac.
This does not seem to be the case on the sherlock IB linux cluster.

This could have any one of several causes...

bug in arkouda
bug in chapel llvm build
bug in HDF5 module
i don't know

Mike.

Add basic authentication

Currently, a running arkouda server can be hijacked by anyone on the system that knows the hostname and port. We should implement some sort of authentication, either via a password or a token. My vote is for a token, because it's what Jupyter uses, and it's less of a burden on the user.

arkouda_server.chpl should:

generate a cryptographically secure random token (probably with Crypto.CryptoBuffer
print the token upon server startup (like jupyter notebook)
save the token (and hostname:port) to a user-readable-only file like ~/.arkouda-running-servers, so that a user can find how to reconnect to a running server (jupyter also does something like this)
check for the correct token in the "connect" message

For convenience, ak.connect() should be modified to accept the raw connection string that arkouda_server prints on startup with the hostname, port, and token.

optimize ak.in1d()

optimize ak.in1d for cases when the second pdarray is much smaller than the first pdarray.

fix boolean binop return types

return types should mirror those of numpy

Add more operators to bool, int64, and float64

binary operators and operator=

Add ak.argsort() to mirror np.argsort()

iv:pdarray:int64 = ak.argsort(pdarray)
returns an index vector which sorts the original pdarray
currently only needed for int64
toys/CountingSort.chpl has a start at this

optimize ak.histogram

possibly use PrivateDist or ReplicatedDist if the number of bins is less than a million or so.
this is to remove the need for network atomics.

Two efuncMsg()s?

Hi Arkoudans —

While looking through source tonight, I noticed that there are two implementions of efuncMsg() that appear to be potential duplicates, one in EfuncMsg.chpl, the other in MsgProcessing.chpl. I didn't look much further than that but wanted to mention it before forgetting in case it's an oversight.

make ak.histogram return same as np.histogram?

It would be nice to write

bins, counts = ak.histogram(array, 100)

where bins and counts are returned as numpy arrays by default, not pdarrays. Then it is simple to plot the histogram with matplotlib:

from matplotlib import pyplot as plt
plt.bar(bins, counts, width=1)

There could be a keyword argument that tells arkouda to return pdarrays instead of numpy arrays.

min scan bug

It seems like there might be a bug in min scan. I tested with this code:

use BlockDist;
var D = {0..#10};
var DD: domain(1) dmapped Block(boundingBox=D) = D;
var A: [DD] int = [5, 4, 3, 2, 1, 0, 1, 2, 3, 4];
writeln(min scan A);
// Expected output: [5 4 3 2 1 0 0 0 0 0]

compiled with

chpl -senableParScan --fast minscanexample.chpl

and got

./minscanexample -nl 4
-2 -9223372036854775804 -9223372036854775805 -9223372036854775806 1 -9223372036854775808 1 1 3 3

Output of max scan is also weird.

Benchmarks for arkouda vs numpy performance

@mhmerrill Do you have or could you generate benchmarks indicating the performance improvement that arkouda provides over numpy?

Fix ak.arange to behave like numpy.arange

Add support for rank-2 arrays to support matrix operations

I'm wondering what it would take to support matrix operations on 2D arrays, as HNX currently uses this extensively. I would like to contribute this but would like to know where I should begin.

The message for constructing arrays seem to be implemented here

https://github.com/mhmerrill/arkouda/blob/b2a9f42b43dbbc770ac41b931336b18c6b79dd1c/GenSymIO.chpl#L8

It seems all SymEntry types are considered arrays...

https://github.com/mhmerrill/arkouda/blob/332120e01dc029821ac4cca57637ed4a911e348b/MultiTypeSymEntry.chpl#L113-L121

Their domains do not restrict them to being 1D, so maybe I can begin there, and then add operands based on the type of the domain itself. Is there anything I'm missing? @mhmerrill

Make arkouda backwards-compatible with Chapel 1.18

In a discussion with @mhmerrill, he mentioned that (a) he was curious what it would take to make today's 1.19-compatible version of arkouda continue to compile with 1.18 and (b) how GitHub issues work. To that end, I opened up this issue as a sample to consider. I've thrown one of the pre-existing standard labels onto it, but note that new labels can be created and customized (and standard ones removed) to make the issue tracker exactly what you want (for example, we could create distinct versions of "bug", "enhanvement", etc. issues for things that belong on the Chapel team's side (e.g., "make this better in chpl 1.20.0" or on the arkouda side).

fix sorta broken implementation of bools

fix sorta broken implementation of bools (Python3 vs. Numpy behavior) and bool is a type not a string like I have it (~ vs. not)

fix bug in read_all

Should treat str argument as glob expression, like read_hdf.

Integrate GenSymIO.chpl

The code in GenSymIO.chpl correctly and performantly (for BlockDist) reads in HDF5 files to a Block distributed array, but it is not callable from arkouda yet. Add ak.read_hdf(dsetName, filenames) and message processing machinery on the backend. Finally, create a python test script that connects to the server, reads in files specified on the command line into a pdarray, and prints it (with timings).

optimize ak.unique and ak.value_counts

possibly use PrivateDist or ReplicatedDist for ranges of values less than a million.
possibly use PrivateDist couples with assoc array for large ranges of values which are sparsely populated.

make ak.dtypes more like np.dtypes

to allow for casting and more numpy like operation

add where() function

proc where_helper(cond: [?D] bool, A: [D] ?t, B: [D] t): [D] t

and vector-scalar hybrids. Callable from python as c = ak.where(cond, a, b).

HyperNetX (HNX) and Arkouda

I believe it would be best if Arkouda could prioritize implementation and fine-tuning performance with regards to HNX. There was talk of a Chapel HyperGraph LIbrary (CHGL) - HyperNetX (HNX) pipeline, but Arkouda is a very promising alternative. HNX is implemented entirely via numpy and scipy, and so would massively benefit from pdarray, and Arkouda would benefit by stress-testing Arkouda and exposing any bottlenecks in its current design and stage.

Enhance check/test code

write some python scripts to give pass/fail checks between arkouda and numpy

Value checks for division by and negative exponents in pow

-when base array is of type int, and exponent contains an el <0, we need to catch this.
fix this code: https://github.com/mhmerrill/arkouda/blob/9150fdd42fda293f9dab90f54d7b002b0aaa57f7/OperatorMsg.chpl#L130-L133

-check divisions by 0
and this: https://github.com/mhmerrill/arkouda/blob/9150fdd42fda293f9dab90f54d7b002b0aaa57f7/OperatorMsg.chpl#L154-L161

add pow method

We should only worry about the binary version.

int ** int returns int if exponent is non-negative, float if negative.
if base or exponent is float, return float
bool as base is a no-op
bool as exponent equivalent to where(exponent, base, 1)

IO features and fixes

Add features:

print dataset names from HDF5 file (issue #23 )
ability to save a pdarray to HDF5 file(s). I'm thinking the options will be: save to one big file or write one file per locale.
convenience function for re-loading a previously saved pdarray from HDF5

Bug fix: still some unhandled errors with file existence/permissions

Make 'pdarray' extend 'ndarray'?

I'm wondering whether or not pdarray should extend nparray so that any code checking to see if the type of an argument is a ndarray or not?

Consolidate thresholds for data structure selection

There are a few places in the code (e.g. argsort, unique) where we compute some statistic of the data and compare it to a hard-coded threshold to choose a data structure (e.g. dense histogram vs. associative array). I'm thinking we should:

define these thresholds in a separate module (like ServerConfig)
define them in more intuitive terms, like "max amount of scratch memory to use per locale", so that users can more easily tune them for different systems.

LSD Radix Sort

@mhmerrill has a distributed LSD radix sort on the refactor-modules branch in toys/RadixSortLSD.chpl that is really fast. In order to get this into master, we should:

comb through it to eliminate unnecessary memory usage
generalize it to work on tuples of ints and expose python function for multi-column argsort
generalize to real dtypes
define good tests on generated and live data

groupby

Implement grouping and grouped reductions, a la pandas.groupby().aggregate(). On the python side, this would look like

grouped = ak.groupby(key_array)
keys, reduced_values = grouped.aggregate(value_array, operator)

Under the hood, grouped would be a new python class with attributes:

key_array (pdarray): reference to the original array used to define the grouping
permutation (pdarray): the result of calling key_array.argsort()
segments (pdarray): the offsets in permutation for each group of equivalent keys
unique_keys (pdarray): the key that corresponds to each segment/group

On the chapel side, the groupby function would do two things:

Call argsort to get the permutation that sorts key_array
Traverse the sorted key_array to find the segment boundaries
Might as well return the unique_keys array along the way
The return message would be a multi-array creation message a la ak.value_counts(). @mhmerrill has volunteered for the groupby code.

@reuster986 will take the aggregate section and all the python-side code. The .aggregate(value_array, operator) method of a groupby object would pass a message to the arkouda server of the form:

"segmentedReduction permuted_values segments operator"

where permuted_values = value_array[self.permutation]. The return syntax would be similar to ak.value_counts().

the segmented reduction would look like:

proc segmented_reduce(values:[] ?t, segments:[?D] int, reducer) {
    var res: [D] t;
    forall i in D {
        var hi = if (i==D.high) then values.domain.high else segments[i+1] - 1;
        res[i] = reducer(values[segments[i]..hi], t);
    }
    return res;
}

proc add_reducer(v:[] t, type t): t {
    return + reduce v;
}

The message handler would have to select the reduction operator and the element type. Look at argsort code for inspiration on the segmented reduction function.

Question: how do you specify the type of a function pointer as an argument in a signature?

Add ak.array() and pdarray.to_ndarray()

We need a function on the python side that accepts a list-like object and sends it to the chapel backend to create a pdarray. The complementary method, pdarray.to_ndarray(), would also be useful, but with some safety checks to disallow sending back huge arrays. Both of these functions will require a means of bulk data serialization over ZMQ between python and chapel.

add some more random pdarray generators

like np.rand() and np.randn()

Add chapel-side function to read multiple HDF5 datasets

Currently, the way to read multiple datasets from a collection of HDF5 files is to call ak.read_all(), which issues a readhdf message for each dataset. The problem with this approach is that, if the user passes a glob expression for the filenames argument, then each chapel-side readhdfMsg call evaluates the glob for itself, and if new files get added in the meantime, the resulting arrays will have mismatched sizes. What should really happen is there should be a chapel-side readAllHdfMsg function that evaluates the glob expression once, and then reads in each dataset from the same collection of files.

add a way to send keyboard interrupt from python to arkouda and catch it in arkouda_server/chapel

that is catch it in arkouda_server/chapel