bears-r-us / arkouda
Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:
License: Other
The test directory needs a README.md showing how to run the tests.
This could be Python-side-only functionality, but in the future we probably need this on the Chapel side as well.
Say you have arrays a = [0, 1, 2] and b = [-2, -2, -2].
In Python, you perform a **= b.
With float arrays, the resulting array is [inf, 1.0, 0.25].
Python (via NumPy) would seem to allow this behavior by not throwing an error, instead returning inf. However, in our OperatorMsg.chpl we don't allow division by zero.
We can either leave it be, or add an || reduce to see if any element is 0, and another || reduce to see if any element is negative. This may be wasteful/expensive iteration, though.
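For reference, the NumPy float-array behavior being described can be reproduced directly (the runtime warning on 0.0 ** -2.0 is suppressed here):

```python
import numpy as np

# Reproduce the float-array behavior described above: no error is thrown,
# and the zero base with a negative exponent yields inf.
a = np.array([0.0, 1.0, 2.0])
b = np.array([-2.0, -2.0, -2.0])
with np.errstate(divide="ignore"):
    a **= b
# a is now [inf, 1.0, 0.25]
```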
binopvv int ** int currently checks to see if anything in r.a is < 0 (note: the original markdown stripped the ** out of the assignment, leaving "l.ar.a"; restored below):

    when "starstar" {
        if || reduce (r.a < 0) {
            // instead of an error, could we use the code below but with type real?
            return "Error: Attempt to exponentiate base of type Int64 to negative exponent";
        }
        st.addEntry(rname, l.size, int);
        var e = toSymEntry(st.lookup(rname), int);
        e.a = l.a**r.a;
    }
Could we modify the current code to:

    when "starstar" {
        if || reduce (r.a < 0) {
            st.addEntry(rname, l.size, real);
            var e = toSymEntry(st.lookup(rname), real);
            e.a = l.a**r.a;  // may need a cast to real here
        }
        else {
            st.addEntry(rname, l.size, int);
            var e = toSymEntry(st.lookup(rname), int);
            e.a = l.a**r.a;
        }
    }
add argmax(maxloc) and argmin(minloc) reductions
demonstrate Chapel/MPI linkage using FFTW/MPI as a model for how to do this with other MPI-based libs. Example code is in toys/fftw-mpi.chpl.
add uint64, or decide (probably not) to use uint64 or int64 only
types like uint16, uint32, float32
Current pattern (see, for example, proc arange in MsgProcessing.chpl):

    var aD = makeDistDom(len);
    var a = makeDistArray(len, int);
    forall i in aD {
        a[i] = i;
    }
    st.addEntry(new shared SymEntry(a));
This creates the SymEntry by copying a, in order to avoid accessing the SymEntry within the forall, which hotspots on the class instance.
However, I think we can avoid the hotspotting with a task-private alias. Would the following work?

    var e = new shared SymEntry(len, int);
    forall i in e.aD {
        var a = e.a;
        a[i] = i;
    }
    st.addEntry(e);
Current strategy for ak.in1d(A, B) is basically:

    || reduce [b in B] A == b;

This is the best strategy for small B, but doesn't scale to large B. The strategy NumPy uses for large B is essentially (pseudocode):

    uA, uAinv = unique(A, return_inverse=True)
    uB = unique(B)
    together = concat(uA, uB)
    perm = argsort(together)
    sorted = together[perm]
    matches = (sorted[1:] == sorted[:-1]) + [False]
    ret = empty(together.shape, dtype=bool)
    ret[perm] = matches
    return ret[uAinv]
This could be done on a per-locale basis.
Another way would be to convert B to a lock-free associative domain and then do:

    [a in A] Bset.contains(a);
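As a reference point, the sort-based large-B strategy above can be written out in NumPy; the function name and details here are illustrative, not Arkouda code:

```python
import numpy as np

def in1d_large(A, B):
    # Sort-based in1d for large B (sketch of the strategy described above).
    uA, uAinv = np.unique(A, return_inverse=True)  # unique values of A + inverse map
    uB = np.unique(B)
    together = np.concatenate((uA, uB))
    perm = np.argsort(together, kind="stable")
    srt = together[perm]
    # A uA element is "in B" iff an equal element follows it in sorted order
    # (it must come from uB, since uA is duplicate-free); trailing False
    # handles the final position.
    matches = np.concatenate((srt[1:] == srt[:-1], [False]))
    ret = np.empty(together.shape, dtype=bool)
    ret[perm] = matches
    # The first len(uA) slots of ret correspond to uA, which uAinv indexes.
    return ret[uAinv]
```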
1D Scans in 1.19.0 of bool array cast to int need a copy made to parallelize
Currently, there is a prototype that follows very closely how Arkouda implements its pdarray and its server. This proof of concept, I believe, demonstrates the kind of functionality CHGL would require.
https://github.com/pnnl/chgl/blob/master/src/CHGL-Client.py
https://github.com/pnnl/chgl/blob/master/src/CHGL-Server.chpl
I would like to direct further updates towards building alongside, if not on top of, Arkouda. @mhmerrill
Add, at a minimum, a check for file existence and field existence in the HDF5 file.
Enhancements to consider:
check.py: this should be made to work:

    assert array[np.int64(42)] == array[42]
HDF5 file I/O works fine when one file is specified, but when two or more files are specified the program crashes on Mike's Mac. This does not seem to be the case on the Sherlock IB Linux cluster. This could have any one of several causes...
Mike.
Currently, a running arkouda server can be hijacked by anyone on the system that knows the hostname and port. We should implement some sort of authentication, either via a password or a token. My vote is for a token, because it's what Jupyter uses, and it's less of a burden on the user.
arkouda_server.chpl should:
- generate a token on startup (Crypto.CryptoBuffer may be useful here)
- record connection info in ~/.arkouda-running-servers, so that a user can find how to reconnect to a running server (Jupyter also does something like this)

For convenience, ak.connect() should be modified to accept the raw connection string that arkouda_server prints on startup with the hostname, port, and token.
optimize ak.in1d for cases when the second pdarray is much smaller than the first pdarray.
return types for binary operators and operator= should mirror those of numpy
iv:pdarray:int64 = ak.argsort(pdarray) returns an index vector which sorts the original pdarray.
Currently only needed for int64.
toys/CountingSort.chpl has a start at this.
Possibly use PrivateDist or ReplicatedDist if the number of bins is less than a million or so; this is to remove the need for network atomics.
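For reference, the counting-sort idea behind toys/CountingSort.chpl can be sketched serially in NumPy (the distributed bin table over PrivateDist/ReplicatedDist is the Chapel-side concern; names here are illustrative):

```python
import numpy as np

def counting_argsort(a, nbins):
    # Counting-sort-based argsort for non-negative ints in [0, nbins):
    # histogram the values, scan counts into start offsets, place indices.
    counts = np.zeros(nbins, dtype=np.int64)
    np.add.at(counts, a, 1)                        # histogram
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    iv = np.empty(len(a), dtype=np.int64)
    pos = offsets.copy()
    for i, v in enumerate(a):
        iv[pos[v]] = i          # i-th element goes to the next slot of bin v
        pos[v] += 1
    return iv
```

Note that a[counting_argsort(a, nbins)] is sorted, and the placement order makes the sort stable.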
Hi Arkoudans —
While looking through the source tonight, I noticed that there are two implementations of efuncMsg() that appear to be potential duplicates: one in EfuncMsg.chpl, the other in MsgProcessing.chpl. I didn't look much further than that, but wanted to mention it before forgetting, in case it's an oversight.
It would be nice to write

    bins, counts = ak.histogram(array, 100)

where bins and counts are returned as numpy arrays by default, not pdarrays. Then it is simple to plot the histogram with matplotlib:

    from matplotlib import pyplot as plt
    plt.bar(bins, counts, width=1)

There could be a keyword argument that tells arkouda to return pdarrays instead of numpy arrays.
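For comparison, here is what the proposed default return would look like using NumPy itself (array stands in for a pdarray; taking bins as the left edges is one possible convention):

```python
import numpy as np

# np.histogram returns counts plus nbins+1 bin edges; the proposed
# ak.histogram would hand back plottable per-bin arrays directly.
array = np.random.default_rng(0).integers(0, 50, size=1000)
counts, edges = np.histogram(array, bins=100)
bins = edges[:-1]          # left edge of each bin
# plt.bar(bins, counts, width=edges[1] - edges[0]) would then plot it
```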
It seems like there might be a bug in min scan. I tested with this code:

    use BlockDist;
    var D = {0..#10};
    var DD: domain(1) dmapped Block(boundingBox=D) = D;
    var A: [DD] int = [5, 4, 3, 2, 1, 0, 1, 2, 3, 4];
    writeln(min scan A);
    // Expected output: [5 4 3 2 1 0 0 0 0 0]

compiled with

    chpl -senableParScan --fast minscanexample.chpl

and got

    ./minscanexample -nl 4
    -2 -9223372036854775804 -9223372036854775805 -9223372036854775806 1 -9223372036854775808 1 1 3 3

Output of max scan is also weird.
@mhmerrill Do you have or could you generate benchmarks indicating the performance improvement that arkouda provides over numpy?
I'm wondering what it would take to support matrix operations on 2D arrays, as HNX currently uses these extensively. I would like to contribute this, but would like to know where I should begin.
The message for constructing arrays seems to be implemented here:
https://github.com/mhmerrill/arkouda/blob/b2a9f42b43dbbc770ac41b931336b18c6b79dd1c/GenSymIO.chpl#L8
It seems all SymEntry types are considered arrays. Their domains do not restrict them to being 1D, so maybe I can begin there, and then add operands based on the type of the domain itself. Is there anything I'm missing? @mhmerrill
In a discussion with @mhmerrill, he mentioned that (a) he was curious what it would take to make today's 1.19-compatible version of arkouda continue to compile with 1.18 and (b) he wanted to see how GitHub issues work. To that end, I opened this issue as a sample to consider. I've thrown one of the pre-existing standard labels onto it, but note that new labels can be created and customized (and standard ones removed) to make the issue tracker exactly what you want. For example, we could create distinct versions of "bug", "enhancement", etc. issues for things that belong on the Chapel team's side (e.g., "make this better in chpl 1.20.0") or on the arkouda side.
fix the sorta-broken implementation of bools (Python3 vs. NumPy behavior); also, bool is a type, not a string like I have it (~ vs. not)
Should treat the str argument as a glob expression, like read_hdf.
The code in GenSymIO.chpl correctly and performantly (for BlockDist) reads in HDF5 files to a Block distributed array, but it is not callable from arkouda yet. Add ak.read_hdf(dsetName, filenames) and message processing machinery on the backend. Finally, create a python test script that connects to the server, reads in files specified on the command line into a pdarray, and prints it (with timings).
possibly use PrivateDist or ReplicatedDist for ranges of values less than a million.
possibly use PrivateDist coupled with an associative array for large ranges of values which are sparsely populated.
To allow for casting and more numpy-like operation, implement

    proc where_helper(cond: [?D] bool, A: [D] ?t, B: [D] t): [D] t

and vector-scalar hybrids, callable from Python as c = ak.where(cond, a, b).
I believe it would be best if Arkouda could prioritize implementation and fine-tuning performance with regard to HNX. There was talk of a Chapel HyperGraph Library (CHGL) - HyperNetX (HNX) pipeline, but Arkouda is a very promising alternative. HNX is implemented entirely via numpy and scipy, and so would benefit massively from pdarray; Arkouda, in turn, would benefit from the stress-testing, which would expose any bottlenecks in its current design and stage.
write some python scripts to give pass/fail checks between arkouda and numpy
- When the base array is of type int and the exponent contains an element < 0, we need to catch this.
  Fix this code: https://github.com/mhmerrill/arkouda/blob/9150fdd42fda293f9dab90f54d7b002b0aaa57f7/OperatorMsg.chpl#L130-L133
- Check divisions by 0 in this code: https://github.com/mhmerrill/arkouda/blob/9150fdd42fda293f9dab90f54d7b002b0aaa57f7/OperatorMsg.chpl#L154-L161

We should only worry about the binary version:
- int ** int returns int if the exponent is non-negative, float if negative.
- float anywhere: return float.
- bool as base is a no-op.
- bool as exponent is equivalent to where(exponent, base, 1).
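The type rules above can be sketched in NumPy terms (an illustration of the intended promotion rules, not Arkouda's implementation; the bool-base rule follows the text above):

```python
import numpy as np

def starstar(base, exp):
    # Sketch of the ** type rules listed above.
    if base.dtype == np.bool_:
        return base.copy()                       # bool base: no-op
    if exp.dtype == np.bool_:
        return np.where(exp, base, 1)            # bool exponent: where(exp, base, 1)
    if np.issubdtype(base.dtype, np.integer) and np.issubdtype(exp.dtype, np.integer):
        if (exp < 0).any():
            return base.astype(np.float64) ** exp  # negative exponent: promote to float
        return base ** exp                         # int ** non-negative int stays int
    return base.astype(np.float64) ** exp.astype(np.float64)  # any float: float
```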
Add features:
Bug fix: still some unhandled errors with file existence/permissions
I'm wondering whether or not pdarray should extend ndarray, so that any code checking whether the type of an argument is an ndarray would accept a pdarray as well.
There are a few places in the code (e.g. argsort, unique) where we compute some statistic of the data and compare it to a hard-coded threshold to choose a data structure (e.g. dense histogram vs. associative array). I'm thinking we should:
@mhmerrill has a distributed LSD radix sort on the refactor-modules branch in toys/RadixSortLSD.chpl that is really fast. In order to get this into master, we should:
Implement grouping and grouped reductions, a la pandas.groupby().aggregate(). On the Python side, this would look like:

    grouped = ak.groupby(key_array)
    keys, reduced_values = grouped.aggregate(value_array, operator)
Under the hood, grouped would be a new Python class with attributes:
- key_array (pdarray): reference to the original array used to define the grouping
- permutation (pdarray): the result of calling key_array.argsort()
- segments (pdarray): the offsets in permutation for each group of equivalent keys
- unique_keys (pdarray): the key that corresponds to each segment/group

On the Chapel side, the groupby function would do two things:
- call argsort to get the permutation that sorts key_array
- scan the sorted key_array to find the segment boundaries, building the unique_keys array along the way, similar to ak.value_counts()

@mhmerrill has volunteered for the groupby code. @reuster986 will take the aggregate section and all the Python-side code. The .aggregate(value_array, operator) method of a groupby object would pass a message to the arkouda server of the form:

    "segmentedReduction permuted_values segments operator"

where permuted_values = value_array[self.permutation]. The return syntax would be similar to ak.value_counts().
The segmented reduction would look like:

    proc segmented_reduce(values: [] ?t, segments: [?D] int, reducer) {
        var res: [D] t;
        forall i in D {
            var hi = if (i == D.high) then values.domain.high else segments[i+1] - 1;
            res[i] = reducer(values[segments[i]..hi], t);
        }
        return res;
    }

    proc add_reducer(v: [] ?t, type etype): etype {
        return + reduce v;
    }
The message handler would have to select the reduction operator and the element type. Look at argsort
code for inspiration on the segmented reduction function.
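A serial Python analogue of the segmented reduction may help make the intent concrete (illustrative only; segments holds each group's start offset in values):

```python
import numpy as np

def segmented_reduce(values, segments, reducer=np.add):
    # For each segment i, reduce values[segments[i] : segments[i+1]];
    # the last segment runs to the end of values.
    res = np.empty(len(segments), dtype=values.dtype)
    for i, start in enumerate(segments):
        end = len(values) if i == len(segments) - 1 else segments[i + 1]
        res[i] = reducer.reduce(values[start:end])
    return res
```

Passing a NumPy ufunc as reducer sidesteps the function-pointer question on the Python side; the Chapel side still needs to select the operator from the message string.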
Question: how do you specify the type of a function pointer as an argument in a signature?
We need a function on the python side that accepts a list-like object and sends it to the chapel backend to create a pdarray. The complementary method, pdarray.to_ndarray(), would also be useful, but with some safety checks to disallow sending back huge arrays. Both of these functions will require a means of bulk data serialization over ZMQ between python and chapel.
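One minimal shape such bulk serialization could take (1D arrays only; the header format here is purely illustrative, not a proposal for the actual wire protocol):

```python
import numpy as np

def serialize(arr):
    # ASCII header with dtype string and length, then the raw array bytes.
    header = f"{arr.dtype.str}|{arr.size}".encode()
    return header + b"\x00" + arr.tobytes()

def deserialize(msg):
    # Split on the first NUL: header cannot contain one, the payload may.
    header, payload = msg.split(b"\x00", 1)
    dtype_str, size = header.decode().split("|")
    return np.frombuffer(payload, dtype=np.dtype(dtype_str), count=int(size))
```

Either side could then hand the resulting bytes to ZMQ as a single message frame.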
like np.random.rand() and np.random.randn()
Currently, the way to read multiple datasets from a collection of HDF5 files is to call ak.read_all(), which issues a readhdf message for each dataset. The problem with this approach is that, if the user passes a glob expression for the filenames argument, then each chapel-side readhdfMsg call evaluates the glob for itself, and if new files get added in the meantime, the resulting arrays will have mismatched sizes. What should really happen is that there should be a chapel-side readAllHdfMsg function that evaluates the glob expression once and then reads each dataset from the same collection of files.
that is, catch it in arkouda_server/Chapel