
arkouda's People

Contributors

ajpotts, awthomp, bmcdonald3, bradcray, brandon-neth, bryantlam, dlongnecke-cray, drculhane, e-kayrakli, egelberg, ethan-debandi99, ezfman, glitch, hokiegeek2, jabraham17, jaketrookman, jeichert60, jeremiah-corrado, joshmarshall1, lydia-duncan, mhmerrill, mppf, reuster986, ronawho, shreyaskhandekar, slnguyen, spartee, stress-tess, theoddczar789, vasslitvinov

arkouda's Issues

Zero Raised to Negative Number

Say you have arrays a = [0, 1, 2] and b = [-2, -2, -2].
In Python, you perform a **= b.
The resulting array is [inf, 1.0, 0.25].

Python seems to allow this behavior by not throwing an error and instead returning "inf". However, in our OperatorMsg.chpl we don't allow division by zero.

We can either leave it be, or add an || reduce to check whether any base element is 0 and another || reduce to check whether any exponent is negative. That extra iteration may be wasteful/expensive, though.
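
For reference, a minimal numpy check of the behavior described above: with float inputs, a zero base raised to a negative power yields inf (plus a runtime warning) rather than an error.

import numpy as np

a = np.array([0, 1, 2], dtype=np.float64)
b = np.array([-2, -2, -2], dtype=np.float64)
print(a ** b)   # [inf 1. 0.25], with a RuntimeWarning about divide by zero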

Int Raised to Negative Number

binopvv Int ** Int currently checks to see if anything in r.a is <0

when "starstar" {
if || reduce (r.a<0){
//instead of error, could we paste the below code but of type float?
return "Error: Attempt to exponentiate base of type Int64 to negative exponent";
}
st.addEntry(rname, l.size, int);
var e = toSymEntry(st.lookup(rname), int);
e.a= l.ar.a;
}

Could we modify the current code to:
when "starstar" {
if || reduce (r.a<0){
st.addEntry(rname, l.size, real);
var e = toSymEntry(st.lookup(rname), real);
e.a= l.ar.a;
}
else{
st.addEntry(rname, l.size, int);
var e = toSymEntry(st.lookup(rname), int);
e.a= l.ar.a;
}
}

link FFTW/MPI against arkouda_server

Demonstrate Chapel/MPI linkage using FFTW/MPI as a model for how to do this with other MPI-based libraries. Example code is in toys/fftw-mpi.chpl.

Eliminate copies from ak.arange(), etc.

Current pattern (see, for example, MsgProcessing.chpl in proc arange):

var aD = makeDistDom(len);
var a = makeDistArray(len, int);
forall i in aD {
    a[i] = i;
}
st.addEntry(new shared SymEntry(a));

This creates the SymEntry by copying a, in order to avoid accessing the SymEntry within the forall, which hotspots on the class instance.
However, I think we can avoid the hotspotting with a task-private alias. Would the following work?

var e = new shared SymEntry(len, int);
forall i in e.aD {
    ref a = e.a;  // alias, not a copy, so writes land in the SymEntry's array
    a[i] = i;
}
st.addEntry(e);

Efficient strategy for ak.in1d() with large set

Current strategy for ak.in1d(A, B) is basically:

|| reduce [b in B] A == b;

This is the best strategy for small B, but doesn't scale to large B. The strategy numpy uses for large B is essentially (pseudocode):

uA, uAinv = find_segments(A, return_inverse=True)
uB = unique(B)
together = concat(uA, uB)
perm = argsort(together)
sorted = together[perm]
matches = (sorted[1:] == sorted[:-1]) + [False]
ret = empty(A.shape, dtype=bool)
ret[perm] = matches
return ret[uAinv]

This could be done on a per-locale basis.
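
A runnable numpy version of the pseudocode above, for reference (find_segments is replaced by np.unique with return_inverse=True; the stable mergesort is what makes the adjacent-equality trick work):

import numpy as np

def in1d_large(A, B):
    uA, uAinv = np.unique(A, return_inverse=True)   # unique values of A, plus inverse mapping
    uB = np.unique(B)
    together = np.concatenate((uA, uB))
    perm = together.argsort(kind='mergesort')        # stable: uA entries precede equal uB entries
    sorted_vals = together[perm]
    # a uA entry is present in B iff it is immediately followed by an equal uB entry
    matches = np.concatenate((sorted_vals[1:] == sorted_vals[:-1], [False]))
    ret = np.empty(together.shape, dtype=bool)
    ret[perm] = matches
    return ret[:len(uA)][uAinv]                      # expand back to A's shape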

Another way would be to convert B to a lock-free associative domain, and then do

[a in A] Bset.contains(a);

HDF5 I/O crashed on mike's mac

HDF5 file I/O works fine when one file is specified, but when two or more files are specified the program crashes on Mike's Mac. This does not seem to be the case on the Sherlock IB Linux cluster.

This could have any one of several causes...

  1. bug in arkouda
  2. bug in chapel llvm build
  3. bug in HDF5 module
  4. I don't know

Mike.

Add basic authentication

Currently, a running arkouda server can be hijacked by anyone on the system who knows the hostname and port. We should implement some sort of authentication, either via a password or a token. My vote is for a token, because it's what Jupyter uses, and it's less of a burden on the user.

arkouda_server.chpl should:

  • generate a cryptographically secure random token (probably with Crypto.CryptoBuffer)
  • print the token upon server startup (like jupyter notebook)
  • save the token (and hostname:port) to a user-readable-only file like ~/.arkouda-running-servers, so that a user can find how to reconnect to a running server (jupyter also does something like this)
  • check for the correct token in the "connect" message

For convenience, ak.connect() should be modified to accept the raw connection string that arkouda_server prints on startup with the hostname, port, and token.
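
A minimal client-side sketch of parsing such a connection string; the tcp://host:port?token=... format here is an assumption for illustration, not something the server prints today.

from urllib.parse import urlparse, parse_qs

def parse_connect_url(url):
    # e.g. url = "tcp://node01:5555?token=0123abcd"  (format assumed for illustration)
    parsed = urlparse(url)
    token = parse_qs(parsed.query).get("token", [None])[0]
    return parsed.hostname, parsed.port, token

host, port, token = parse_connect_url("tcp://node01:5555?token=0123abcd")
# ak.connect(host, port, token) would then include the token in the "connect" message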

optimize ak.in1d()

optimize ak.in1d for cases when the second pdarray is much smaller than the first pdarray.

Add ak.argsort() to mirror np.argsort()

iv:pdarray:int64 = ak.argsort(pdarray)
returns an index vector which sorts the original pdarray
currently only needed for int64
toys/CountingSort.chpl has a start at this
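
For reference, the numpy behavior this mirrors:

import numpy as np

a = np.array([5, 2, 9, 1])
iv = np.argsort(a)   # index vector that sorts a
print(iv)            # [3 1 0 2]
print(a[iv])         # [1 2 5 9]

ak.argsort(pdarray) would return iv as an int64 pdarray with the same property.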

optimize ak.histogram

possibly use PrivateDist or ReplicatedDist if the number of bins is less than a million or so.
this is to remove the need for network atomics.

Two efuncMsg()s?

Hi Arkoudans —

While looking through the source tonight, I noticed that there are two implementations of efuncMsg() that appear to be potential duplicates, one in EfuncMsg.chpl, the other in MsgProcessing.chpl. I didn't look much further than that, but wanted to mention it before forgetting, in case it's an oversight.

make ak.histogram return same as np.histogram?

It would be nice to write

bins, counts = ak.histogram(array, 100)

where bins and counts are returned as numpy arrays by default, not pdarrays. Then it is simple to plot the histogram with matplotlib:

from matplotlib import pyplot as plt
plt.bar(bins, counts, width=1)

There could be a keyword argument that tells arkouda to return pdarrays instead of numpy arrays.
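
For reference, np.histogram returns counts and bin edges, and the edges array has one more element than counts, so the matplotlib call has to account for that:

import numpy as np
from matplotlib import pyplot as plt

data = np.random.randn(10000)
counts, edges = np.histogram(data, 100)
plt.bar(edges[:-1], counts, width=edges[1] - edges[0])
plt.show()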

min scan bug

It seems like there might be a bug in min scan. I tested with this code:

use BlockDist;
var D = {0..#10};
var DD: domain(1) dmapped Block(boundingBox=D) = D;
var A: [DD] int = [5, 4, 3, 2, 1, 0, 1, 2, 3, 4];
writeln(min scan A);
// Expected output: [5 4 3 2 1 0 0 0 0 0]

compiled with

chpl -senableParScan --fast minscanexample.chpl

and got

./minscanexample -nl 4
-2 -9223372036854775804 -9223372036854775805 -9223372036854775806 1 -9223372036854775808 1 1 3 3

Output of max scan is also weird.

Add support for rank-2 arrays to support matrix operations

I'm wondering what it would take to support matrix operations on 2D arrays, as HNX currently uses this extensively. I would like to contribute this but would like to know where I should begin.

The message for constructing arrays seems to be implemented here

https://github.com/mhmerrill/arkouda/blob/b2a9f42b43dbbc770ac41b931336b18c6b79dd1c/GenSymIO.chpl#L8

It seems all SymEntry types are considered arrays...

https://github.com/mhmerrill/arkouda/blob/332120e01dc029821ac4cca57637ed4a911e348b/MultiTypeSymEntry.chpl#L113-L121

Their domains do not restrict them to being 1D, so maybe I can begin there, and then add operands based on the type of the domain itself. Is there anything I'm missing? @mhmerrill

Make arkouda backwards-compatible with Chapel 1.18

In a discussion with @mhmerrill, he mentioned that (a) he was curious what it would take to make today's 1.19-compatible version of arkouda continue to compile with 1.18, and (b) he was curious how GitHub issues work. To that end, I opened this issue as a sample to consider. I've thrown one of the pre-existing standard labels onto it, but note that new labels can be created and customized (and standard ones removed) to make the issue tracker exactly what you want. For example, we could create distinct versions of "bug", "enhancement", etc. for things that belong on the Chapel team's side (e.g., "make this better in chpl 1.20.0") versus the arkouda side.

Integrate GenSymIO.chpl

The code in GenSymIO.chpl correctly and performantly (for BlockDist) reads in HDF5 files to a Block distributed array, but it is not callable from arkouda yet. Add ak.read_hdf(dsetName, filenames) and message processing machinery on the backend. Finally, create a python test script that connects to the server, reads in files specified on the command line into a pdarray, and prints it (with timings).
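
A sketch of such a test script, assuming the ak.read_hdf(dsetName, filenames) signature proposed above; the argument handling is illustrative.

import sys, time
import arkouda as ak

ak.connect()                       # defaults to localhost:5555

dset, filenames = sys.argv[1], sys.argv[2:]
start = time.time()
a = ak.read_hdf(dset, filenames)   # proposed API from this issue
print(a)
print("read {} elements in {:.2f} s".format(a.size, time.time() - start))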

optimize ak.unique and ak.value_counts

possibly use PrivateDist or ReplicatedDist for ranges of values less than a million.
possibly use PrivateDist coupled with an associative array for large ranges of values which are sparsely populated.

add where() function

proc where_helper(cond: [?D] bool, A: [D] ?t, B: [D] t): [D] t

and vector-scalar hybrids. Callable from python as c = ak.where(cond, a, b).
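
For reference, the numpy semantics ak.where(cond, a, b) would mirror:

import numpy as np

cond = np.array([True, False, True])
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(np.where(cond, a, b))   # [ 1 20  3]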

HyperNetX (HNX) and Arkouda

I believe it would be best if Arkouda could prioritize implementation and performance tuning with regard to HNX. There was talk of a Chapel HyperGraph Library (CHGL) - HyperNetX (HNX) pipeline, but Arkouda is a very promising alternative. HNX is implemented entirely via numpy and scipy, and so would benefit massively from pdarray; in turn, HNX would stress-test Arkouda and expose any bottlenecks in its current design and stage of development.

add __pow__ method

We should only worry about the binary version.

  • int ** int returns int if exponent is non-negative, float if negative.
  • if base or exponent is float, return float
  • bool as base is a no-op
  • bool as exponent equivalent to where(exponent, base, 1)
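
A rough sketch of these result-type rules as a dtype-selection helper; the function and argument names are illustrative, not part of the arkouda API.

import numpy as np

def pow_result_dtype(base_dtype, exp_dtype, any_negative_exponent):
    if np.float64 in (base_dtype, exp_dtype):
        return np.float64          # float base or exponent -> float
    if exp_dtype == np.bool_:
        return base_dtype          # where(exponent, base, 1) keeps the base's dtype
    if base_dtype == np.bool_:
        return np.bool_            # bool base is a no-op
    return np.float64 if any_negative_exponent else np.int64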

IO features and fixes

Add features:

  • print dataset names from an HDF5 file (issue #23)
  • ability to save a pdarray to HDF5 file(s). I'm thinking the options will be: save to one big file or write one file per locale.
  • convenience function for re-loading a previously saved pdarray from HDF5

Bug fix: still some unhandled errors with file existence/permissions

Make 'pdarray' extend 'ndarray'?

I'm wondering whether pdarray should extend ndarray, so that code which checks whether the type of an argument is an ndarray would also accept a pdarray.
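
For example, isinstance-based checks like the following (hypothetical user code) would start accepting pdarrays if pdarray subclassed ndarray:

import numpy as np

def check_array(x):
    # a pdarray passes this check only if pdarray subclasses np.ndarray
    return isinstance(x, np.ndarray)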

Consolidate thresholds for data structure selection

There are a few places in the code (e.g. argsort, unique) where we compute some statistic of the data and compare it to a hard-coded threshold to choose a data structure (e.g. dense histogram vs. associative array). I'm thinking we should:

  • define these thresholds in a separate module (like ServerConfig)
  • define them in more intuitive terms, like "max amount of scratch memory to use per locale", so that users can more easily tune them for different systems.

LSD Radix Sort

@mhmerrill has a distributed LSD radix sort on the refactor-modules branch in toys/RadixSortLSD.chpl that is really fast. In order to get this into master, we should:

  • comb through it to eliminate unnecessary memory usage
  • generalize it to work on tuples of ints and expose python function for multi-column argsort
  • generalize to real dtypes
  • define good tests on generated and live data

groupby

Implement grouping and grouped reductions, a la pandas.groupby().aggregate(). On the python side, this would look like

grouped = ak.groupby(key_array)
keys, reduced_values = grouped.aggregate(value_array, operator)

Under the hood, grouped would be a new python class with attributes:

  • key_array (pdarray): reference to the original array used to define the grouping
  • permutation (pdarray): the result of calling key_array.argsort()
  • segments (pdarray): the offsets in permutation for each group of equivalent keys
  • unique_keys (pdarray): the key that corresponds to each segment/group

On the chapel side, the groupby function would:

  • Call argsort to get the permutation that sorts key_array
  • Traverse the sorted key_array to find the segment boundaries
  • Return the unique_keys array along the way, since it falls out of the traversal

The return message would be a multi-array creation message a la ak.value_counts(). @mhmerrill has volunteered for the groupby code.

@reuster986 will take the aggregate section and all the python-side code. The .aggregate(value_array, operator) method of a groupby object would pass a message to the arkouda server of the form:

"segmentedReduction permuted_values segments operator"

where permuted_values = value_array[self.permutation]. The return syntax would be similar to ak.value_counts().

the segmented reduction would look like:

proc segmented_reduce(values: [] ?t, segments: [?D] int, reducer) {
    var res: [D] t;
    forall i in D {
        // hi: last index of segment i (the final segment runs to the end of values)
        var hi = if (i == D.high) then values.domain.high else segments[i+1] - 1;
        res[i] = reducer(values[segments[i]..hi]);
    }
    return res;
}

proc add_reducer(v: [] ?t): t {
    return + reduce v;
}

The message handler would have to select the reduction operator and the element type. Look at argsort code for inspiration on the segmented reduction function.

Question: how do you specify the type of a function pointer as an argument in a signature?

Add ak.array() and pdarray.to_ndarray()

We need a function on the python side that accepts a list-like object and sends it to the chapel backend to create a pdarray. The complementary method, pdarray.to_ndarray(), would also be useful, but with some safety checks to disallow sending back huge arrays. Both of these functions will require a means of bulk data serialization over ZMQ between python and chapel.
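
Hypothetical usage of the two proposed functions; the names come from this issue, but the exact signatures are still to be decided.

import arkouda as ak
import numpy as np

ak.connect()
a = ak.array([1, 2, 3, 4])   # list-like object shipped to the server as a pdarray
nd = a.to_ndarray()          # pulled back to the client as a numpy ndarray
assert isinstance(nd, np.ndarray)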

Add chapel-side function to read multiple HDF5 datasets

Currently, the way to read multiple datasets from a collection of HDF5 files is to call ak.read_all(), which issues a readhdf message for each dataset. The problem with this approach is that, if the user passes a glob expression for the filenames argument, then each chapel-side readhdfMsg call evaluates the glob for itself, and if new files get added in the meantime, the resulting arrays will have mismatched sizes. What should really happen is there should be a chapel-side readAllHdfMsg function that evaluates the glob expression once, and then reads in each dataset from the same collection of files.
