larray-project / larray
N-dimensional labelled arrays in Python
Home Page: https://larray.readthedocs.io/
License: GNU General Public License v3.0
TypeError: ptp() got an unexpected keyword argument 'keepdims'
Namely to raise a more meaningful error when the totals are swapped.
by either
equivalent to:
view(Session('filepath'))
The goal is to clean up the current code.
It would probably help to store/support LArrays directly in the model, instead of converting to np.ndarray, but I don't want the whole model to require the use of LArrays (because in that case we would not be able to send the code back to upstream Spyder). One option is to make a generic model and a specific LArray model which would inherit from it. The goal would be to have as much functionality as possible/reasonable available in the generic model (i.e. plot, copy & paste, filter -- but not by labels, obviously). Unsure if that is reasonable though :)
One clear requirement is to keep the ability to view non-LArrays (np.ndarray, lists, tuples). The easiest way to do this would be to convert them to LArray in the Model's __init__, but I would rather avoid that for the above reason.
Another point to keep in mind is that it should be able to handle Pandas DataFrames in the future without too much change.
mostly for output/reporting
short/long
language
...
It should be as close as possible to an LArray with an "array" axis.
# sum the age axis of all arrays *iff* they have such an axis
s.sum(x.age)
# sum all axes of each array present in the session
s.sum_by(x.array)
I.e. fill missing data points after non-missing data points.
See:
http://stackoverflow.com/questions/22491628/extrapolate-values-in-pandas-dataframe/35959909#35959909
Pandas supports interpolation natively (ie fill missing data points between non-missing data points).
http://pandas.pydata.org/pandas-docs/stable/missing_data.html#interpolation
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html#pandas-dataframe-interpolate
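For reference, a minimal Pandas sketch of the difference (interpolation vs. the extrapolation we want; padding with the nearest known value is just one naive extrapolation option):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# interpolation: fills the gap *between* known points (index 2 -> 2.0);
# the leading NaN is left untouched by default
interp = s.interpolate()

# naive extrapolation: also fill before the first and after the last
# known point by padding with the nearest known value
extrap = s.interpolate().ffill().bfill()
```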
e.g. pycharm_minicourse.s_pop (when qx is not clipped)
s_pop = proj_pop.sum_by(x.time)[2040:]
should allow creating groups without an axis (i.e. relying on axis guessing).
eg.
G[2:7]
As a follow-up to #12, we also need tests for the open_excel stuff.
The bug is in Pandas and/or in pyperclip (used by Pandas).
I suspect it is fixed in upstream pyperclip but not in the copy included in Pandas. So it might only need a PR to Pandas which simply updates pyperclip to the latest version.
so that we can register the extension(s) with the viewer/editor/future IDE
e.g. .lacsv and .lah5, or .lcsv and .lh5?
Replace *args and **kwargs with the equivalent explicit arguments of the NumPy functions.
eg. provide an alternative to:
>>> nat = Axis('nat', ['BE', 'FO'])
>>> sex = Axis('sex', ['M', 'F'])
>>> LArray([[0, 1], [2, 3]], [nat, sex])
nat\sex | M | F
BE | 0 | 1
FO | 2 | 3
because it is so error prone.
For a 1d array, stack works nicely, but for 2+, it quickly gets awful.
>>> stack([('M', 0), ('F', 1)], 'sex')
implementing a from_lists function would probably solve this nicely (though a better name might help):
>>> from_lists([['nat\sex', 'M', 'F'],
... ['BE', 0, 1],
... ['FO', 2, 3]])
nat\sex | M | F
BE | 0 | 1
FO | 2 | 3
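A rough sketch of the parsing such a function would need (2D only; it returns plain label lists and an ndarray rather than an actual LArray, and from_lists_2d is of course a hypothetical name):

```python
import numpy as np

def from_lists_2d(rows):
    # the header cell 'nat\sex' carries both axis names
    row_name, col_name = rows[0][0].split('\\')
    col_labels = rows[0][1:]
    row_labels = [r[0] for r in rows[1:]]
    data = np.array([r[1:] for r in rows[1:]])
    return (row_name, row_labels), (col_name, col_labels), data

(rname, rlabels), (cname, clabels), data = from_lists_2d(
    [['nat\\sex', 'M', 'F'],
     ['BE', 0, 1],
     ['FO', 2, 3]])
```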
via specialized methods (union, difference (or setdiff) and intersection)
We might want to use a template for part of it.
we should rewrite most LArray unit tests using small-ish arrays created using ndtest() instead of the current demo-related examples.
eg a.sum_by(x.age)
which should be equivalent to
a.sum(a.axes - x.age)
(which does not work, because aggregate functions do not support an AxisCollection argument)
but this works:
a.sum(*(a.axes - x.age))
The objective is twofold: keep a trace of them and write them as we go during development, so that making a release is not as painful.
Here are a few syntax experiments (but see also #30):
# 2D
stack([(('BE', 'M'), 1.0), (('BE', 'F'), 0.0),
(('FO', 'M'), 1.0), (('FO', 'F'), 0.0)], ('nat', 'sex'))
# 3D
# a) flat list, label tuple
stack([(('BE', 1, 'M'), 1.0), (('BE', 1, 'F'), 0.0),
(('BE', 2, 'M'), 1.0), (('BE', 2, 'F'), 0.0),
(('BE', 3, 'M'), 1.0), (('BE', 3, 'F'), 0.0),
(('FO', 1, 'M'), 1.0), (('FO', 1, 'F'), 0.0),
(('FO', 2, 'M'), 1.0), (('FO', 2, 'F'), 0.0),
(('FO', 3, 'M'), 1.0), (('FO', 3, 'F'), 0.0)],
('nat', 'type', 'sex'))
# b) recursive structure
stack([('BE', [(1, [('M', 1.0), ('F', 0.0)]),
(2, [('M', 1.0), ('F', 0.0)]),
(3, [('M', 1.0), ('F', 0.0)])]),
('FO', [(1, [('M', 1.0), ('F', 0.0)]),
(2, [('M', 1.0), ('F', 0.0)]),
(3, [('M', 1.0), ('F', 0.0)])])],
('nat', 'type', 'sex'))
the fact that it does not break ndtest's title argument
Should add some tests also.
axis.set[a, b]
should be equivalent to:
axis[a, b].set()
The goal is mostly that the __repr__ in #44 actually works, so one option might be to simply change LSet.__repr__ to:
axis[a, b].set()
But the .set[] syntax would also be more efficient, so...
Something like "Python for Econometrics" should be an inspiration. It's a bit messy (order of chapters seems weird to me), but it covers a lot of stuff.
https://www.kevinsheppard.com/images/0/09/Python_introduction.pdf
to complement with_axes, we need a way to replace only one axis (or a few axes). The goal is to have something nicer than:
a.with_axes(a.axes.replace(x.products, industries))
e.g:
a.replace_axis(x.products, industries)
# or
a.with_axes(products=industries)
It works when using engine='xlrd' but since the default engine changed to 'xlwings' it does not work.
To fix this nicely would need a lot of work: support for sparse arrays (#28) and reindex (on a sparse index). However, we could use a temporary shortcut: read the data as (or convert it to) a pd.DataFrame, reindex, then convert back to an LArray. Far from optimal, but much easier to implement.
setdiff1d (numpy) -- works
delete (numpy) -- works for int, slice or list of indices
list.remove (python) -- works for value (inplace)
list.pop (python) -- index
since LArray is more like numpy arrays than Python lists => not remove and pop
=> delete and idelete?
does the label version (delete?) return only unique labels? i.e. a set-like op?
union
intersect
setdiff
setxor
setin
possibly on LArray too (though 1d/flattened only in that case, like numpy -- because otherwise that would return non-cubic arrays).
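For comparison, the numpy flavour of these operations (the proposed setin would correspond to np.in1d):

```python
import numpy as np

a = np.array(['a', 'b', 'c', 'd'])
b = np.array(['c', 'd', 'e'])

union = np.union1d(a, b)      # ['a' 'b' 'c' 'd' 'e']
inter = np.intersect1d(a, b)  # ['c' 'd']
diff = np.setdiff1d(a, b)     # ['a' 'b']
xor = np.setxor1d(a, b)       # ['a' 'b' 'e']
isin = np.in1d(a, b)          # [False False True True]
```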
We should pick one of the two terms and stick with it. Currently we use both (.i and .ipoints but PGroup and posarg*). We should either have:
.p[], .ppoints[], PGroup and posarg*
or
.i[], .ipoints[], IGroup and iarg* (or indarg*)
This is a very important part and it is not tested at all currently, and I manage to break it every other release.
list of characters/patterns with special meaning:
, ; : .. [ ] >> name[] name.i[] {} numbers
* ?
| & !
we might want to reserve some or all other special characters just in case: # @ % / = + -
Or, we could define the precise list of characters a label can be made of which we can guarantee will not be interpreted.
which depends on larray and all optional larray dependencies so that our users only need to do:
conda update larrayenv
and be sure to have all the functionalities installed.
should also change other methods' index_col default value to None instead of [], but the code needs to be changed too
... it should accept a list of labels or an Axis and return a new Axis object (not modify the Axis in-place)
>>> letters = Axis('letters', 'a..z')
>>> letters[':c'].set() & letters['b:d'].set()
letters.set[OrderedSet(['b', 'c'])]
It should rather be:
letters.set['b', 'c']
and possibly Group
>>> arr = ndtest((2, 3))
>>> arr.reindex_axis(x.a, ['a1', 'a0', 'a1', 'a3'])
a\b | b0 | b1 | b2
a1 | 3 | 4 | 5
a0 | 0 | 1 | 2
a1 | 3 | 4 | 5
a3 | nan | nan | nan
It might be easy to implement using something vaguely like:
>>> new_axis = Axis(old_axis.name, new_labels)
>>> missing_value = missing[dtype]
>>> old_indices = old_axis.translate(new_labels, missing=-1)
>>> result = self.i[old_indices]
>>> # fix up those which were missing
>>> result[old_indices == -1] = missing_value
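The same idea in pure numpy (a hypothetical sketch, with a dict-based label lookup standing in for Axis.translate):

```python
import numpy as np

old_labels = ['a0', 'a1']
data = np.arange(6, dtype=float).reshape(2, 3)
new_labels = ['a1', 'a0', 'a1', 'a3']

# map each new label to its position on the old axis, -1 if missing
pos = {label: i for i, label in enumerate(old_labels)}
indices = np.array([pos.get(label, -1) for label in new_labels])

result = data[indices]           # -1 temporarily picks the last row...
result[indices == -1] = np.nan   # ...then rows for missing labels are fixed up
```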
at a minimum, move Axis, AxisCollection and LGroup tests out of test_la.py
Rolling, expanding, ...
See
http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions
http://xarray.pydata.org/en/stable/generated/xarray.DataArray.rolling.html#xarray.DataArray.rolling
For numpy:
https://gist.github.com/seberg/3866040
http://www.rigtorp.se/2011/01/01/rolling-statistics-numpy.html
Bottleneck also supports move_*
https://pypi.python.org/pypi/Bottleneck
there are no built-in move functions in numpy, so it compares against its own implementation:
https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/slow/move.py
But it seems like Pandas works well even with numpy arrays, so I guess I shouldn't bother and should simply use Pandas, which has a lot more features than all the other solutions anyway.
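A quick reminder of what the Pandas window API gives us out of the box:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# rolling: window-sized moving aggregate (NaN until the window is full)
rolled = s.rolling(window=3).mean()   # NaN NaN 2.0 3.0 4.0

# expanding: aggregate over everything seen so far
expanded = s.expanding().sum()        # 1.0 3.0 6.0 10.0 15.0
```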