
modin's People

Contributors

amyskov, andreypavlenko, anmyachev, arunjose696, billiam-wang, dchigarev, devin-petersohn, dorisjlee, eavidan, garra1980, gshimansky, ienkovich, ipacheco-uy, itamarst, jbrockmendel, kunalgosar, mvashishtha, naren-ponder, noloerino, osalpekar, prutskov, rehansd, retribution98, rubtsowa, simon-mo, todd-yu, vnlitvinov, williamma12, wuisawesome, yarshev

modin's Issues

describe can error after read_csv

Describe the problem

describe gives an error after read_csv if not all columns are described.

Source code / logs

import subprocess
import modin.pandas as pd

subprocess.call(['wget', 'https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-01.csv',
                 '-O', '/tmp/green_tripdata_2017-01.csv'])
csv_data = pd.read_csv('/tmp/green_tripdata_2017-01.csv')
csv_data.describe()

groupby where by is a list of integers can cause problems based on internal index

Describe the problem

groupby can return incorrect or truncated results when by is a list of integers that are all contained in the index in each partition.

Source code / logs

In [1]: import modin.pandas as pd
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:49938 to respond...
Waiting for redis server at 127.0.0.1:39318 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.

======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui75359.ipynb?token=546e4e9fca85392d39d82072b5d206ef67d2753a5b3743e3
======================================================================

In [2]: import ray

In [3]: import pandas

In [4]: pandas_df = pandas.DataFrame({'col1': [0, 1, 2, 3],
                                      'col2': [4, 5, 6, 7],
                                      'col3': [3, 8, 12, 10],
                                      'col4': [17, 13, 16, 15],
                                      'col5': [-4, -5, -6, -7]})

In [5]: modin_df = pd.DataFrame(pandas_df)

In [6]: for k, v in modin_df.groupby(by=[1,2,1,2]):
    print(k)
    print(v)

Link to Pandas Documentation in our own Docs for usage

For reference it is useful to have links to the Pandas documentation page from our own documentation.

This should be possible with a script and hopefully would not require significant manual data entry.

We can discuss here whether to create a new page for the links, or to just link from the existing methods pages.
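
A minimal sketch of such a script, assuming the pandas API reference keeps one page per method under https://pandas.pydata.org/docs/reference/api/ (the URL pattern should be verified against the pandas version being documented). The helper names and the example method list are hypothetical:

```python
# Assumed URL layout of the pandas API reference; check it before relying on it.
PANDAS_API_BASE = "https://pandas.pydata.org/docs/reference/api"


def pandas_doc_url(method_name, owner="DataFrame"):
    """Build the assumed pandas documentation URL for one method."""
    return "{}/pandas.{}.{}.html".format(PANDAS_API_BASE, owner, method_name)


def doc_link_table(method_names, owner="DataFrame"):
    """Return (method, url) pairs, sorted by method name."""
    return [(name, pandas_doc_url(name, owner)) for name in sorted(method_names)]


if __name__ == "__main__":
    # Hypothetical method list; a real script would introspect the Modin API.
    for name, url in doc_link_table(["describe", "groupby", "mean"]):
        print("{}: {}".format(name, url))
```

The same pairs could feed either option discussed here: a dedicated links page or a link on each existing method page.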

Not using all cores on large machines

Describe the problem

We only use a maximum of 8 partitions. We should set the number of partitions automatically instead of requiring users to set it themselves.

Source code / logs

This is the only change we should need

import multiprocessing

DEFAULT_NPARTITIONS = multiprocessing.cpu_count()

Fix empty series return value for numeric functions

Describe the problem

Pandas 0.23.4 returns

0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64

only when it cannot calculate the mean, median, or similar functions, with numeric_only=True or numeric_only=None (if possible) and axis=1. If axis=0, pandas returns an empty series of type np.int64.

Source code / logs

import modin.pandas as pd
data = {
        "col1": [1, 'a', 3, 4],
        "col2": [4, 5, 6, 'd'],
        "col3": [8.0, 9.4, 'e', 11.3],
        "col4": ["a", "b", "c", "d"],
}
modin_df = pd.DataFrame(data)
modin_df.mean(axis = 'columns', skipna = False, numeric_only = None)

PyTest Warning

/usr/local/lib/python2.7/site-packages/_pytest/python.py:197: RemovedInPytest4Warning: Fixture "test_ndim" called directly. Fixtures are not meant to be called directly, are created automatically when test functions request them as parameters. See https://docs.pytest.org/en/latest/fixture.html for more information.

Our CI is giving us tons of warnings of this kind. Worth investigating.

mean returns the wrong values

Describe the problem

mean returns the wrong values

Source code / logs

import modin.pandas as pd
data = {
        "col1": [0, 1, 2, 3],
        "col2": [4, 5, 6, 7],
        "col3": [8, 9, 10, 11],
        "col4": [12, 13, 14, 15],
        "col5": [0, 0, 0, 0],
}
modin_df = pd.DataFrame(data)
modin_df.mean(axis=1)

We expect to get:

0    4.8
1    5.6
2    6.4
3    7.2
dtype: float64

but we get

0    4.000000
1    4.666667
2    5.333333
3    6.000000
dtype: float64
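
One observation that may help narrow this down (an arithmetic check, not a confirmed diagnosis): each incorrect value equals the row sum divided by 6 rather than by the 5 columns. The plain-Python sketch below reproduces both outputs:

```python
data = {
    "col1": [0, 1, 2, 3],
    "col2": [4, 5, 6, 7],
    "col3": [8, 9, 10, 11],
    "col4": [12, 13, 14, 15],
    "col5": [0, 0, 0, 0],
}

# Sum each row across the five columns.
row_sums = [sum(values) for values in zip(*data.values())]

# Correct mean: divide by the number of columns (5).
expected = [s / len(data) for s in row_sums]

# Hypothesis only: the wrong output matches a divisor that is one too large.
observed = [s / (len(data) + 1) for s in row_sums]

print(expected)
print([round(v, 6) for v in observed])
```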

Diff returns wrong values when axis='rows'

Describe the problem

Diff returns the wrong values when axis='rows'

Source code / logs

import modin.pandas as pd
data = {
        "col1": [0, 1, 2, 3],
        "col2": [4, 5, 6, 7],
        "col3": [8, 9, 10, 11],
        "col4": [12, 13, 14, 15],
        "col5": [0, 0, 0, 0],
}
modin_df = pd.DataFrame(data)
modin_df.diff(axis='rows')

We expect to get:

	col1	col2	col3	col4	col5
0	NaN	NaN	NaN	NaN	NaN
1	1.0	1.0	1.0	1.0	0.0
2	1.0	1.0	1.0	1.0	0.0
3	1.0	1.0	1.0	1.0	0.0

But get:

	col1	col2	col3	col4	col5
0	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN

REFACTOR: Epic: Add mypy for all of modin

I'll start adding type hints to the source code starting from the re-write (#70)

Here is my proposed approach:

  1. Add Python 3 type hints gradually: def func(x: int) -> float
  2. Add mypy to CI checks
  3. For Python 2, we will not use the # type: (int) -> float comment workaround. Instead, when we distribute the package, we will strip away all the type hints using strip-hints so the code is Python 2 compatible.
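
As an illustration of the two styles in the list above (the function names are hypothetical, not Modin code): the first form uses Python 3 annotations, and the second is the PEP 484 type-comment form that this proposal avoids.

```python
def scale(x: int, factor: float = 1.5) -> float:
    """Python 3 annotation style, as proposed in step 1."""
    return x * factor


def scale_with_type_comment(x, factor=1.5):
    # type: (int, float) -> float
    """Type-comment style, which also parses under Python 2."""
    return x * factor
```

mypy reads both forms, but only the first records the types in `scale.__annotations__` at runtime.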

Any comments and suggestions welcomed!

[CI] Add Build for Ray Master

In a recent PR (#78) we removed the dependency on Ray master. Let's add it back as a separate build against Ray master, to test whether our code is future-proof.

Ray's master wheels are named after the latest PyPI version and are hosted on S3. (https://ray.readthedocs.io/en/latest/installation.html)

Here's how it can be installed without hardcoding it:

We just need to run the test on ray master for py3.6 and linux.

Implement modin.pandas.DataFrame.align

Describe the problem

Currently, align is not implemented for DataFrame objects.

Source code / logs

In [1]: import modin.pandas as pd
   ...: import numpy as np
   ...: 
   ...: frame_data = np.random.randint(0, 100, size=(2**12, 2**8))
   ...: df = pd.DataFrame(frame_data)
   ...: 
   ...: 
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:24882 to respond...
Waiting for redis server at 127.0.0.1:55764 to respond...
Starting the Plasma object store with 27.00 GB memory.
Starting local scheduler with the following resources: {'GPU': 0, 'CPU': 8}.

======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui89247.ipynb?token=ca5e30c50fb8d5a3bc873a8adf2539198dbfef8b003ff195
======================================================================


In [2]: df.align(df)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-2-afa7038926eb> in <module>()
----> 1 df.align(df)

~/anaconda/lib/python3.5/site-packages/modin/pandas/dataframe.py in align(self, other, join, axis, level, copy, fill_value, method, limit, fill_axis, broadcast_axis)
   1094               broadcast_axis=None):
   1095         raise NotImplementedError(
-> 1096             "To contribute to Pandas on Ray, please visit "
   1097             "github.com/modin-project/modin.")
   1098 

NotImplementedError: To contribute to Pandas on Ray, please visit github.com/modin-project/modin.

Modin throws RayGetError when doing cumulative functions

Describe the problem

Throws RayGetError when doing cumulative functions of nonnumeric dtypes

Source code / logs

import modin.pandas as pd
import numpy as np
data = {
        "col1": 1.0,
        "col2": np.datetime64("2011-06-15T00:00"),
        "col3": np.array([3] * 4, dtype="int32"),
        "col4": "foo",
        "col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.cummax(axis=1)

Throws the following error:

RayGetError: Could not get objectid ObjectID(01000000f0cc325805f5c83895ed0532827d52de). It was created by remote function modin.data_management.partitioning.axis_partition.deploy_ray_axis_func which failed with:

Remote function modin.data_management.partitioning.axis_partition.deploy_ray_axis_func failed with:

Traceback (most recent call last):
  File "/Users/William/Documents/modin/modin/data_management/partitioning/axis_partition.py", line 188, in deploy_ray_axis_func
    result = func(dataframe, **kwargs)
  File "/Users/William/Documents/modin/modin/data_management/data_manager.py", line 158, in helper
    def helper(df, internal_indices=[]):
  File "/Users/William/Documents/modin/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 9661, in cum_func
    result = accum_func(y, axis)
  File "/Users/William/Documents/modin/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 8829, in <lambda>
    lambda y, axis: np.maximum.accumulate(y, axis), "max",
  File "pandas/_libs/tslibs/timestamps.pyx", line 170, in pandas._libs.tslibs.timestamps._Timestamp.__richcmp__
TypeError: Cannot compare type 'Timestamp' with type 'float'

__repr__ and __str__ have inefficient implementations

Describe the problem

Printing large dataframes is slow.

Also see #37. We are concatenating full columns, which is very slow for large dataframes. A solution similar to #39 will need to be implemented.

Source code / logs

import modin.pandas as pd

df = pd.read_csv(...)
print(df)

__repr__ dots sometimes do not match pandas

See code

In the test, the Modin DataFrame returns two dots for the row separating the head and tail, but pandas returns three dots for that section:

E.g.
Modin

28   70  48  66  14  23  82  26   6   7  14 ...   8  44   8  28  60  38   1   
29   90  63  26  73  14  36  83  72  15   9 ...  40  84   6  44   2  54  94   
..   ..  ..  ..  ..  ..  ..  ..  ..  ..  .. ...  ..  ..  ..  ..  ..  ..  ..   
970  46  41  68  61  89   3  42  13  58   4 ...  30  11  86  58  99  77  86   
971  73  30  40  31  85  59  39  23  60  36 ...  47  66  90  46  23  82  69  

pandas

28     70   48   66   14   23   82   26    6    7   14  ...    8   44    8   28   60   38    1   
29     90   63   26   73   14   36   83   72   15    9  ...   40   84    6   44    2   54   94   
...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
970    46   41   68   61   89    3   42   13   58    4  ...   30   11   86   58   99   77   86   
971    73   30   40   31   85   59   39   23   60   36  ...   47   66   90   46   23   82   69  

We use pandas __repr__, but it is truncating a dot.

Fundamental memory leak in Modin

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • Modin installed from (source or binary): pip install modin
  • Modin version: 0.1.2
  • Python version: 3.6.0
  • Exact command to reproduce:

Run twice:

import modin.pandas as mpd
df = mpd.read_csv('4kk_lines.csv', sep=';')

Describe the problem

Modin doesn't free memory when a variable is reassigned. Concretely, the expected behavior is that while a table is read from disk, memory usage grows until the whole dataframe is in RAM, and then drops back to its previous level when the variable that held the earlier copy of the dataframe is overwritten. This is how Pandas (and any regular logic) works.

With Modin, however, the memory isn't freed when the variable is overwritten. Instead, usage doubles, and every later rerun of this code increases memory usage further, which indicates a memory leak somewhere.

I also tried some slicing on the loaded dataframe. Since no data is copied, I expected memory usage not to grow, but it did. Here is the example:

df[df['id'] == 123].shape

My table has 4,000,000 lines and 14 columns, which takes about 3 GB of RAM when loaded. Running the code above 50 times (as a performance test) used up all 110 GB of RAM on my remote server.

Convert Formatting to black

I think it would be good to convert our formatter to black. Depending on the version, yapf can give different outcomes, and black formatting looks better IMO.

Additionally, we should add some git hooks to make sure that submitted code is always correctly formatted.

cc @simon-mo

Full reduce functions return empty dataframe

Describe the problem

Modin errors out when we try to return an empty dataframe with full reduce operations (operations that return a series such as all, any, count)

Source code / logs

import modin.pandas as pd
data = {
        "col1": [1, 'a', 3, 4],
        "col2": [4, 5, 6, 'd'],
        "col3": [8.0, 9.4, 'e', 11.3],
        "col4": ["a", "b", "c", "d"],
}
modin_df = pd.DataFrame(data)
modin_df.count(numeric_only=True)

Can't install Modin with Python 3.7

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.13.6
  • Modin installed from (source or binary): Both
  • Modin version: 0.1.1 and HEAD
  • Python version: 3.7.0
  • Exact command to reproduce: pip install modin

Describe the problem

Modin currently depends on Pandas 0.22, which depends on a version of Numpy that won't compile against Python 3.7. I had to downgrade my system Python from 3.7 to 3.6 to install Modin (i.e. brew switch python 3.6.5). If you could upgrade to the latest version of Pandas and add a Python 3.7 build, it would save others from having to figure out this workaround.

__repr__ after a transpose is slow

Describe the problem

Previously, transpose was extremely slow and copied the entire dataset. Now we store some metadata and do the transpose at the same time as another operation.

It is instant to do df.T, but when you repr(df.T) it will trigger the transpose. We need to debug why it takes so long on the repr.

Source code / logs

> %time x = df.T
> %time repr(x)

Improve memory use of read_csv

read_csv falls back to the Pandas implementation for certain situations. This is expensive in terms of memory due to data duplication; first, we create a Pandas dataframe using pandas.read_csv, and then convert it to a Modin dataframe.

The following events fall back to pandas.read_csv and should be fixed to be more memory efficient:

  • file does not exist on disk (e.g. located in S3).
  • filepath_or_buffer is not an instance of str, py.path.local or pathlib.Path
  • file is compressed (high priority)
  • as_recarray is True
  • chunksize is not None
  • skiprows is list-like or callable
  • nrows is not None
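
The conditions above can be collected into a single predicate. This is a sketch only: `needs_pandas_fallback` is a hypothetical helper rather than a function in io.py, the py.path.local case is omitted to avoid a third-party import, and compression is detected from an explicit `compression` argument or from a small assumed list of file extensions.

```python
import os
import pathlib

# Assumed set of extensions treated as compressed; not taken from Modin.
COMPRESSED_SUFFIXES = (".gz", ".bz2", ".zip", ".xz")


def needs_pandas_fallback(filepath_or_buffer, compression=None, as_recarray=False,
                          chunksize=None, skiprows=None, nrows=None):
    """Return True if any of the listed fallback conditions holds.

    In this sketch, compression=None means "not specified by the caller".
    """
    if not isinstance(filepath_or_buffer, (str, pathlib.Path)):
        return True  # buffers and other non-path objects
    path = str(filepath_or_buffer)
    if not os.path.exists(path):
        return True  # e.g. a remote location such as S3
    if compression is not None or path.endswith(COMPRESSED_SUFFIXES):
        return True
    if as_recarray or chunksize is not None or nrows is not None:
        return True
    # An integer skiprows is fine; list-like or callable values fall back.
    if callable(skiprows) or isinstance(skiprows, (list, tuple, set, range)):
        return True
    return False
```

Keeping the checks in one place makes it easier to remove them one at a time as each case gains a memory-efficient path.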

Most changes need to be done in io.py.

Thanks to @Bidek56 for reporting!

Error when inserting to Back of DataFrame

When inserting a new column to the back of a Modin DataFrame (or invoking __setitem__ with a new column name), Modin would error (Axis Length Mismatch) since the new column was not being inserted properly. Used the following script to reproduce:

import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**20, 2**8))
df = pd.DataFrame(frame_data)
df['new'] = 0

Unable to perform head, tail, sample on dataframe

This is what I did to check out modin as a pandas replacement

import modin.pandas as pd
d = pd.read_csv('boston_housing.csv')
d.head()

and I got the following error:
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/modin/pandas/utils.py", line 380, in create_blocks
return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Any clue as to why this might have happened?

pd.describe() doesn't ignore Datetime/Timedelta columns when there are other numeric columns

pd.describe() should describe all the columns when there are no numeric columns in the DataFrame; otherwise it should describe only the numeric columns. Currently, pd.describe() isn't ignoring the non-numeric columns (for example, booleans and datetime/timedelta columns). This causes an index length mismatch during set-axis, because columns with different types have differently sized indices for their descriptions.

[Ray] Secure Redis ports

Currently Ray's Redis ports are not secured by default which is a problem on systems exposed to the internet.

Once ray-project/ray#2952 is merged, I recommend securing Redis ports with ray.init(redis_password=password) where password is securely generated e.g. by using the secrets module.
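
A minimal sketch of generating such a password with the standard-library secrets module. The ray.init(redis_password=...) call appears only as a comment because it depends on the Ray change referenced above; the helper name and the token length are assumptions.

```python
import secrets


def generate_redis_password(num_bytes=32):
    """Return a cryptographically strong, URL-safe random password."""
    return secrets.token_urlsafe(num_bytes)


password = generate_redis_password()

# Once the referenced Ray change is available, pass it at start-up:
# import ray
# ray.init(redis_password=password)
```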

Head/Tail implementation is slow

Someone posted some benchmarks on Twitter comparing us against a couple of other tools. head and tail were performing incredibly slowly.

The culprit is column_partitions. We wait until the entire columns are collected and then perform the head. This is incredibly inefficient.

The same is true with tail.
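
To illustrate the idea (a conceptual sketch over plain Python lists, not Modin's partition classes): head(n) only needs rows from the leading row partitions, so it can stop as soon as n rows are gathered, and tail(n) is the mirror image.

```python
def partitioned_head(row_partitions, n):
    """Collect the first n rows, touching only as many partitions as needed."""
    rows = []
    for part in row_partitions:
        if len(rows) >= n:
            break  # later partitions are never read
        rows.extend(part[: n - len(rows)])
    return rows


def partitioned_tail(row_partitions, n):
    """Collect the last n rows by walking the partitions from the end."""
    rows = []
    for part in reversed(row_partitions):
        if len(rows) >= n:
            break
        needed = n - len(rows)
        rows = list(part[-needed:]) + rows
    return rows
```

With three partitions of sizes 3, 3 and 2, a head of 4 rows reads only the first two partitions.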

Mean sometimes produces `np.nan` values for the result incorrectly

Describe the problem

df.mean can produce np.nan values for the result incorrectly.

This is really only a problem on extremely small datasets, which is why it went unnoticed before.

Source code / logs

import modin.pandas as pd
frame_data = {
         "col1": [1, 2, 3, 4],
         "col2": [4, 5, 6, 7],
         "col3": [8.0, 9.4, 10.1, 11.3],
         "col4": ["a", "b", "c", "d"],
     }
df = pd.DataFrame(frame_data)
df.mean(skipna=False, axis='columns', numeric_only=None)

gitrevision errors

git_revision = _execute_cmd_in_temp_env(['git', 'rev-parse', 'HEAD'])

The current git revision lookup tries to find the git commit by calling Popen directly from the user's process. The command therefore runs in whatever directory the user's process started in, which might not be a git repository. To fix this, we just need to run the line above in the directory containing modin.__file__, or something similar.
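
A sketch of the suggested fix, assuming it is acceptable to return None when the package directory is not a git checkout or git is unavailable. `get_git_revision` is a hypothetical stand-in for the existing `_execute_cmd_in_temp_env` helper, whose implementation is not shown here.

```python
import os
import subprocess


def get_git_revision(module_file):
    """Run `git rev-parse HEAD` in the directory containing module_file."""
    module_dir = os.path.dirname(os.path.abspath(module_file))
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=module_dir,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or module_dir is not inside a repository
    return output.decode("ascii").strip()


# Usage: pass the package's own __file__, e.g. get_git_revision(modin.__file__)
```

Passing `cwd` makes the result independent of where the user's process was started.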

NotImplementedError: Groupby with lists of columns not yet supported.

System information

  • OS Platform and Distribution: Linux CentOS
  • Modin installed from: pip
  • Modin version: 0.1.1
  • Python version: 3.6.6
  • Exact command to reproduce: t1 = t1.groupby(["subid","label"])["cnt"].count().reset_index()

Describe the problem

Groupby with lists of columns not yet supported.

Source code / logs

Waiting for redis server at 127.0.0.1:59835 to respond...
Waiting for redis server at 127.0.0.1:16671 to respond...
Starting local scheduler with the following resources: {'CPU': 56, 'GPU': 4}.
Traceback (most recent call last):
File "gen_fea_online.py", line 318, in <module>
df = get_data_log_hive(pre_time=(2018,8,7))
File "gen_fea_online.py", line 290, in get_data_log_hive
df_vod40 = gen_fea_active_log(df_vod40)
File "gen_fea_online.py", line 238, in gen_fea_active_log
t1 = t1.groupby(["subid","label"])["cnt"].count().reset_index()
File "/root/anaconda3/lib/python3.6/site-packages/modin/pandas/dataframe.py", line 823, in groupby
"Groupby with lists of columns not yet supported.")
NotImplementedError: Groupby with lists of columns not yet supported.

Consider Using Code Formatter like Black or Yapf

Running Flake8 and manually fixing the lint errors hurts the developer experience.

We should use an automated code formatter like Black or Yapf and have automated scripts format the code before commit.

Flake8 can still be run to check for code style issues like unused variables, but line width and whitespace should be handled by automation.

mode returning RayGetError or wrong value

Describe the problem

Running mode returns either a wrong value or RayGetError

Source code / logs

import modin.pandas as pd
data = {'col1': [1, 2, 3, 4],
             'col2': [4, 5, 6, 7],
             'col3': [8.0, 9.4, 10.1, 11.3],
             'col4': ['a', 'b', 'c', 'd']}
modin_df = pd.DataFrame(data)
modin_df.mode(axis='rows', numeric_only=True)

returns a RayGetError

Update Black formatting

Describe the problem

During a recent pull request #88, we had a linting failure from black. We need to update the codebase with the most recent version of black.

`drop` does not always drop columns or rows on the same partition

Describe the problem

drop does not drop columns or rows from the same partition if they contain the same name. This required a large refactor of the overall implementation.

Source code / logs

import pandas
import modin.pandas as pd
pd.DEFAULT_NPARTITIONS = 2

nu_df = pandas.DataFrame(pandas.compat.lzip(range(3), range(-3, 1),
                                                 list('abc')), columns=['a', 'a', 'b'])
ray_nu_df = pd.DataFrame(nu_df)
x = ray_nu_df.drop('a', axis=1)
y =  nu_df[['b']]
print(x)

Printing x gives an error because it did not drop all of the columns it should have. However,

x._col_partitions
print(x)

Works correctly.

Repr for DataFrameView is broken

When IndexMetadata and the block partitions don't match, getting _col_partitions and _row_partitions raises an error.

In [4]: df
Out[4]:
   col1  col2  col3  col4  col5
0     0     4     8    12     0
1     1     5     9    13     0
2     2     6    10    14     0
3     3     7    11    15     0

In [5]: df.iloc[:3]
Out[5]: ---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    398                         if cls is not object \
    399                                 and callable(cls.__dict__.get('__repr__')):
--> 400                             return _repr_pprint(obj, self, cycle)
    401
    402             return _default_pprint(obj, self, cycle)

~/anaconda3/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    693     """A pprint that just redirects to the normal repr function."""
    694     # Find newlines and replace them with p.break_()
--> 695     output = repr(obj)
    696     for idx,output_line in enumerate(output.splitlines()):
    697         if idx:

~/Desktop/modin/modin/modin/pandas/dataframe.py in __repr__(self)
    451         if len(self._row_metadata) <= 60 and \
    452            len(self._col_metadata) <= 20:
--> 453             return repr(self._repr_pandas_builder())
    454         # The split here is so that we don't repr pandas row lengths.
    455         result = self._repr_pandas_builder()

~/Desktop/modin/modin/modin/pandas/dataframe.py in _repr_pandas_builder(self)
    380         # If we don't exceed the maximum number of values on either dimension
    381         if len(self.index) <= 60 and len(self.columns) <= 20:
--> 382             return to_pandas(self)
    383
    384         if len(self.index) >= 60:

~/Desktop/modin/modin/modin/pandas/utils.py in to_pandas(df)
    225         A new pandas DataFrame.
    226     """
--> 227     pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
    228     pandas_df.index = df.index
    229     pandas_df.columns = df.columns

~/Desktop/modin/modin/modin/pandas/dataframe.py in _get_row_partitions(self)
    200             self._row_metadata._lengths = \
    201                 self._row_metadata._lengths[empty_rows_mask]
--> 202             self._block_partitions = self._block_partitions[empty_rows_mask, :]
    203         return [_blocks_to_row.remote(*part)
    204                 for i, part in enumerate(self._block_partitions)]

IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 4

Implement modin.pandas.DataFrame.as_blocks

Describe the problem

Currently as_blocks is not implemented.

Source code / logs

In [1]: import modin.pandas as pd
   ...: import numpy as np
   ...: 
   ...: frame_data = np.random.randint(0, 100, size=(2**12, 2**8))
   ...: df = pd.DataFrame(frame_data)
   ...: 
   ...: 
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:59321 to respond...
Waiting for redis server at 127.0.0.1:36048 to respond...
Starting the Plasma object store with 27.00 GB memory.
Starting local scheduler with the following resources: {'GPU': 0, 'CPU': 8}.

======================================================================
View the web UI at http://localhost:8888/notebooks/ray_ui66732.ipynb?token=f0ef48a507f20daa610b2bd8da93e7e2f62955bec34e6bcb
======================================================================


In [2]: df.as_blocks()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-2-0a8fd5d307e9> in <module>()
----> 1 df.as_blocks()

~/anaconda/lib/python3.5/site-packages/modin/pandas/dataframe.py in as_blocks(self, copy)
   1242     def as_blocks(self, copy=True):
   1243         raise NotImplementedError(
-> 1244             "To contribute to Pandas on Ray, please visit "
   1245             "github.com/modin-project/modin.")
   1246 

NotImplementedError: To contribute to Pandas on Ray, please visit github.com/modin-project/modin.

Fix numerical functions that can only use numeric inputs (var, skew, mean but not max, min)

Describe the problem

Modin uses only numeric values even when numeric_only=False. We should be consistent with pandas and throw a TypeError.

Source code / logs

import modin.pandas as pd
import numpy as np
data = {
        "col1": 1.0,
        "col2": np.datetime64("2011-06-15T00:00"),
        "col3": np.array([3] * 4, dtype="int32"),
        "col4": "foo",
        "col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.median(axis='rows', skipna = False, numeric_only = False)

Should throw a TypeError but returns

col1    1.0
col3    3.0
col5    1.0
dtype: float64

as if numeric_only=True

Fix functions with `bool_only` arguments

Describe the problem

When bool_only=True for all and any (the only functions that take in the bool_only argument), modin throws a ValueError

Source code / logs

import modin.pandas as pd
import numpy as np
data = {
        "col1": 1.0,
        "col2": np.datetime64("2011-06-15T00:00"),
        "col3": np.array([3] * 4, dtype="int32"),
        "col4": "foo",
        "col5": True,
}
modin_df = pd.DataFrame(data)
modin_df.any(bool_only=True)

Throws the following error:
ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements

TypeError: '>' not supported between instances of 'list' and 'int' in iloc

System information

  • OS Platform and Distribution: "Ubuntu 18.04 LTS"
  • Modin installed from (source or binary): binary
  • Modin version: 0.1.0
  • Python version: 3.6.5
  • Exact command to reproduce: df.iloc[0:5,:]

Calling: df.iloc[0:5,:] results in the following error:

Traceback (most recent call last):
File "bakeoffModin.py", line 15, in <module>
print( f'Sample:\n {df.iloc[0:5,:]} ')
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 230, in __str__
return repr(self)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 325, in __repr__
return repr(self.repr_helper())
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 235, in repr_helper
return to_pandas(self)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 227, in to_pandas
pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 198, in _get_row_partitions
empty_rows_mask = self._row_metadata._lengths > 0
TypeError: '>' not supported between instances of 'list' and 'int'

Pandas groupby.median casts to int when possible

groupby.median() returns a DataFrame of floats in most cases; however, when possible, it may return a DataFrame of ints instead:

In [40]: df = pd.DataFrame(np.random.randint(0, 8, size=(100, 4)),
    ...:                                      columns=list('ABCD'))
    ...: df.groupby(df['A'].tolist()).median()
    ...:
Out[40]:
     A    B    C    D
0  0.0  4.0  4.0  4.0
1  1.0  3.0  1.0  1.0
2  2.0  6.0  4.0  4.0
3  3.0  3.0  5.5  3.0
4  4.0  4.0  4.0  5.0
5  5.0  4.0  4.0  3.0
6  6.0  5.0  2.5  2.0
7  7.0  2.5  3.0  4.5

In [41]: df = pd.DataFrame(np.random.randint(0, 8, size=(100, 4)),
    ...:                                      columns=list('ABCD'))
    ...: df.groupby(df['A'].tolist()).median()
    ...:
Out[41]:
   A  B  C  D
0  0  3  4  2
1  1  3  4  2
2  2  2  2  3
3  3  4  3  2
4  4  1  4  4
5  5  1  3  2
6  6  2  4  5
7  7  5  4  4

We need to address this in our groupby.
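
The two outputs above are consistent with a rule of "cast to int only if every median is a whole number". The plain-Python sketch below mimics that rule for a single column; it is an assumption inferred from the outputs, not pandas' actual implementation, and `grouped_median` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def grouped_median(keys, values):
    """Median of values per key, cast to int when all medians are integral."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    medians = {key: float(median(vals)) for key, vals in groups.items()}
    # Assumed rule: the cast is all-or-nothing across the whole result.
    if all(m.is_integer() for m in medians.values()):
        return {key: int(m) for key, m in medians.items()}
    return medians
```

A Modin groupby would need to make the same decision after combining results from all partitions, since a single non-integral median changes the dtype of the whole result.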

Pandas `numeric_only` behavior in full reduce in python2

In pandas, full_reduce-like operations (e.g. max, min, mean, ...) take a numeric_only argument. By default, it will:

  • Try to operate on full axis if possible
  • If first option errors, operate on numeric_only values.

In Modin, we decided the following behavior:

numeric_only = True if axis else kwargs.get("numeric_only", False)

because of the asynchronous nature of our computation model.

However, this will lead to the following behavior in python2. In python2:

In [1]: max([1,2,3,'a'])
Out[1]: 'a'

In a mixed type dataframe:

   col1  col2  col3 col4
0     1     4   8.0    a
1     2     5   9.4    b
2     3     6  10.1    c
3     4     7  11.3    d

taking max over rows will lead to

0    a
1    b
2    c
3    d
dtype: object

This is not the expected behavior, so we chose not to follow pandas' behavior in this situation.
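
For comparison, the two-step pandas behavior listed above can be sketched in plain Python 3, where comparing int and str raises TypeError (unlike the Python 2 result shown earlier). `reduce_max` is a hypothetical helper, not Modin or pandas code.

```python
import numbers


def reduce_max(values):
    """Try the full reduction first; on TypeError, retry on numeric values only."""
    try:
        return max(values)
    except TypeError:
        # Fallback step: keep real numbers only (booleans excluded in this sketch).
        numeric = [v for v in values
                   if isinstance(v, numbers.Number) and not isinstance(v, bool)]
        return max(numeric) if numeric else float("nan")
```

The fallback needs the first attempt to fail before the second can start, which is the ordering that is awkward to express in an asynchronous computation model.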

Efficient View Structure for reindexing

After the backend re-write, a global view of global_index -> (blk_index, internal_index) no longer exists anywhere. This forces re-index to do a full copy (which will be a bottleneck in the distributed case). It also creates a problem for DataManagerView. Take the following example:

In [6]: df
Out[6]:
   col1  col2  col3
0     1     6    10
1     2     7    11
2     3     8    12
3     4     9    13

In [7]: df.iloc[:, [1,2,0]]
Out[7]:
   col2  col3  col1
0     6    10     1
1     7    11     2
2     8    12     3
3     9    13     4

Say the dataframe is partitioned into (2,2) blocks with block widths [2,1], so rows of col1 and col2 share RemotePartitions. The current DataManagerView does not support operations like this: it applies an iloc call to the internal dataframes, which does not work in this case.

We need a way to shuffle without gathering data. Here's an idea: whenever we need to do a shuffle, we break the metadata down to each row/column, duplicate certain RemotePartition objects, and attach an apply_func to select certain columns. Here's an example:

  • Blocks look like this:
    image

When we reorder ['col1', 'col2', 'col3', 'col4'] into ['col3', 'col2', 'col4', 'col1'], the blocks will be organized as follows to represent the dataframe:

image

I'm not planning on implementing this in the re-write. But it would be great to include it in the repartition/efficient shuffle PR.
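
A sketch of the lookup described in this issue, using the column example above (block widths [2, 1]). It builds the global_index -> (blk_index, internal_index) map and then translates a reordering into per-position block selections without moving any data. The function names are hypothetical.

```python
def build_global_index(block_widths):
    """Map each global column index to (blk_index, internal_index)."""
    mapping = []
    for blk_index, width in enumerate(block_widths):
        for internal_index in range(width):
            mapping.append((blk_index, internal_index))
    return mapping


def plan_reorder(block_widths, new_order):
    """For each output position, say which block and internal column to select."""
    mapping = build_global_index(block_widths)
    return [mapping[global_index] for global_index in new_order]


# col1 and col2 live in block 0; col3 lives in block 1.
plan = plan_reorder([2, 1], [1, 2, 0])
print(plan)
```

Each entry in the plan corresponds to one duplicated RemotePartition reference plus an apply_func that selects the given internal column.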
