
datatable


This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. The object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table).

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.

Installation

On macOS, Linux and Windows systems, installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.


datatable's People

Contributors

abal5, achraf-mer, arnocandel, bboe, chathurindaranasinghe, chi2liu, h2o-ops, hannah-tillman, hpretl, jangorecki, jfaccioni, jlavileze, lucasjamar, mathanraj-sharma, mattdowle, mfrasco, michal-raska, mmalohlava, nkalonia1, oleksiyskononenko, pradkrish, quetzalcohuatl, rpeck, samukweku, sh1ng, siddhesh, st-pasha, tomkraljevic, vstinner, wesnm


datatable's Issues

Implement "rbind" functionality

Suggested syntax:

dt0.append(dt1, dt2, ..., force=False)

This modifies datatable dt0 by appending datatables dt1, dt2, ... from below. The force argument controls the behavior in case the datatables have incompatible columns: if force is False then column mismatch will raise an error. If the parameter is True, then columns that are missing in one of the sources will be filled with NAs.
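The proposed `force` semantics can be sketched in pure Python, using a plain dict-of-lists as a stand-in for a datatable. Here `append_frames` and `NA` are hypothetical illustrative names, not part of the actual datatable API:

```python
# Pure-Python sketch of the proposed `force` semantics. A datatable is
# modeled as a dict mapping column names to lists of values.

NA = None  # placeholder for a missing value

def append_frames(dt0, *dts, force=False):
    """Append rows of dt1, dt2, ... to dt0, modifying dt0 in place."""
    for dti in dts:
        if not force and set(dti) != set(dt0):
            raise ValueError("columns do not match; use force=True")
        n = len(next(iter(dti.values()), []))
        # Columns missing from dt0 must first be back-filled with NAs.
        for col in dti:
            if col not in dt0:
                height = len(next(iter(dt0.values()), []))
                dt0[col] = [NA] * height
        for col in dt0:
            dt0[col].extend(dti.get(col, [NA] * n))
    return dt0

dt0 = {"a": [1, 2]}
append_frames(dt0, {"a": [3], "b": [9]}, force=True)
# dt0 is now {"a": [1, 2, 3], "b": [None, None, 9]}
```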

Remove the notion of ViewColumn

Instead of referencing a column by number, we should just share the pointer between different datatables. We will need to explicitly keep track of reference counts, and then whichever datatable reduces this refcount to 0 is responsible for freeing the memory.

typedef struct Column {
    void   *data;
    MType   mtype;
    SType   stype;
    void   *meta;
    size_t  alloc_size;
    int     refcount;
} Column;

This design will also solve the problem of a column being removed from the source datatable while it is still referenced in a view.
The MT_VIEW type will be eliminated, and it will no longer be possible to have a datatable that mixes "data" and "view" columns (which is much saner anyway).

With this change, the field DataTable.source becomes obsolete and should be removed as well. The field DataTable.rowmapping remains, and applies to all columns. It is no longer possible to mix "view" and "data" columns (unless we add a bit-array, parallel to columns, inside the DataTable).
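The ownership rule above can be sketched in Python. The class names mirror the C struct but are purely illustrative; setting `data` to `None` stands in for `free()`:

```python
# Minimal sketch of reference-counted columns shared between datatables.
# Whichever owner drops the refcount to 0 frees the data.

class Column:
    def __init__(self, data):
        self.data = data
        self.refcount = 0

class DataTable:
    def __init__(self, columns):
        self.columns = columns
        for c in columns:
            c.refcount += 1          # acquire a reference
    def drop(self):
        for c in self.columns:
            c.refcount -= 1          # release; last owner frees memory
            if c.refcount == 0:
                c.data = None        # stands in for free(c->data)
        self.columns = []

col = Column([1, 2, 3])
dt1 = DataTable([col])
dt2 = DataTable([col])    # shares the pointer instead of a ViewColumn
dt1.drop()
assert col.refcount == 1 and col.data == [1, 2, 3]
dt2.drop()
assert col.refcount == 0 and col.data is None
```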

Simple datatable evaluation graph should avoid compilation

Currently, even a query as simple as dt(0) creates an evaluation graph, generates a C module, JIT-compiles that module, and then executes the code to produce the result.
For sufficiently simple queries this adds a lot of unnecessary overhead; instead, we should be able to create the result immediately for such simple cases.
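A fast-path dispatch could look roughly like the sketch below, where a single integer selector bypasses the graph entirely; `jit_evaluate` is a hypothetical placeholder for the full pipeline:

```python
# Sketch: dispatch trivially simple queries (a single integer row
# selector) to a direct fast path; everything else falls back to the
# evaluation-graph / JIT pipeline.

def evaluate(frame, selector):
    if isinstance(selector, int) and not isinstance(selector, bool):
        # Fast path: no graph, no codegen -- just slice out one row.
        return {name: [col[selector]] for name, col in frame.items()}
    return jit_evaluate(frame, selector)   # hypothetical slow path

def jit_evaluate(frame, selector):
    raise NotImplementedError("full evaluation graph goes here")

frame = {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]}
assert evaluate(frame, 0) == {"a": [10], "b": [1.5]}
```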

Requirements for H2O.ai MLI

  • Appends/rbind (#41, #45, #63, #65, #66, #108, #109, #116, #121, #123)
  • Merge/cbind (#132)
  • Calculation of mean, median, mode by column (#276)
  • Imputation with mean, median, mode
  • Column-and-row-based updates, including setting to missing
  • Column-based updates, including setting to missing
  • Row-based updates, including setting to missing
  • Column drop (#38)
  • Column rename (#52)
  • Creating new frames and adding rows and columns to new frames
  • Row and column subsetting (#21)
  • Sorts (https://github.com/h2oai/datatable/projects/2)
  • Inner and left joins

Add `.nrows` attribute to Column struct

This will allow us to reliably check that no OOB access occurs when the same column is shared across multiple datatables. This property should be checked in datatable_check.c. In particular, see #50 for some of the checks that had to be removed when view columns were merged with data columns.

Do not allow True/False as rows/column selectors

Right now True/False are interpreted as 1/0, and select the second/first row or column. This is very counter-intuitive. Perhaps it's better to forbid such selectors for now:

dt0[True]
dt0[False]
dt0[True, :]
dt0[False, :]

The alternative would be to interpret them as selecting all/nothing, however usefulness/obviousness of such constructs is questionable (we have other means of selecting all/nothing).
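The guard itself is a one-liner, with one subtlety: in Python, `isinstance(True, int)` is `True`, so the bool check must come before the int check. A hedged sketch (`check_row_selector` is an illustrative name, not an actual datatable function):

```python
# Sketch of the proposed guard: reject bool selectors explicitly rather
# than letting them silently coerce to 0/1.

def check_row_selector(selector):
    if isinstance(selector, bool):       # must precede the int check!
        raise TypeError("boolean is not a valid row selector")
    if isinstance(selector, int):
        return selector
    raise TypeError(f"unsupported selector: {selector!r}")

assert check_row_selector(1) == 1
try:
    check_row_selector(True)             # dt0[True] would hit this
except TypeError:
    pass
```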

Implement all parameters supported by fread in R

  • input / file: file name to read, or input as a string
  • sep: field separator
  • nrows: maximum number of rows to read
  • header: does the first data line contain column names
  • na.strings: list of strings to be interpreted as NA values #95
  • stringsAsFactors: convert all strings to categoricals -- cannot support until we have categorical type
  • verbose: debugging output
  • skip: how many rows to skip, or skip to row containing this string #86
  • select / drop: Keep some and drop other columns in the file #83
  • colClasses: override column types #83
  • dec: decimal separator (for numbers)
  • col.names: override column names #83
  • check.names: check column names are valid and have no duplicates -- always on
  • encoding
  • quote: dubious, need to rethink. Perhaps this should be quoteRule instead?
  • strip.white: strip leading/trailing whitespace from unquoted fields
  • fill: allow ragged columns
  • blank.lines.skip: ignore blank lines
  • showProgress: display progress bar
  • nThread: number of threads to use.

Linking external function does not work on Ubuntu

On Ubuntu, performing any datatable manipulation causes the python process to abort with error message

LLVM ERROR: Program used external function 'pydatatable_assemble_view' which could not be resolved!

The only remedy appears to be to link the external symbols manually on affected platforms (Windows may have the same problem).

Ensure there are no gaps in struct definitions

This will improve efficiency and allow us to restore the -Wpadded warning.
It is also advisable to arrange the members of a struct in the order of their access frequency (per cache line).

Change definition of VCHAR string column

Suggested changes are:

  • Add a single byte 0xFF at the very beginning.
    The code should then define dataptr = column->data - 1, and everything works as-is.
  • Add an element equal to 1 as the very first element in the offsets region; the length of the offsets region correspondingly becomes nrows + 1. This avoids unnecessary branching in code at the very beginning of the array (e.g. start_offset = i == 0? 1 : abs(offsets[i-1])).
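A Python sketch of reading string i from the proposed layout, assuming the existing convention that a negative offset marks an NA string (the encoding details here are illustrative):

```python
# With nrows + 1 offsets (the first being 1), reading string i never
# needs an i == 0 special case: start comes from offsets[i], end from
# offsets[i + 1]. A negative end offset encodes NA.

def read_string(data, offsets, i):
    end = offsets[i + 1]
    if end < 0:                       # negative offset encodes NA
        return None
    start = abs(offsets[i])           # no branching for i == 0
    return data[start:end]

data = "\xffhelloworld"               # 0xFF sentinel byte + string data
offsets = [1, 6, -6, 11]              # "hello", NA, "world"
assert read_string(data, offsets, 0) == "hello"
assert read_string(data, offsets, 1) is None
assert read_string(data, offsets, 2) == "world"
```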

Make all code C++-compatible

Although we want to maintain mostly C-style code, we also want to start using templates to avoid unnecessary code duplication (macros are too cumbersome, and do not get checked for code coverage).

This would require, at the minimum, the following:

  • All malloc functions should be replaced with dtmallocs, which should cast the result to the appropriate pointer type under C++. That is, under C malloc is called as T* ptr = malloc(size);, whereas under C++ it should become T* ptr = (T*)malloc(size);.
  • Arithmetic with void* is not allowed in C++ (why???), so should be wrapped in macros that cast pointers into char*. Once implemented, the warning -Wpointer-arith should be restored (currently excluded in setup.py). (#142)

NaNs are misaligned in the datatable display widget

A real-valued column is displayed as

    col6
--------
 581.654
-405.503
 382.895
 nan    
 195.676
-156.03 
 nan    
  56.936
-899.825

The nans are aligned as if they had an implicit decimal dot at the end, which is incorrect. Instead, nans should always be right-aligned.
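A minimal sketch of the proposed fix (the formatting details are illustrative, not the actual widget code): numbers are formatted as usual, but "nan" is right-justified within the column width instead of being padded on the right.

```python
import math

# Sketch: right-align "nan" within the column width instead of padding
# it as if it had a trailing decimal part.

def format_column(values, width=8):
    out = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            out.append("nan".rjust(width))     # right-aligned
        else:
            out.append(f"{v:>{width}.6g}")
    return out

col = [581.654, float("nan"), -156.03]
assert format_column(col) == [" 581.654", "     nan", " -156.03"]
```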

fread: race condition between threads writing into a string buffer

The pushBuffer() call happens in order, but outside of the "ordered" section. As a consequence, several threads may execute their pushBuffer code at different speeds, and by the time each computes the location to write to within the output buffer, they may already write out of order. Although this remains only a theoretical possibility so far, it is better to fix this behavior.
Suggested fix: each thread gets its own output string buffer. Before the "ordered" section we insert a "pre-push" callback, where each thread may copy the input into its intermediate buffer, performing any transformations as necessary (for example, unescaping doubled double-quotes or backslash-escapes, or decoding strings into UTF-8). Then, within the "ordered" section, we increment the offsets of each output buffer pointer. Finally, within the pushBuffer call that happens afterwards, we mem-copy the intermediate buffers into the final result buffer.

0-rows dataset cannot be displayed

import datatable as dt
f = dt.open("~/userdata")
f(lambda f: f.col3 > 100000)

produces exception ValueError: max() arg is an empty sequence in line 345 of widget.py
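The likely culprit is a bare `max()` over an empty sequence of cell widths, which raises exactly this ValueError when the frame has 0 rows. Python's built-in `max` accepts a `default=` argument for this case (the variable names below are illustrative, not the actual widget.py code):

```python
# With 0 rows there are no rendered cells, so the width list is empty
# and a bare max(widths) raises ValueError. Passing default= fixes it.

rendered_rows = []                     # a 0-row frame renders no cells
widths = [len(cell) for row in rendered_rows for cell in row]
width = max(widths, default=0)         # instead of bare max(widths)
assert width == 0
```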

Implement `dt.verify_integrity()` function

This function should check whether the data in the datatable is consistent, and correct it if possible. In particular, the following checks are needed:

  • all the checks already performed by the window() function
  • check that boolean columns contain only values 0, 1, and NA
  • check that string/categorical columns contain valid UTF-8 strings
  • check that string/categorical columns have offsets that are monotonically increasing (in abs value) and do not go out-of-range
  • check that string data column in a string section is padded with 0xFFs
  • check validity of all rollup values
  • (to be continued)
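Two of the checks above can be sketched on plain-Python stand-ins for the column data (function names are illustrative, not the actual C implementation):

```python
# Sketch of two integrity checks: boolean columns may hold only 0, 1,
# or NA; string offsets must be monotonically non-decreasing in
# absolute value (negative offsets encode NA strings).

def check_bool_column(values):
    return all(v in (0, 1, None) for v in values)

def check_offsets(offsets):
    return all(abs(a) <= abs(b) for a, b in zip(offsets, offsets[1:]))

assert check_bool_column([0, 1, None, 1])
assert not check_bool_column([0, 2])
assert check_offsets([1, 6, -6, 11])
assert not check_offsets([1, 6, 4])
```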
