
datatable


This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. The object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table).

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.

Installation

On macOS, Linux and Windows systems, installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.


datatable's People

Contributors

abal5, achraf-mer, arnocandel, bboe, chathurindaranasinghe, chi2liu, h2o-ops, hannah-tillman, hpretl, jangorecki, jfaccioni, jlavileze, lucasjamar, mathanraj-sharma, mattdowle, mfrasco, michal-raska, mmalohlava, nkalonia1, oleksiyskononenko, pradkrish, quetzalcohuatl, rpeck, samukweku, sh1ng, siddhesh, st-pasha, tomkraljevic, vstinner, wesnm


datatable's Issues

Implement "rbind" functionality

Suggested syntax:

dt0.append(dt1, dt2, ..., force=False)

This modifies datatable dt0 by appending datatables dt1, dt2, ... from below. The force argument controls the behavior in case the datatables have incompatible columns: if force is False then column mismatch will raise an error. If the parameter is True, then columns that are missing in one of the sources will be filled with NAs.
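The proposed `force` semantics can be sketched in pure Python, using a plain dict-of-lists as a stand-in for a datatable. Here `append_frames` and `NA` are hypothetical illustrative names, not part of the actual datatable API:

```python
# Pure-Python sketch of the proposed `force` semantics. A datatable is
# modeled as a dict mapping column names to lists of values.

NA = None  # placeholder for a missing value

def append_frames(dt0, *dts, force=False):
    """Append rows of dt1, dt2, ... to dt0, modifying dt0 in place."""
    for dti in dts:
        if not force and set(dti) != set(dt0):
            raise ValueError("columns do not match; use force=True")
        n = len(next(iter(dti.values()), []))
        # Columns missing from dt0 must first be back-filled with NAs.
        for col in dti:
            if col not in dt0:
                height = len(next(iter(dt0.values()), []))
                dt0[col] = [NA] * height
        for col in dt0:
            dt0[col].extend(dti.get(col, [NA] * n))
    return dt0

dt0 = {"a": [1, 2]}
append_frames(dt0, {"a": [3], "b": [9]}, force=True)
# dt0 is now {"a": [1, 2, 3], "b": [None, None, 9]}
```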

Remove the notion of ViewColumn

Instead of referencing a column by number, we should just share the pointer between different datatables. We will need to explicitly keep track of reference counts, and then whichever datatable reduces this refcount to 0 is responsible for freeing the memory.

typedef struct Column {
    void   *data;
    MType   mtype;
    SType   stype;
    void   *meta;
    size_t  alloc_size;
    int     refcount;
} Column;

This design will also solve the problem of a column being removed from the source datatable while it is still referenced in a view.
The MT_VIEW type will be eliminated, and it will no longer be possible to have a datatable that mixes "data" and "view" columns (which is much saner anyway).

With this change, the field DataTable.source becomes obsolete and should be removed as well. The field DataTable.rowmapping remains, and applies to all columns. It is no longer possible to mix "view" and "data" columns (unless we add a bit-array, parallel to columns, inside the DataTable).
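The ownership rule above can be sketched in Python. The class names mirror the C struct but are purely illustrative; setting `data` to `None` stands in for `free()`:

```python
# Minimal sketch of reference-counted columns shared between datatables.
# Whichever owner drops the refcount to 0 frees the data.

class Column:
    def __init__(self, data):
        self.data = data
        self.refcount = 0

class DataTable:
    def __init__(self, columns):
        self.columns = columns
        for c in columns:
            c.refcount += 1          # acquire a reference
    def drop(self):
        for c in self.columns:
            c.refcount -= 1          # release; last owner frees memory
            if c.refcount == 0:
                c.data = None        # stands in for free(c->data)
        self.columns = []

col = Column([1, 2, 3])
dt1 = DataTable([col])
dt2 = DataTable([col])    # shares the pointer instead of a ViewColumn
dt1.drop()
assert col.refcount == 1 and col.data == [1, 2, 3]
dt2.drop()
assert col.refcount == 0 and col.data is None
```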

Simple datatable evaluation graph should avoid compilation

Currently, even a query as simple as dt(0) creates an evaluation graph, generates a C module, JIT-compiles that module, and then executes the code to produce the result.
For sufficiently simple queries this adds a lot of unnecessary overhead; instead, we should be able to create the result immediately for such simple cases.
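A fast-path dispatch could look roughly like the sketch below, where a single integer selector bypasses the graph entirely; `jit_evaluate` is a hypothetical placeholder for the full pipeline:

```python
# Sketch: dispatch trivially simple queries (a single integer row
# selector) to a direct fast path; everything else falls back to the
# evaluation-graph / JIT pipeline.

def evaluate(frame, selector):
    if isinstance(selector, int) and not isinstance(selector, bool):
        # Fast path: no graph, no codegen -- just slice out one row.
        return {name: [col[selector]] for name, col in frame.items()}
    return jit_evaluate(frame, selector)   # hypothetical slow path

def jit_evaluate(frame, selector):
    raise NotImplementedError("full evaluation graph goes here")

frame = {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]}
assert evaluate(frame, 0) == {"a": [10], "b": [1.5]}
```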

Requirements for H2O.ai MLI

  • Appends/rbind (#41, #45, #63, #65, #66, #108, #109, #116, #121, #123)
  • Merge/cbind (#132)
  • Calculation of mean, median, mode by column (#276)
  • Imputation with mean, median, mode
  • Column-and-row-based updates, including setting to missing
  • Column-based updates, including setting to missing
  • Row-based updates, including setting to missing
  • Column drop (#38)
  • Column rename (#52)
  • Creating new frames and adding rows and columns to new frames
  • Row and column subsetting (#21)
  • Sorts (https://github.com/h2oai/datatable/projects/2)
  • Inner and left joins

Add `.nrows` attribute to Column struct

This will allow us to reliably check that no OOB access occurs when the same column is shared across multiple datatables. This property should be checked in datatable_check.c. In particular, see #50 for some of the checks that had to be removed when view columns were merged with data columns.

Do not allow True/False as rows/column selectors

Right now True/False are interpreted as 1/0, and select the second/first row or column. This is very counter-intuitive. Perhaps it's better to forbid such selectors for now:

dt0[True]
dt0[False]
dt0[True, :]
dt0[False, :]

The alternative would be to interpret them as selecting all/nothing, however usefulness/obviousness of such constructs is questionable (we have other means of selecting all/nothing).
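The guard itself is a one-liner, with one subtlety: in Python, `isinstance(True, int)` is `True`, so the bool check must come before the int check. A hedged sketch (`check_row_selector` is an illustrative name, not an actual datatable function):

```python
# Sketch of the proposed guard: reject bool selectors explicitly rather
# than letting them silently coerce to 0/1.

def check_row_selector(selector):
    if isinstance(selector, bool):       # must precede the int check!
        raise TypeError("boolean is not a valid row selector")
    if isinstance(selector, int):
        return selector
    raise TypeError(f"unsupported selector: {selector!r}")

assert check_row_selector(1) == 1
try:
    check_row_selector(True)             # dt0[True] would hit this
except TypeError:
    pass
```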

Implement all parameters supported by fread in R

  • input / file: file name to read, or input as a string
  • sep: field separator
  • nrows: maximum number of rows to read
  • header: does the first data line contain column names
  • na.strings: list of strings to be interpreted as NA values #95
  • stringsAsFactors: convert all strings to categoricals -- cannot support until we have categorical type
  • verbose: debugging output
  • skip: how many rows to skip, or skip to row containing this string #86
  • select / drop: Keep some and drop other columns in the file #83
  • colClasses: override column types #83
  • dec: decimal separator (for numbers)
  • col.names: override column names #83
  • check.names: check column names are valid and have no duplicates -- always on
  • encoding
  • quote: dubious, need to rethink. Perhaps this should be quoteRule instead?
  • strip.white: strip leading/trailing whitespace from unquoted fields
  • fill: allow ragged columns
  • blank.lines.skip: ignore blank lines
  • showProgress: display progress bar
  • nThread: number of threads to use.

Linking external function does not work on Ubuntu

On Ubuntu, performing any datatable manipulation causes the python process to abort with error message

LLVM ERROR: Program used external function 'pydatatable_assemble_view' which could not be resolved!

The only remedy appears to be to link the external symbols manually on affected platforms (Windows may have the same problem).

Ensure there are no gaps in struct definitions

This will improve efficiency and allow us to restore the -Wpadded warning.
It is also advisable to arrange the members of a struct in the order of their access frequency (per cache line).

Change definition of VCHAR string column

Suggested changes are:

  • Add a single byte 0xFF at the very beginning.
    The code should then define dataptr = column->data - 1, and everything works as-is.
  • Add an element equal to 1 as the very first element in the offsets region; the length of the offsets region correspondingly becomes nrows + 1. This avoids unnecessary branching in code at the very beginning of the array (e.g. start_offset = i == 0? 1 : abs(offsets[i-1])).
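A Python sketch of reading string i from the proposed layout, assuming the existing convention that a negative offset marks an NA string (the encoding details here are illustrative):

```python
# With nrows + 1 offsets (the first being 1), reading string i never
# needs an i == 0 special case: start comes from offsets[i], end from
# offsets[i + 1]. A negative end offset encodes NA.

def read_string(data, offsets, i):
    end = offsets[i + 1]
    if end < 0:                       # negative offset encodes NA
        return None
    start = abs(offsets[i])           # no branching for i == 0
    return data[start:end]

data = "\xffhelloworld"               # 0xFF sentinel byte + string data
offsets = [1, 6, -6, 11]              # "hello", NA, "world"
assert read_string(data, offsets, 0) == "hello"
assert read_string(data, offsets, 1) is None
assert read_string(data, offsets, 2) == "world"
```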

Make all code C++-compatible

Although we want to maintain mostly C-style code, we also want to start using templates to avoid unnecessary code duplication (macros are too cumbersome, and do not get checked for code coverage).

This would require, at the minimum, the following:

  • All malloc functions should be replaced with dtmallocs, which should cast the result to the appropriate pointer type under C++. That is, under C malloc is called as T* ptr = malloc(size);, whereas under C++ it should become T* ptr = (T*)malloc(size);.
  • Arithmetic with void* is not allowed in C++ (why???), so should be wrapped in macros that cast pointers into char*. Once implemented, the warning -Wpointer-arith should be restored (currently excluded in setup.py). (#142)

NaNs are misaligned in the datatable display widget

A real-valued column is displayed as

    col6
--------
 581.654
-405.503
 382.895
 nan    
 195.676
-156.03 
 nan    
  56.936
-899.825

The nans are aligned as if they had an implicit decimal dot at the end, which is incorrect. Instead, nans should always be right-aligned.
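A minimal sketch of the proposed fix (the formatting details are illustrative, not the actual widget code): numbers are formatted as usual, but "nan" is right-justified within the column width instead of being padded on the right.

```python
import math

# Sketch: right-align "nan" within the column width instead of padding
# it as if it had a trailing decimal part.

def format_column(values, width=8):
    out = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            out.append("nan".rjust(width))     # right-aligned
        else:
            out.append(f"{v:>{width}.6g}")
    return out

col = [581.654, float("nan"), -156.03]
assert format_column(col) == [" 581.654", "     nan", " -156.03"]
```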

fread: race condition between threads writing into a string buffer

The pushBuffer() call happens in order, but outside of the "ordered" section. As a consequence, several threads may execute their pushBuffer code at different speeds, and by the time each computes the location to write to within the output buffer, they may already write out of order. Although this remains only a theoretical possibility so far, it is better to fix this behavior.
Suggested fix: each thread gets its own output string buffer. Before the "ordered" section we insert a "pre-push" callback, where each thread may copy the input into its intermediate buffer, performing any transformations as necessary (for example, unescaping doubled double-quotes or backslash-escapes, or decoding strings into UTF-8). Then, within the "ordered" section, we increment the offsets of each output buffer pointer. Finally, within the pushBuffer call that happens afterwards, we mem-copy the intermediate buffers into the final result buffer.

0-rows dataset cannot be displayed

import datatable as dt
f = dt.open("~/userdata")
f(lambda f: f.col3 > 100000)

produces exception ValueError: max() arg is an empty sequence in line 345 of widget.py
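The likely culprit is a bare `max()` over an empty sequence of cell widths, which raises exactly this ValueError when the frame has 0 rows. Python's built-in `max` accepts a `default=` argument for this case (the variable names below are illustrative, not the actual widget.py code):

```python
# With 0 rows there are no rendered cells, so the width list is empty
# and a bare max(widths) raises ValueError. Passing default= fixes it.

rendered_rows = []                     # a 0-row frame renders no cells
widths = [len(cell) for row in rendered_rows for cell in row]
width = max(widths, default=0)         # instead of bare max(widths)
assert width == 0
```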

Implement `dt.verify_integrity()` function

This function should check whether the data in the datatable is consistent, and correct it if possible. In particular, the following checks are needed:

  • all the checks already performed by the window() function
  • check that boolean columns contain only values 0, 1, and NA
  • check that string/categorical columns contain valid UTF-8 strings
  • check that string/categorical columns have offsets that are monotonically increasing (in abs value) and do not go out-of-range
  • check that string data column in a string section is padded with 0xFFs
  • check validity of all rollup values
  • (to be continued)
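Two of the checks above can be sketched on plain-Python stand-ins for the column data (function names are illustrative, not the actual C implementation):

```python
# Sketch of two integrity checks: boolean columns may hold only 0, 1,
# or NA; string offsets must be monotonically non-decreasing in
# absolute value (negative offsets encode NA strings).

def check_bool_column(values):
    return all(v in (0, 1, None) for v in values)

def check_offsets(offsets):
    return all(abs(a) <= abs(b) for a, b in zip(offsets, offsets[1:]))

assert check_bool_column([0, 1, None, 1])
assert not check_bool_column([0, 2])
assert check_offsets([1, 6, -6, 11])
assert not check_offsets([1, 6, 4])
```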
