dataframes's People

Contributors

iamrecursion, kustosz, ljkania, magalame, mwu-tow, piotrmocz, sylwiabr

dataframes's Issues

Dataframes Tutorial

Pretty self-explanatory. Ideally, we should include sections for people transitioning from other platforms, like:

  • pandas
  • R
  • Alteryx (?)

Building C++ parts of library should not be unoptimized by default

Currently, running cmake without any additional arguments yields an unoptimized build.
As the library relies heavily on compiler optimization to achieve sensible performance, the build should either be optimized by default or require the user to specify the build type.

Plotting int(int) chart gives floating axis labels on Y axis

Consider the following:

import Std.Base
import Std.Foreign.C.Value

import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test


def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None

[image: chart screenshot]

The Y axis is labeled using floating-point format (i.e. 1.0 instead of 1). The X axis is, for some reason, fine.

According to @sylwiabr this does not happen on Linux. Mac remains to be checked.

Overhaul of the Time-related types

This is in the context of time series:

  1. TimeInterval for expressing difference between two timestamps
  2. some kind of correspondence between Luna's Std.Time and the timestamps we use for time series.

LQuery support for chunked arrays

General Summary

Currently, the LQuery interpreter requires columns to be built from a single array. Chunked arrays don't happen "naturally", but users can create them manually, and they can also be the result of some operations (namely the recently added shift method).

Matplotlib integration

Summary

We should provide a matplotlib binding for the most common classes of plots.

Value

This allows users to create a variety of plots useful in EDA.

Specification

Bind several plots from matplotlib and seaborn:

  1. Scatter
  2. Histogram
  3. Heat matrix
  4. Several distribution plots from seaborn (KDE)

Each of these should be wrapped in a Luna-flavored (immutable) API and provide the most common configuration options. They should all work with Luna Studio and also be exportable to image files.
We should also provide an API for specifying axis/plot names and for combining plots, both by stacking them and by laying them out in a grid.

Acceptance Criteria & Test Cases

Tested by eyeballing the visualizations.

Handling missing values

Summary

We need to allow users to handle (filter/fill) missing values in a dataframe.

Value

It's a fundamental feature for any data library.

Specification

We need a 2x2 matrix of functions for filling/dropping NAs per the whole table or a single column:

  1. Table.dropNa – removes all rows where any of the values is missing.
  2. Table.fillNa x – replaces all NA occurrences in the table with x.
  3. Table.dropNaAt columnName – removes all rows where value in the given column is missing.
  4. Table.fillNaAt colName x – changes all NAs to x inside colName column.

Performance should be comparable to mapping/filtering a simple predicate over the table.
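
The intended semantics of the four functions can be sketched in plain Python. This is only a reference model for comparing outputs, not the library's implementation: it assumes a table is a dict mapping column names to equally long lists, that a missing value is represented by None, and the snake_case function names are hypothetical stand-ins for the Luna methods listed above.

```python
NA = None  # assumption: a missing value is modelled as None


def drop_na(table):
    """Model of Table.dropNa: drop every row in which any value is missing."""
    cols = list(table)
    n = len(table[cols[0]]) if cols else 0
    keep = [i for i in range(n) if all(table[c][i] is not NA for c in cols)]
    return {c: [table[c][i] for i in keep] for c in cols}


def fill_na(table, x):
    """Model of Table.fillNa x: replace every missing value with x."""
    return {c: [x if v is NA else v for v in vals] for c, vals in table.items()}


def drop_na_at(table, col_name):
    """Model of Table.dropNaAt: drop rows where the given column is missing."""
    keep = [i for i, v in enumerate(table[col_name]) if v is not NA]
    return {c: [vals[i] for i in keep] for c, vals in table.items()}


def fill_na_at(table, col_name, x):
    """Model of Table.fillNaAt: replace missing values in one column only."""
    return {
        c: [x if (c == col_name and v is NA) else v for v in vals]
        for c, vals in table.items()
    }


table = {"a": [1, None, 3], "b": [None, 5, 6]}
```

All four functions return a new table and leave the input untouched, matching the immutable style used elsewhere in the library.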

Acceptance Criteria & Test Cases

Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.

Revise the current API

Some points that have been raised:

  • max of a column is not a number, but a column (I understand why that is so, but it is not nice to work with)
  • The map over a column works in a way that gets me totally lost. I need to specify the type of the column: but I don't know it in the first place. That's a big win for pandas, where I can map like I know it.

Writing file with relative path is not working correctly

Repro steps:

import Dataframes.Column 
import Dataframes.Types
import Dataframes.Table 
l1 = [1,2,3,4,5]
l2 = [11,12,13,14,15]
col1 = Column.fromList "col1" Int64Type l1
col2 = Column.fromList "col2" Int64Type l2
table = Table.fromColumns [col1 , col2]
CSVGenerator . writeFile "./data/foo.csv" table

does not save the file to the ./data folder, while giving the full path saves everything correctly.

Required behavior:
We should not expand relative paths magically. They should be resolved against the current working directory, which should be set to the project root. We need to check whether it is set that way or whether the problem is caused by something else.
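
For reference, the required behavior corresponds to the usual rule of resolving a relative path against the process's working directory. A minimal Python sketch of that rule follows; the function name resolve_output_path and its cwd parameter are hypothetical and only there to make the rule testable.

```python
from pathlib import Path


def resolve_output_path(path, cwd=None):
    """Resolve `path` the way the OS would: relative to the working directory.

    `cwd` is a hypothetical override for testing; by default the real
    current working directory of the process is used.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    p = Path(path)
    # Absolute paths are returned unchanged; relative ones are joined to base.
    return p if p.is_absolute() else base / p
```

If the working directory is the project root, "./data/foo.csv" ends up in the project's data folder; if the file lands elsewhere, the working directory is the first thing to check.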

Strange box (empty legend?) in the plot

Consider the following:

import Std.Base
import Std.Foreign.C.Value

import Dataframes.Internal.Utils
import Dataframes.Array
import Dataframes.Column
import Dataframes.Table
import Dataframes.Types
import Dataframes.Internal.Test.Test


def main:
    xCol = Column.fromList "x" Int64Type [1,2,3,4,5]
    yCol = Column.fromList "y" Int64Type [1,2,3,4,5]
    table1 = Table.fromColumns [xCol, yCol]
    chart1 = table1.chart
    plot1 = chart1.plot "x" "y"
    None

[image: chart screenshot]

In the chart, a small box can be seen next to 5.0 on the Y axis. This is likely an empty legend box.
When there is no legend, there should be no legend box.

Add new type of plot: violinPlot

Connect a violin plot from Seaborn to Luna's Dataframes library: https://seaborn.pydata.org/generated/seaborn.violinplot.html

Seaborn is a plotting library that is already connected with Dataframes. The violin plot should be connected in the same way as the KDE plots.
The following modification methods should be available for violin plots:

  • setLabel label
  • setColor color
  • setInner
  • split
  • setPalette
  • setLinewidth
  • setOrientation

Examples and implementation details will be provided when the issue is picked up.

Is C++ required?

Since Luna can do pointer arithmetic and should be pretty fast on its own, why not implement the library in pure Luna, without using C?

shortRep method for all types

As reported:
2) [minor but annoying] most of the DataFrame-related types do not have shortReps (Histogram, Column, etc)

Review matplotlib-cpp and related code

The matplotlib-cpp code and our code in Plot.cpp / Learn.cpp do not meet our standards in some regards (conventions, error handling, polluting standard output and so on).

The code should be checked, actionable issues should be identified and either fixed or written down to our backlog.

Dataframe API docstrings

For every public function there needs to be a docstring outlining its general purpose and parameters. For key functions we need a more detailed description of how the functions work and some examples.

As a side-task, we need to come up with a template for a good docstring (the long one). Here is some inspiration: https://www.python.org/dev/peps/pep-0257/ and the pandas API is well documented, with lots of examples.

NOTE: this is a rather cumbersome task, but very important.

A true groupBy operation for Dataframes

@mwu-tow knows the details -- this task is just to keep track of the progress.

Short explanation: right now we can only do a groupBy operation followed by some aggregate. This task aims to expose a standalone groupBy functionality.
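
To make the distinction concrete, here is a rough Python model of a standalone groupBy that returns the groups themselves rather than an aggregate. It assumes a table is a dict of equally long lists; the name group_by and the returned shape (key mapped to sub-table) are illustrative assumptions, not the planned API.

```python
from collections import defaultdict


def group_by(table, key_col):
    """Split `table` into one sub-table per distinct value of `key_col`."""
    groups = defaultdict(lambda: {c: [] for c in table})
    for i, key in enumerate(table[key_col]):
        for c, vals in table.items():
            groups[key][c].append(vals[i])
    return dict(groups)


table = {"k": ["a", "b", "a"], "v": [1, 2, 3]}
groups = group_by(table, "k")
```

Any aggregate can then be applied to each sub-table afterwards, which is what the current groupBy-plus-aggregate API does in a single step.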

Appveyor Builds for C++

Summary

Currently it is proving difficult to know whether the C++ components of dataframes build successfully on all of our supported platforms. To ensure that they do, Appveyor CI (for Windows support) should be set up to build these components.

Value

Having this set up will allow a faster development cadence as we can rely on the CI infrastructure to detect issues ahead of time, rather than finding the issues later on.

Specification

  • Set up the dataframes repo with Appveyor CI.
  • Have Appveyor CI build the C++ components of this repo on Linux, macOS and Windows.

Acceptance Criteria & Test Cases

  • Appveyor is able to build this code, and reports build success or failure for every commit.

Filtering and mapping facilities using a DSL

Summary

This concerns the implementation of the each and filter functions on a Table.

Value

These are the basic functions for querying the frame and data exploration.

Specification

The API–level description is provided in this Gist: https://gist.github.com/kustosz/49e1c588de4c1513cf91b18dd6342c15

This library should use the specified JSON format for exchanging queries between Luna and the C++ engine.

Simple operations such as:

df.each v: v.at "NUM_INSTALMENT_VERSION" * 2 + 4
df.filter v: v.at "NUM_INSTALMENT_VERSION" > 2
df.filter v: v.at "NUM_INSTALMENT_VERSION" > v.at "NUM_INSTALMENT_NUMBER"

should take no longer than 200 ms (pandas takes ~130 ms for each of these), where df is the installments_payments.csv file from the Credit Default Risk competition at Kaggle.
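
For checking correctness (not performance), the three queries above can be modelled row by row in plain Python. This sketch says nothing about the JSON query format or the C++ engine; it assumes a table is a dict of equally long lists, and Row, each and filter_rows are hypothetical names mirroring the Luna examples.

```python
class Row:
    """Read-only view of one row, exposing `at` as in the examples above."""

    def __init__(self, table, index):
        self._table = table
        self._index = index

    def at(self, col_name):
        return self._table[col_name][self._index]


def _row_count(table):
    return len(next(iter(table.values()))) if table else 0


def each(table, fn):
    """Apply `fn` to every row and collect the results into a new column."""
    return [fn(Row(table, i)) for i in range(_row_count(table))]


def filter_rows(table, predicate):
    """Keep only the rows for which `predicate` holds."""
    keep = [i for i in range(_row_count(table)) if predicate(Row(table, i))]
    return {c: [vals[i] for i in keep] for c, vals in table.items()}


df = {"NUM_INSTALMENT_VERSION": [1, 3], "NUM_INSTALMENT_NUMBER": [2, 2]}
doubled = each(df, lambda v: v.at("NUM_INSTALMENT_VERSION") * 2 + 4)
```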

Acceptance Criteria & Test Cases

The provided functions need to return correct values. The returned values will be compared to pandas outputs on the same queries.

Support for custom operations on rolling windows

Right now we are restricted to a set of hard-coded operations. Ideally, we would provide some kind of an apply function that allows us to pass any function (or rather: any LQuery expression).
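
What such an apply function would compute can be sketched in Python. The window convention here is an assumption (a trailing window of fixed size, with None for positions where the window is not yet full), and rolling_apply is a hypothetical name.

```python
def rolling_apply(values, window, func):
    """Apply `func` to each trailing window of `window` consecutive values.

    Positions that do not yet have a full window yield None (assumption).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(func(values[i + 1 - window : i + 1]))
    return out
```

For example, rolling_apply(prices, 14, some_function) would cover the hard-coded operations as special cases, with some_function being any callable over a list of values.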

Hardcoded RSI function on the rolling window

As a workaround for not being able to apply an arbitrary function to a rolling window, we need the RSI to be hardcoded. The (Python) code is as follows:

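def rsi(values):
    # `values` is expected to be a NumPy array or pandas Series (boolean indexing is used).
    up = values[values > 0].mean()          # mean of the positive values
    down = -1 * values[values < 0].mean()   # mean of the negative values, sign flipped
    return 100 * up / (up + down)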

(see this kernel for more info)

Support for timestamp type

Add support for a timestamp column field type. A timestamp shall be treated internally as an int64_t holding the count of nanoseconds since the epoch. In the future it is desirable to allow other Arrow-specified units, but we ignore this for now.
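
As an illustration of the representation, the following Python sketch converts between timezone-aware datetimes and nanoseconds since the Unix epoch using integer arithmetic only. The function names are hypothetical, and Python's datetime only carries microsecond precision, so sub-microsecond digits are dropped on the way back.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def to_epoch_nanos(dt):
    """Convert a timezone-aware datetime to nanoseconds since the epoch."""
    delta = dt - EPOCH  # raises TypeError for naive datetimes
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    nanos = micros * 1_000
    if not INT64_MIN <= nanos <= INT64_MAX:
        raise OverflowError("timestamp does not fit into int64 nanoseconds")
    return nanos


def from_epoch_nanos(nanos):
    """Convert nanoseconds since the epoch back to a UTC datetime."""
    return EPOCH + timedelta(microseconds=nanos // 1_000)
```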

List of things that need to work:

  • csv io
  • xlsx write
  • xlsx read
  • interpolate
  • lquery type
  • lquery ast literal value support
  • lquery relational ops
  • lquery adapt other existing functions
  • lquery add functions to decompose date
  • stats
  • matplotlib charts
  • luna support: accessors
  • luna support: builders
  • luna support: translating to Time type
  • luna support: lquery: translating from Time type

… more to come

reading/writing files

For reading and writing files we should have:

  1. CSV.read and CSV.write functions
  2. general methods that check the file extension: Table.read and Table.write
    If the extension is txt, or there is none, it should try to read the file with all available methods.
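
The dispatch rule can be sketched as follows. This is a Python model under stated assumptions: READERS is a hypothetical registry from file extension to reader function, only a CSV reader (built on Python's csv module) is registered, and "all available methods" is interpreted as trying each registered reader in turn until one succeeds.

```python
import csv
import tempfile
from pathlib import Path


def read_csv(path):
    """Read a CSV file into a list of rows (lists of strings)."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Hypothetical registry; further formats would be registered here.
READERS = {".csv": read_csv}


def read_table(path, readers=READERS):
    """Pick a reader by file extension; for .txt or no extension, try them all."""
    ext = Path(path).suffix.lower()
    if ext in readers:
        return readers[ext](path)
    if ext in ("", ".txt"):
        errors = []
        for name, reader in readers.items():
            try:
                return reader(path)
            except Exception as exc:  # remember the failure and try the next reader
                errors.append(f"{name}: {exc}")
        raise ValueError(f"no reader could parse {path}: {errors}")
    raise ValueError(f"unsupported file extension: {ext!r}")


# Usage: a .txt file containing CSV data is picked up by the fallback path.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("x,y\n1,2\n")
    rows = read_table(sample)
```

A matching write function would dispatch the same way, except that a missing or txt extension gives no hint about the desired output format, so that case probably needs an explicit default.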

`sort` method for Dataframe

General Summary

We need to allow the user to sort the dataframe by the values in a column.

Motivation

It is a basic feature for a data library

Specification

Table.sort colNames ascending naPosition – sorts the values in a dataframe and returns a new, sorted dataframe.

  • colNames – a list of column names to sort by.
  • ascending – a list of bools selecting ascending vs. descending order for each column; its length must match the length of colNames.
  • naPosition – one of {‘first’, ‘last’}, default ‘last’; first puts NaNs at the beginning, last puts NaNs at the end.
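
A reference model of these semantics in Python, useful for comparing outputs on small inputs. It assumes a table is a dict of equally long lists with None for missing values, and it places missing values first or last regardless of sort direction (which is how pandas behaves); sort_table is a hypothetical name.

```python
def sort_table(table, col_names, ascending, na_position="last"):
    """Return a new table sorted by `col_names`, most significant column first."""
    if len(col_names) != len(ascending):
        raise ValueError("ascending must have one entry per column name")
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")

    n = len(next(iter(table.values()))) if table else 0
    order = list(range(n))
    # Stable sorts applied from the least to the most significant key.
    for name, asc in reversed(list(zip(col_names, ascending))):
        col = table[name]
        present = [i for i in order if col[i] is not None]
        missing = [i for i in order if col[i] is None]
        present.sort(key=lambda i: col[i], reverse=not asc)
        order = missing + present if na_position == "first" else present + missing
    return {c: [vals[i] for i in order] for c, vals in table.items()}


table = {"a": [2, 1, None, 1], "b": [1, 2, 3, 1]}
```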

Acceptance Criteria & Test Cases

Will be tested manually on large dataframes, by comparing the outputs with equivalent pandas operations.

Illegal instruction when saving CSV on Mac

I got this reported twice through Discord, once from @sylwiabr and once from @kustosz.
When writing a CSV file, an error mentioning an illegal instruction occurs:

[SUCCESS] column 2: [3, 6, 9, 12] == [3, 6, 9, 12]
Generate case 1
zsh: illegal hardware instruction  LUNA_LIBS_PATH=/Users/marcinkostrzewa/code/luna-core/stdlib  run --target

or

Running in interpreted mode.
Illegal instruction: 4

Apparently it is enough to run the dataframes Luna tests to reproduce the issue.
The issue was observed only on Mac.
As I have no Mac (and I don't imagine a VM compiling Luna), I'd like to ask for help in diagnosing the issue:

  • crashdump
  • CPU on which it happened
  • disassembly around the crashing instruction

`interpolate` method for Table and Column

General Summary

Add interpolate method that will fill missing values:

  • with linearly interpolated value when there are available values before and after nulls
  • with first/last non-null value for leading/trailing nulls

If a column does not contain any valid values, then interpolate() does nothing.
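
The three rules can be modelled on a single column in Python. The sketch assumes that missing values are None and that rows are evenly spaced, i.e. interpolation is done by row position rather than by an index or timestamp value.

```python
def interpolate(values):
    """Fill None entries: linearly between valid neighbours, and with the
    nearest valid value for leading and trailing gaps."""
    valid = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    if not valid:
        return out  # nothing to interpolate from: the column is left unchanged

    first, last = valid[0], valid[-1]
    for i in range(first):                    # leading nulls
        out[i] = values[first]
    for i in range(last + 1, len(values)):    # trailing nulls
        out[i] = values[last]
    for lo, hi in zip(valid, valid[1:]):      # interior gaps
        step = (values[hi] - values[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            out[i] = values[lo] + step * (i - lo)
    return out
```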

RSI should not return non-normal values

The RSI function can currently return non-normal values, e.g. -Infinity when given only positive values.
RSI should always return either:

  • a number from the 0 to 100 interval whenever possible
  • a null value otherwise
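
One possible reading of this requirement, sketched in pure Python. The handling of the edge cases is an assumption and not something this issue settles: a window with no losses is treated as RSI 100, a window with no gains as RSI 0, and a window with neither (empty or all zeros) yields a null, represented here by None. The name safe_rsi is hypothetical.

```python
import math


def safe_rsi(values):
    """RSI over a window of differences, returning None instead of NaN/Infinity."""
    gains = [v for v in values if v > 0]
    losses = [v for v in values if v < 0]
    up = sum(gains) / len(gains) if gains else 0.0       # mean gain, 0 if none
    down = -sum(losses) / len(losses) if losses else 0.0  # mean loss as a positive number
    if up + down == 0:
        return None  # no movement at all: RSI is undefined
    result = 100 * up / (up + down)
    # Guard against overflow or other non-finite intermediate results.
    if not math.isfinite(result) or not 0 <= result <= 100:
        return None
    return result
```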

Dataframes 1.0 Epic

Summary

  • This section should summarise the work we want to accomplish during the epic.

Value

  • A description of the value this epic brings to users.
  • The motivation behind this epic.

Specification

  • The high-level requirements of the epic.
  • Any performance requirements for the epic.

Acceptance Criteria & Test Cases

  • The high-level acceptance criteria for the epic.
  • The test plan for the epic.

Simple statistics

Summary

This provides a set of simple statistical functions, useful in exploratory data analysis.

Value

Fundamental for providing a streamlined flow for EDA.

Specification

This comprises several different functionalities, listed below:

  1. Column.countValues – should return a 2 columns by n rows frame, where n is the number of different values in the column. Each row should be of the shape (value, count), where count is the number of occurrences of value in the column.
  2. Column.countMissing – return the count of missing values in the column.
  3. Table.{min, max, median, mean, std, var, sum, quantile n} – should be self explanatory. Consult documentation of pandas when in doubt. We are not accepting any special options here, just implement the simplest versions of each.
  4. Table.correlations – returns an ncols x ncols Table with pairwise correlation coefficients between columns. Consult pandas's corr documentation for details.

The performance of each operation should be comparable (within 50% margin) to its pandas counterpart.
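
The expected shape of the first two results can be illustrated in Python. The sketch assumes a column is a list with None for missing values and that countValues skips missing values (as pandas's value_counts does by default); the output column names "value" and "count" are hypothetical.

```python
from collections import Counter


def count_values(column):
    """Model of Column.countValues: a two-column frame of (value, count) rows."""
    counts = Counter(v for v in column if v is not None)
    return {"value": list(counts), "count": list(counts.values())}


def count_missing(column):
    """Model of Column.countMissing: the number of missing entries."""
    return sum(1 for v in column if v is None)


column = ["a", "b", "a", None]
```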

Acceptance Criteria & Test Cases

Tested manually, by comparing to relevant pandas output on the same data source.
