
nimdata's Introduction

NimData

Overview

NimData is a data manipulation and analysis library for the Nim programming language. It combines Pandas-like syntax with the type-safe, lazy APIs of distributed frameworks like Spark/Flink/Thrill. Although NimData is currently non-distributed, it harnesses the power of Nim to perform out-of-core processing at native speed.

NimData's core data type is the generic DataFrame[T]. All DataFrame methods are based on the MapReduce paradigm and fall into two categories:

  • Transformations: Operations like map or filter transform one DataFrame into another. Transformations are lazy, meaning that they are not executed until an action is called. They can also be chained.
  • Actions: Operations like count, min, max, sum, reduce, fold, collect, or show perform an aggregation on a DataFrame. Calling an action triggers the processing pipeline.

For a complete list of NimData's supported operations, see the module docs.
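For example, here is a minimal sketch of the transformation/action split (as in the examples below, the x => ... lambda syntax is assumed to be available after import nimdata):

import nimdata

# Nothing is computed here yet: filter and map are lazy transformations.
let evenSquares = DF.fromRange(0, 100)
                    .filter(x => x mod 2 == 0)
                    .map(x => x * x)

# The sum action triggers the whole pipeline.
echo evenSquares.sum()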

Installation

  1. Install Nim and ensure that both Nim and Nimble (Nim's package manager) are added to your PATH.
  2. From the command line, run $ nimble install NimData (this will download NimData's source from GitHub to ~/.nimble/pkgs).

Quickstart

Hello, World!

Once NimData is installed, we'll write a simple program to test it. Create a new file named test.nim with the following contents:

import nimdata

echo DF.fromRange(0, 10).collect()

From the command line, use $ nim c -r test.nim to compile and run the program (c for compile, and -r to run directly after compilation). It should print this sequence:

# => @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Pandas users: This is roughly equivalent to print(pd.DataFrame(range(10))[0].values)

Reading raw text data

Next we'll use this German soccer data set to explore NimData's main functionality.

To create a DataFrame which simply iterates over the raw text content of a file, we can use DF.fromFile():

let dfRawText = DF.fromFile("examples/Bundesliga.csv")

Note that fromFile is a lazy operation, meaning that NimData doesn't actually read the contents of the file yet. To read the file, we need to call an action on our dataframe. Calling count, for example, triggers a line-by-line reading of the file and returns the number of rows:

echo dfRawText.count()
# => 14018

We can chain multiple operations on dfRawText. For example, we can use take to filter the file down to its first five rows, and show to print the result:

dfRawText.take(5).show()
# =>
# "1","Werder Bremen","Borussia Dortmund",3,2,1,1963,1963-08-24 09:30:00
# "2","Hertha BSC Berlin","1. FC Nuernberg",1,1,1,1963,1963-08-24 09:30:00
# "3","Preussen Muenster","Hamburger SV",1,1,1,1963,1963-08-24 09:30:00
# "4","Eintracht Frankfurt","1. FC Kaiserslautern",1,1,1,1963,1963-08-24 09:30:00
# "5","Karlsruher SC","Meidericher SV",1,4,1,1963,1963-08-24 09:30:00

Pandas users: This is equivalent to print(dfRawText.head(5)).

Note, however, that every time an action is called, the file is read from scratch, which is inefficient. We'll improve on that in a moment.

Type-safe schema parsing

At this stage, dfRawText's data type is a plain DataFrame[string]. It also doesn't have any column headers, and the first field isn't a proper index, but rather contains string literals. Let's transform our dataframe into something more useful for analysis:

const schema = [
  strCol("index"),
  strCol("homeTeam"),
  strCol("awayTeam"),
  intCol("homeGoals"),
  intCol("awayGoals"),
  intCol("round"),
  intCol("year"),
  dateCol("date", format="yyyy-MM-dd hh:mm:ss")
]
let df = dfRawText.map(schemaParser(schema, ','))
                  .map(record => record.projectAway(index))
                  .cache()

This code does three things:

  1. The schemaParser macro constructs a specialized parsing function for each field, which takes a string as input and returns a type-safe named tuple corresponding to the type definition in schema. For instance, dateCol("date") tells the parser that the last column is named "date" and contains datetime values. We can even specify the datetime format by passing a format string to dateCol() as a named parameter. A key benefit of defining the schema at compile time is that the parser produces highly optimized machine code, resulting in very fast performance.

  2. The projectAway macro transforms each parsed record into a new tuple with the "index" field removed (Pandas users: this is roughly equivalent to dfRawText.drop(columns=['index'])). See also projectTo, which instead keeps only the specified fields, and addFields, which extends the schema with new fields.

  3. The cache method stores the parsing result in memory. This allows us to perform multiple actions on the data without having to re-read the file contents every time. Spark users: In contrast to Spark, cache is currently implemented as an action.

Now we can perform the same operations as before, but this time our dataframe contains the parsed tuples:

echo df.count()
# => 14018

df.take(5).show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Werder B… | "Borussia… |          3 |          2 |          1 |       1963 | 1963-08-2… |
# | "Hertha B… | "1. FC Nu… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Preussen… | "Hamburge… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Eintrach… | "1. FC Ka… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Karlsruh… | "Meideric… |          1 |          4 |          1 |       1963 | 1963-08-2… |
# +------------+------------+------------+------------+------------+------------+------------+

Note that instead of starting the pipeline from dfRawText and using caching, we could always write the pipeline from scratch:

DF.fromFile("examples/Bundesliga.csv")
  .map(schemaParser(schema, ','))
  .map(record => record.projectAway(index))
  .take(5)
  .show()
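As a hedged variation on the same pipeline, projectTo (used again in the Unique values section below) keeps only the named fields instead of dropping index:

DF.fromFile("examples/Bundesliga.csv")
  .map(schemaParser(schema, ','))
  .map(record => record.projectTo(homeTeam, awayTeam))
  .take(5)
  .show()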

Filter

Data can be filtered using the filter transformation. For instance, we can filter the data to get only the games of a certain team:

import strutils

df.filter(record =>
    record.homeTeam.contains("Freiburg") or
    record.awayTeam.contains("Freiburg")
  )
  .take(5)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Bayern M… | "SC Freib… |          3 |          1 |          1 |       1993 | 1993-08-0… |
# | "SC Freib… | "Wattensc… |          4 |          1 |          2 |       1993 | 1993-08-1… |
# | "Borussia… | "SC Freib… |          3 |          2 |          3 |       1993 | 1993-08-2… |
# | "SC Freib… | "Hamburge… |          0 |          1 |          4 |       1993 | 1993-08-2… |
# | "1. FC Ko… | "SC Freib… |          2 |          0 |          5 |       1993 | 1993-09-0… |
# +------------+------------+------------+------------+------------+------------+------------+

Note: Without importing strutils, the call to contains would fail here with a type mismatch error.

Or search for games with many home goals:

df.filter(record => record.homeGoals >= 10)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Borussia… | "Schalke … |         11 |          0 |         18 |       1966 | 1967-01-0… |
# | "Borussia… | "Borussia… |         10 |          0 |         12 |       1967 | 1967-11-0… |
# | "Bayern M… | "Borussia… |         11 |          1 |         16 |       1971 | 1971-11-2… |
# | "Borussia… | "Borussia… |         12 |          0 |         34 |       1977 | 1978-04-2… |
# | "Borussia… | "Arminia … |         11 |          1 |         12 |       1982 | 1982-11-0… |
# | "Borussia… | "Eintrach… |         10 |          0 |          8 |       1984 | 1984-10-1… |
# +------------+------------+------------+------------+------------+------------+------------+

Note that we can now fully benefit from type safety: the compiler knows the exact fields and types of a record, so no dynamic field lookup or type casting is required. Assumptions about the data structure are moved to the earliest possible step in the pipeline, so processing fails early if they are wrong. After transitioning into the type-safe domain, the compiler helps verify the correctness of even long processing pipelines, reducing the risk of runtime errors.

Other filter-like transformations are:

  • take, which takes the first N records (as already seen above).
  • drop, which discards the first N records.
  • filterWithIndex, which allows defining a filter function that takes both the index and the element as input.

Collecting data

A DataFrame[T] can be converted easily into a seq[T] (Nim's native dynamic arrays) by using collect:

echo df.map(record => record.homeGoals)
       .filter(goals => goals >= 10)
       .collect()
# => @[11, 10, 11, 12, 11, 10]

Numerical aggregation

A DataFrame of a numerical type provides aggregation functions like min, max, and mean. For example:

echo "Min date: ", df.map(record => record.year).min()
echo "Max date: ", df.map(record => record.year).max()
echo "Average home goals: ", df.map(record => record.homeGoals).mean()
echo "Average away goals: ", df.map(record => record.awayGoals).mean()
# =>
# Min date: 1963
# Max date: 2008
# Average home goals: 1.898130974461407
# Average away goals: 1.190754743900699

# Let's find the biggest defeat (largest goal difference)
let maxDiff = df.map(record => (record.homeGoals - record.awayGoals).abs).max()
df.filter(record => (record.homeGoals - record.awayGoals) == maxDiff)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Borussia… | "Borussia… |         12 |          0 |         34 |       1977 | 1978-04-2… |
# +------------+------------+------------+------------+------------+------------+------------+
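The reduce action listed in the overview can express similar aggregations. Here is a hedged sketch (the exact lambda signature expected by reduce is an assumption, not taken from the module docs):

# Total number of goals scored over all games, once via sum and once via reduce.
let goalsPerGame = df.map(record => record.homeGoals + record.awayGoals)
echo goalsPerGame.sum()
echo goalsPerGame.reduce((a, b) => a + b)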

Sorting

A DataFrame can be transformed into a sorted DataFrame using the sort() method. Without any arguments, sort uses the default comparison over all columns. By specifying a key function and a sort order, we can, for instance, rank the games by the number of away goals:

df.sort(record => record.awayGoals, SortOrder.Descending)
  .take(5)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Tasmania… | "Meideric… |          0 |          9 |         27 |       1965 | 1966-03-2… |
# | "Borussia… | "TSV 1860… |          1 |          9 |         29 |       1965 | 1966-04-1… |
# | "SSV Ulm"  | "Bayer Le… |          1 |          9 |         25 |       1999 | 2000-03-1… |
# | "Rot-Weis… | "Eintrach… |          1 |          8 |         32 |       1976 | 1977-05-0… |
# | "Borussia… | "Bayer Le… |          2 |          8 |         10 |       1998 | 1998-10-3… |
# +------------+------------+------------+------------+------------+------------+------------+

Unique values

The DataFrame[T].unique() transformation filters a DataFrame to unique elements. This can be used for instance to find the number of teams that appear in the data:

echo df.map(record => record.homeTeam).unique().count()
# => 52

Pandas user note: In contrast to Pandas, there is no differentiation between a one-dimensional Series and a multi-dimensional DataFrame (unique vs drop_duplicates). unique works the same for any hashable type T, e.g., we might as well get a DataFrame of unique pairs:

df.map(record => record.projectTo(homeTeam, awayTeam))
  .unique()
  .take(5)
  .show()
# =>
# +------------+------------+
# | homeTeam   | awayTeam   |
# +------------+------------+
# | "Werder B… | "Borussia… |
# | "Hertha B… | "1. FC Nu… |
# | "Preussen… | "Hamburge… |
# | "Eintrach… | "1. FC Ka… |
# | "Karlsruh… | "Meideric… |
# +------------+------------+

Value counts

The DataFrame[T].valueCounts() transformation extends the functionality of unique() by returning the unique values and their respective counts. The type of the transformed DataFrame is a tuple of (key: T, count: int), where T is the original type.

In our example, we can use valueCounts() for instance to find the most frequent results in German soccer:

df.map(record => record.projectTo(homeGoals, awayGoals))
  .valueCounts()
  .sort(x => x.count, SortOrder.Descending)
  .map(x => (
    homeGoals: x.key.homeGoals,
    awayGoals: x.key.awayGoals,
    count: x.count
  ))
  .take(5)
  .show()
# =>
# +------------+------------+------------+
# |  homeGoals |  awayGoals |      count |
# +------------+------------+------------+
# |          1 |          1 |       1632 |
# |          2 |          1 |       1203 |
# |          1 |          0 |       1109 |
# |          2 |          0 |       1092 |
# |          0 |          0 |        914 |
# +------------+------------+------------+

This pipeline first projects the data onto a named tuple of (homeGoals, awayGoals). After applying valueCounts(), the data frame is sorted by the counts. The final map() is purely cosmetic, flattening the nested (key: (homeGoals: int, awayGoals: int), count: int) tuple back into a flat result.

DataFrame viewer

DataFrames can be opened and inspected in the browser by using df.openInBrowser(), which offers a simple JavaScript-based data browser:

[Screenshot: viewer example]

Note that the viewer uses static HTML, so it should only be applied to small or heavily filtered DataFrames.
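For example, a hedged sketch that opens only a small, filtered subset in the viewer:

df.filter(record => record.year == 2008)
  .openInBrowser()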

Benchmarks

More meaningful benchmarks are still on the to-do list; this section only shows a few initial results. The benchmarks will be split into small (data that fits into memory, so we can easily compare against Pandas or R) and big (where we can only compare against out-of-core frameworks).

All implementations are available in the benchmarks folder.

Basic operations (small data)

The test data set is a CSV file with 1 million rows, containing two int and two float columns. The test tasks are:

  • Parse/Count: Just the most basic operations -- iterating over the file, parsing each row, and returning a count.
  • Column Averages: The same steps, plus computing the means of all four columns.

The results are the average runtime in seconds over three runs:

Task               NimData   Pandas   Spark (4 cores)   Dask (4 cores)
Parse/Count          0.165    0.321             1.606            0.182
Column Averages      0.259    0.340             1.179            0.622

Note that Spark internally caches the file over the three runs, so its first iteration is much slower (> 3 sec) while the last iterations reach run times of about 0.6 sec (the data is obviously too small to justify Spark's overhead anyway).

Next steps

  • More transformations:
    • map
    • filter
    • flatMap
    • sort
    • unique
    • valueCounts
    • groupBy (reduce)
    • groupBy (transform)
    • join (inner)
    • join (outer)
    • concat/union
    • window
  • More actions:
    • numerical aggregations (count, min, max, sum, mean)
    • collect
    • show
    • openInBrowser
  • More data formats/sources
    • csv
    • gzipped csv
    • parquet
    • S3
  • REPL or Jupyter kernel?
  • Plotting (maybe in the form of Bokeh bindings)?

License

This project is licensed under the terms of the MIT license.

nimdata's People

Contributors

bluenote10, data-man, davidjamesknight, dom96, hiteshjasani, jorisbontje, narimiran, oskca, vindaar, zacharycarter

nimdata's Issues

Add other data sources (JSON, HDF5, Arrow) for loading / saving data frames

Hi! I'm just checking out NimData and would like to contribute something for Hacktoberfest and thought adding some other datatypes would be a good way to get started.

JSON would be very useful - it's the format I find myself using the most besides CSV, and I don't think I'd have that much trouble adding it. I think I'd look at adding a new DataFrame type (similar to FileRowsDataFrame, but yielding each block of text that is a JSON object, and then the mapping function would convert the blocks of text and create the dataframe as a sequence of tuples defined by the schema.

HDF5 might be a bit ambitious - I've used it in a side project with Nim, but I did this by writing some convenience functions in C that called into libhdf5 and then calling them from Nim.

Apache Arrow is a common data layer for columnar analytics, though I haven't looked at its C API layer, so I don't know how feasible this would be.

Transform field

Hi, I could not find a way to transform a field (say, for cleanup) in a map operation. Is there something for this?
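
Not an authoritative answer, but one workaround sketch that only uses what the README shows: rebuild the record tuple inside map, applying the cleanup to the field in question. The field names follow the soccer schema from the README, and strip comes from strutils:

import strutils

let cleaned = df.map(record => (
  homeTeam: record.homeTeam.strip(),
  awayTeam: record.awayTeam.strip(),
  homeGoals: record.homeGoals,
  awayGoals: record.awayGoals,
  round: record.round,
  year: record.year,
  date: record.date
))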

Read from stdin

Does it make sense (or does it already exist) to add the capability to cat a file into a NimData program? Something like
cat datafile | mynimdataprogram
Granted, a file like the one above already exists and could be handled with the existing procs. But there are also situations where you might be following a file that regularly gets new data appended, and you may want to periodically show the result.
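
Not a built-in feature as far as this README shows, but a hedged workaround sketch: read stdin into a seq[string] and hand it to the existing DF.fromSeq constructor (the schema-based parsing step would then be identical to the file-based examples):

import nimdata

var rawLines: seq[string] = @[]
for line in stdin.lines:
  rawLines.add(line)

# From here on the usual pipeline applies, e.g.
# DF.fromSeq(rawLines).map(schemaParser(schema, ','))
echo DF.fromSeq(rawLines).count()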

median() fails for ints

This sample program will not compile.

import strutils
import nimdata
import nimdata/utils

let input = @[
  "Jon;22",
  "Bart;33",
  "Bob;49",
  "Jack;12",
  "Moe;58",
]
const schema = [
  strCol("name"),
  intCol("age")
]

let df = DF.fromSeq(input)
           .map(schemaParser(schema, ';'))
           .cache()

echo "count: ", df.count()

echo "min: ", df.map(x => x.age).min()
echo "max: ", df.map(x => x.age).max()
echo "mean: ", df.map(x => x.age).mean()
echo "median: ", df.map(x => x.age).median()
template/generic instantiation of `median` from here                                        
.../.nimble/pkgs/nimdata-0.1.0/nimdata.nim(680, 54) Error: type mismatch: got <int64, int literal(2)>
but expected one of:  
proc `/`(x, y: float32): float32
  first type mismatch at position: 1
  required type for x: float32
  but expression 'values[n div 2] + values[n div 2 - 1]' is of type: int64
proc `/`(x, y: int): float
  first type mismatch at position: 1
  required type for x: int
  but expression 'values[n div 2] + values[n div 2 - 1]' is of type: int64
proc `/`(x, y: float): float
  first type mismatch at position: 1
  required type for x: float
  but expression 'values[n div 2] + values[n div 2 - 1]' is of type: int64
proc `/`(head, tail: string): string
  first type mismatch at position: 1
  required type for head: string
  but expression 'values[n div 2] + values[n div 2 - 1]' is of type: int64

expression: values[n div 2] + values[n div 2 - 1] / 2
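
Until that is fixed, a possible workaround (sketch) is to convert the column to float before calling median, so the division inside median is well-typed:

echo "median: ", df.map(x => x.age.float).median()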

Compile error with nim 1.2.0

I'm in the process of trying to upgrade some code to Nim 1.2.0 and ran into some problems. So I tried running the NimData testcases under 1.2.0 and got similar errors.

$ nimble test
# no errors

$ nimble tests
[...]

/Users/hitesh/dev/NimData/src/nimdata.nim(529, 16) Error: type mismatch: got <DataFrame[system.int]>
but expected one of: 
macro collect(init, body: untyped): untyped
  first type mismatch at position: 2
  missing parameter: body

expression: collect(df)
stack trace: (most recent call last)
/private/var/folders/f4/gzyr8mcn5yv0jvdv5l47zygr0000gn/T/nimblecache/nimscriptapi.nim(165, 16)
/Users/hitesh/dev/NimData/nimdata_97051.nims(52, 12) testsTask
/Users/hitesh/dev/NimData/nimdata_97051.nims(47, 8) runTest
/Users/hitesh/.choosenim/toolchains/nim-1.2.0/lib/system/nimscript.nim(260, 7) exec
/Users/hitesh/.choosenim/toolchains/nim-1.2.0/lib/system/nimscript.nim(260, 7) Error: unhandled exception: FAILED: nim c --cc:gcc --verbosity:0 -r -d:travis -d:testNimData --outdir:bin tests/all.nim [OSError]
     Error: Exception raised during nimble script execution

Most of test suite doesn't always run

I want to say how happy I am that you made this library. It's for that reason that I started looking at the testing and realized it might have a problem.

I don't know how the tests are configured, but I expected all of them to run each time. Instead, I only see a tiny subset run consistently. The full suite only runs after some code changes. You can ignore this if it's working as intended, but my perspective is that there should be a bigger chunk of the test suite that runs every time.

To reproduce:

$ git checkout master
$ nimble tests
# you'll see compilations and lots of test runs scroll by

# Without changing anything, run it a second time
$ nimble tests

# You'll see messages about the few tests listed below.  I've removed the test output for clarity.

  Executing task tests in .../NimData/nimdata.nimble              
 *** Running tests in: tests/all.nim                                            
 *** Running tests in: src/nimdata/tuples.nim
 *** Running tests in: examples/example_01.nim
 *** Running tests in: examples/example_02.nim

Quickly looking at tests/all.nim, I am starting to think it's a configuration problem: it seems to only recompile (and therefore run) a portion of the test suite after code changes. And while we see messages that tests are running in the second run, I don't see the [Suite] xxxx and [OK] yyyy messages that usually tell me a test ran successfully.

I'm expecting to see this kind of output on each run:

[Suite] Reduce/Fold Actions
  [OK] reduce
  [OK] fold

[Suite] Numerical Actions
  [OK] sum
  [OK] mean
  [OK] min
  [OK] max
  [OK] population stdev
  [OK] sample stdev

[Suite] Specific types
  [OK] DataFrame[string]

[Suite] Immutability
  [OK] Scalars
  [OK] Ref types 1
  [OK] Ref types 2
  [OK] Tuples

Allow reading a subset of columns in CSV parsing

Currently it is necessary to specify the full schema of a CSV. Sometimes it might be preferred to parse only a subset of the available columns.

Idea 1:

Define the schema as:

schema = [
  dummyCol,
  dummyCol,
  col(IntCol, "required"),
  dummyCol,
  col(StrCol, "required_too")
]

Drawback: Schema definition is verbose, but could be mitigated by helper functions.

Idea 2:

Define the schema as:

schema = [
  col(IntCol, "required", fromCol = 2),
  col(StrCol, "required_too", fromCol = 4)
]

or even:

schema = [
  col(IntCol, "required", fromCol = "col_name_required"),
  col(StrCol, "required_too", fromCol = "another_name")
]

where fromCol refers to names in the CSV header.

slower than anaconda python with MKL

ok, maybe this is not a bug

I use Windows 7 (64-bit) to run the benchmarks in NimData's benchmarks folder.

Python 3.4.4 |Anaconda 2.3.0 (64-bit)| (default, Feb 16 2016, 09:54:04) [MSC v.1600 64 bit (AMD64)] on win32, and I changed all xrange to range

Nim was cloned and built yesterday, reporting Nim Compiler Version 0.17.1 (2017-06-17) [Windows: amd64].

 *** Summary:
Count (no parsing, pure Python)             min:  1.070    mean:  1.089    max:  1.100
Count (Pandas)                              min:  1.244    mean:  1.335    max:  1.493
Column averages                             min:  1.257    mean:  1.273    max:  1.289
Unique values 1                             min:  1.276    mean:  1.318    max:  1.364
Unique values 2                             min:  1.364    mean:  1.447    max:  1.584
Join                                        min: 13.702    mean: 14.427    max: 15.473
$ nim c -r -d:release basic_tests.nim
...
Running: Pure iteration                          min:  0.398    mean:  0.445    max:  0.480
Running: With parsing                            min:  3.274    mean:  3.293    max:  3.315
Running: With dummy parsing                      min:  0.446    mean:  0.467    max:  0.478
Running: With parsing (using manual parser 1)    min:  4.014    mean:  4.129    max:  4.237
Running: With parsing (using manual parser 2)    min:  3.211    mean:  3.269    max:  3.355
Running: With parsing + 1 dummy map              min:  3.301    mean:  3.346    max:  3.401
Running: With parsing + 2 dummy map              min:  3.261    mean:  3.328    max:  3.374
Running: With parsing + 1 dummy filter           min:  3.238    mean:  3.295    max:  3.339
Running: With parsing + 2 dummy filter           min:  3.311    mean:  3.362    max:  3.391
Running: With caching                            min:  3.373    mean:  3.433    max:  3.493
Running: Column averages                         min:  3.456    mean:  3.491    max:  3.557
Running: Unique values 1                         min:  3.272    mean:  3.308    max:  3.343
Running: Unique values 2                         min:  3.553    mean:  3.616    max:  3.695
Running: Join                                    6.788816779836687e-005
6.788816779836687e-005
6.788816779836687e-005
min: 21.818    mean: 22.074    max: 22.295

Could not load zlib1.dll - Noob question

I'm sorry to ask this question, both because it has already come up in issue #43 and because I'm sure this is not a problem with NimData but most likely just me not being able to figure out Windows.

New to Nim, and I haven't used Windows for the last 12 years, so I'm a total noob in every way.

choosenim 0.8.4
Nim 1.6.6
Windows 11

I hope a kind soul will help me out here

How to append a record (row) to a DF?

Sorry for the silly question here,

How to append a row to an existing dataframe?

Thanks :)

import nimdata

let dfA = DF.fromSeq(@[
  (name: "A", age: 99)
])

let newrecord = (name: "B", age: 100)

...
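
Not an official API as far as the README shows, but one hedged workaround: since collect returns a seq, you can materialize, append, and rebuild with fromSeq.

var rows = dfA.collect()
rows.add(newrecord)
let dfB = DF.fromSeq(rows)
echo dfB.count()  # => 2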

Make column-based DataFrame ?

What do you think about creating a DataFrame type that's more column-based (list of Column types) rather than row-based (seq[T] of tuples based on schema)?

Taking some of the example operations from example_01.nim:

# Existing
df.map(x => x.age).mean()

# New
df["age"].mean()

This (and other numeric operations on an individual column) should be faster because we could return a shallowCopy view of the DF's column rather than creating a new MappedDataFrame (copying the needed data) and then running the summary function on the new dataframe.

I'd like to try and do this as part of Hacktoberfest.

Type error when running filter example in documentation

Hi, I'm new to Nim (coming from Python/Pandas) and I'm trying to follow the docs but get a type error at the filter example. The example passes a string to contains for the second arg, but contains seems to expect an arg of type T?

My code:

import nimdata

let df_raw = DF.fromFile("Bundesliga.csv")

const schema = [
  strCol("index"),
  strCol("homeTeam"),
  strCol("awayTeam"),
  intCol("homeGoals"),
  intCol("awayGoals"),
  intCol("round"),
  intCol("year"),
  dateCol("date", format="yyyy-MM-dd hh:mm:ss")
]

let df = df_raw.map(schemaParser(schema, ','))
    .map(record => record.projectAway(index))
    .cache()

df.filter(record =>
    record.homeTeam.contains("Freiburg") or
    record.awayTeam.contains("Freiburg")
  )
  .take(5)
  .show()

Error:

.../.choosenim/toolchains/nim-1.4.4/lib/pure/times.nim(2110, 39) Hint: 'parse' cannot raise 'Defect' [XCannotRaiseY]
.../data.nim(18, 5) template/generic instantiation of `cache` from here
.../.nimble/pkgs/nimdata-0.1.0/nimdata.nim(545, 8) Warning: method has lock level 0, but another method has <unknown> [LockLevel]
.../data.nim(21, 20) Error: type mismatch: got <string, string>
but expected one of: 
proc contains[T](a: openArray[T]; item: T): bool
  first type mismatch at position: 2
  required type for item: T
  but expression '"Freiburg"' is of type: string
2 other mismatching symbols have been suppressed; compile with --showAllMismatches:on to see them

expression: contains(record.homeTeam, "Freiburg")
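
This matches the note in the README's Filter section: contains for two string arguments comes from strutils, so the likely fix is simply adding the import at the top of the file:

import strutils  # provides contains(s, sub: string)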

I'd be happy to help update the docs once I understand the issue here. Thanks!

`groupBy` with a `DataFrame`

Is it possible to pass in a DataFrame that can be aligned/joined with the original Dataframe to allow for a list of values to group by?

Thanks for your wonderful package!

Feature request: new column kind - seq

I do not know Nim so well, help me, please.

import macros

import strutils
import parseutils
import times

type
  ColKind* = enum
    StrCol,
    IntCol,
    FloatCol,
    DateCol,
    SeqCol

  ColIntBase* = enum
    baseBin,
    baseOct,
    baseDec,
    baseHex

  Column* = object # TODO: this should get documented: https://forum.nim-lang.org/t/196
    name*: string
    case kind*: ColKind
    of IntCol:
      base: ColIntBase
    of StrCol:
      stripQuotes: bool
    of DateCol:
      format: string
    of SeqCol:
      seqKind: ColKind
      seqIntBase: ColIntBase
      seps: set[char]
    else:
      discard

proc strCol*(name: string): Column =
  Column(kind: StrCol, name: name)

proc intCol*(name: string, base: ColIntBase = baseDec): Column =
  Column(kind: IntCol, name: name, base: base)

proc floatCol*(name: string): Column =
  Column(kind: FloatCol, name: name)

proc dateCol*(name: string, format: string = "yyyy-MM-dd"): Column =
  Column(name: name, kind: DateCol, format: format)

proc seqCol*(name: string, kind: ColKind, base: ColIntBase, seps: set[char] = {' ', ','}): Column =
  Column(name: name, kind: SeqCol, seqKind: kind, seqIntBase: base, seps: seps)

proc parseBin[T: SomeSignedInt](s: string, number: var T, start = 0): int  {.
  noSideEffect.} =
  var i = start
  var foundDigit = false
  if s[i] == '0' and (s[i+1] == 'b' or s[i+1] == 'B'): inc(i, 2)
  while true:
    case s[i]
    of '_': discard
    of '0'..'1':
      number = number shl 1 or (ord(s[i]) - ord('0'))
      foundDigit = true
    else: break
    inc(i)
  if foundDigit: result = i-start

proc parseOct[T: SomeSignedInt](s: string, number: var T, start = 0): int  {.
  noSideEffect.} =
  var i = start
  var foundDigit = false
  if s[i] == '0' and (s[i+1] == 'o' or s[i+1] == 'O'): inc(i, 2)
  while true:
    case s[i]
    of '_': discard
    of '0'..'7':
      number = number shl 3 or (ord(s[i]) - ord('0'))
      foundDigit = true
    else: break
    inc(i)
  if foundDigit: result = i-start

proc parseHex[T: SomeSignedInt](s: string, number: var T, start = 0; maxLen = 0): int {.
  noSideEffect.}  =
  var i = start
  var foundDigit = false
  if s[i] == '0' and (s[i+1] == 'x' or s[i+1] == 'X'): inc(i, 2)
  elif s[i] == '#': inc(i)
  let last = if maxLen == 0: s.len else: i+maxLen
  while i < last:
    case s[i]
    of '_': discard
    of '0'..'9':
      number = number shl 4 or (ord(s[i]) - ord('0'))
      foundDigit = true
    of 'a'..'f':
      number = number shl 4 or (ord(s[i]) - ord('a') + 10)
      foundDigit = true
    of 'A'..'F':
      number = number shl 4 or (ord(s[i]) - ord('A') + 10)
      foundDigit = true
    else: break
    inc(i)
  if foundDigit: result = i-start
  
template skipPastSep*(s: untyped, i: untyped, hitEnd: untyped, sep: char) =
  while s[i] != sep and i < s.len:
    i += 1
  if i == s.len:
    hitEnd = true
  else:
    i += 1

template skipOverWhitespace*(s: untyped, i: untyped) =
  while (s[i] == ' ' or s[i] == '\t') and i < s.len:
    i += 1


macro schemaType*(schema: static[openarray[Column]]): untyped =
  ## Creates a type corresponding to a given schema (the return
  ## type of the generated ``schemaParser`` proc).
  result = newNimNode(nnkTupleTy)
  for col in schema:
    # TODO: This can probably done using true types + type.getType.name
    let typ = case col.kind
      of StrCol: bindSym"string"
      of IntCol: bindSym"int64"
      of FloatCol: bindSym"float"
      of DateCol: bindSym"Time"
      of SeqCol:
        case col.seqKind
          of StrCol: bindSym"seq"
          of IntCol: bindSym"seq"
          of FloatCol: bindSym"seq"
          of DateCol: bindSym"seq"
          of SeqCol: bindSym"seq"
    result.add(
      newIdentDefs(name = newIdentNode(col.name), kind = typ)
    )


macro schemaParser*(schema: static[openarray[Column]], sep: static[char]): untyped =
  ## Creates a schema parser proc, which takes a ``string`` as input and
  ## returns a the parsing result as a tuple, with types corresponding to
  ## the given ``schema``
  # Adding `extraArgs: varargs[untyped]` doesn't seem to work :(

  # TODO: Why can't I just use:
  # var returnType = schemaType(schema)
  # /home/fabian/github/NimData/src/nimdata/schema_parser.nim(58, 30) Error: type mismatch: got (openarray[Column])
  # but expected one of:
  # macro schemaType[](schema: static[openArray[Column]]): untyped

  var returnType = newNimNode(nnkTupleTy)
  for col in schema:
    # TODO: This can probably done using true types + type.getType.name
    let typ = case col.kind
      of StrCol: bindSym"string"
      of IntCol: bindSym"int64"
      of FloatCol: bindSym"float"
      of DateCol: bindSym"Time"
      of SeqCol:
        case col.seqKind
          of StrCol: bindSym"seq"
          of IntCol: bindSym"seq"
          of FloatCol: bindSym"seq"
          of DateCol: bindSym"seq"
          of SeqCol: bindSym"seq"
    returnType.add(
      newIdentDefs(name = newIdentNode(col.name), kind = typ)
    )
  when defined(checkMacros):
    #echo returnType.treeRepr
    echo returnType.repr

  template fragmentSkipPastSep(sep: char) =
    skipPastSep(s, i, hitEnd, sep)

  template fragmentReadStr(field: untyped, sep: char) =
    ## read string
    copyFrom = i
    skipPastSep(s, i, hitEnd, sep)
    if not hitEnd:
      field = substr(s, copyFrom, i-2)
    else:
      field = substr(s, copyFrom, s.len)

  template fragmentReadDate(field: untyped, sep: char, format: string) =
    ## read string
    copyFrom = i
    skipPastSep(s, i, hitEnd, sep)
    let s =
      if not hitEnd:
        substr(s, copyFrom, i-2)
      else:
        substr(s, copyFrom, s.len)
    try:
      field = times.toTime(times.parse(s, format))
    except ValueError:
      # TODO: more systematic logging/error reporting system
      let e = getCurrentException()
      field = times.Time(0)
      echo "[WARNING] Failed to parse '" & s & "' as a time (" & e.msg & "). Setting value to " & times.`$`(field)

  template fragmentReadIntBin(field: untyped) =
    ## read binary int
    i += parseBin(s, field, start=i)

  template fragmentReadIntOct(field: untyped) =
    ## read octal int
    i += parseOct(s, field, start=i)

  template fragmentReadIntDec(field: untyped) =
    ## read decimal int
    i += parseBiggestInt(s, field, start=i)

  template fragmentReadIntHex(field: untyped) =
    ## read hexadecimal int
    i += parseHex(s, field, start=i)

  template fragmentReadFloat(field: untyped) =
    ## read float
    skipOverWhitespace(s, i)
    i += parseBiggestFloat(s, field, start=i)

  template fragmentReadSeq(field: untyped) =
    ## read seq
    discard

  template bodyHeader() {.dirty.} =
    var i = 0
    var hitEnd = false
    var copyFrom = 0

  var body = getAst(bodyHeader())

  for i, col in schema.pairs:

    let fieldExpr = newDotExpr(ident("result"), ident(col.name)) # the `result.columnBlah` expression
    let sepExpr = newLit(sep)

    var requiresAdvancePastSep = true

    case col.kind
    of StrCol:
      let call = getAst(fragmentReadStr(fieldExpr, sepExpr))
      body.add(call)
      # for a StrCol we don't need the call to fragmentSkipPastSep, because
      # the string extraction already advances past the separator
      requiresAdvancePastSep = false
    of DateCol:
      let call = getAst(fragmentReadDate(fieldExpr, sepExpr, col.format))
      body.add(call)
      # for a DateCol we don't need the call to fragmentSkipPastSep, because
      # the string extraction already advances past the separator
      requiresAdvancePastSep = false
    of IntCol:
      case col.base
      of baseBin:
        let call = getAst(fragmentReadIntBin(fieldExpr))
        body.add(call)
      of baseOct:
        let call = getAst(fragmentReadIntOct(fieldExpr))
        body.add(call)
      of baseDec:
        let call = getAst(fragmentReadIntDec(fieldExpr))
        body.add(call)
      of baseHex:
        let call = getAst(fragmentReadIntHex(fieldExpr))
        body.add(call)
      requiresAdvancePastSep = true
    of FloatCol:
      let call = getAst(fragmentReadFloat(fieldExpr))
      body.add(call)
      requiresAdvancePastSep = true
    of SeqCol:
      let call = getAst(fragmentReadSeq(fieldExpr))
      body.add(call)
      requiresAdvancePastSep = false

    # If it is not the last column and advancing past the separator is required
    if requiresAdvancePastSep and i < schema.len - 1:
      let call = getAst(fragmentSkipPastSep(sepExpr))
      body.add(call)

  let params = [
    returnType,
    newIdentDefs(name = newIdentNode("s"), kind = newIdentNode("string"))
  ]
  result = newProc(params=params, body=body, procType=nnkLambda)
  when defined(checkMacros):
    #echo result.treerepr
    echo result.repr

Optional Supression of Date Parse Warning?

Does this seem like a sensible option to add?

I have a date column where there are tens of thousands of 0 values. I like the warning initially, but very quickly it fills up the terminal.

To avoid the warnings I could (should I?) try to handle this prior to passing the data to the schema parser using a regex? Maybe just use an intCol and create a new column that is parsed? Any suggestions are appreciated.

Implement: Split-Apply-Combine with LINQ

First of all: This is awesome! I'm willing to test the hell out of this library.

This is related to: #13

The split-apply-combine pattern, i.e. grouping data and then transforming the original data based on aggregate functions computed per group (for example sum, count, or max for each group), is extremely useful for data science.

Regarding the syntax: Python's pandas and R's dplyr use a custom syntax. Julia's DataFramesMeta uses LINQ.

I believe the LINQ approach is much more powerful in terms of extensibility and could have a very nice pipe syntax in Nim. It could also extend well to multiple backends (SQL, Hadoop, Feather, ...).

Here are two sample scikit-learn transformers that compute the ticket frequency on the Kaggle Titanic dataset, showcasing LINQ-style split-apply-combine vs Pandas:

Python

class PP_TicketFreqTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.df_Ticket = pd.DataFrame()

    def fit(self, X, y=None, **fit_params):
        self.df_Ticket = pd.DataFrame(X)
        return self

    def transform(self, X):
        dfX = pd.DataFrame(X)
        df = dfX
        if not df.equals(self.df_Ticket):
            df = pd.concat([self.df_Ticket, df]).reset_index()
        return dfX.assign(
            CptTicketFreq = df.groupby('Ticket')['Ticket']
                            .transform('count')
        )

Julia -- Piping is done with |>

type PP_TicketFreqTransformer <: ScikitLearnBase.BaseEstimator
    df_TicketFreq::DataFrame
    PP_TicketFreqTransformer() = new()
end

@declare_hyperparameters(PP_TicketFreqTransformer, Symbol[])

function ScikitLearnBase.fit!(self::PP_TicketFreqTransformer, X::DataFrame, y=nothing)
    self.df_TicketFreq = X
    return self
end

function ScikitLearnBase.transform(self::PP_TicketFreqTransformer, X::DataFrame)
    df = ifelse(isequal(X,self.df_TicketFreq), X, vcat(X,self.df_TicketFreq))
    @linq  df |>  by(:PassengerId,CptTicketFreq = length(:Ticket)) |>
        join(X, on=:PassengerId)
end

Full code in Python and Julia

Here is a link to a LINQ idea on Nim's forum as well

[feature] Transpose rows/columns of DataFrame

I have some data in a different format than what DataFrame expects. As I started looking around, I saw that something like pandas' transpose would be what I need. It would be handy in NimData.

For example:

Name    jan   feb   mar
james     1     2     3
sally     4     5     6
wendy     7     8     9

into

Date    james   sally   wendy
jan         1       4       7
feb         2       5       8
mar         3       6       9

Could not load zlib1.dll

Hi, I used nimble to install nimdata. Then I tried the most basic example suggested

import nimdata

echo DF.fromRange(0, 10).collect()

I got an error Could not load zlib1.dll
Is this by any chance a common and easily fixable error?

... oh! I just found that reinstallation via choosenim made this error go away.

Is there a way to represent MISSING VALUE in NimData?

A missing float can be represented by NaN, but how can int and string values be expressed as MISSING? Using SPACE or ZERO is maybe not a good idea. Is there a general way to express a MISSING VALUE in NimData, as in Pandas?

Nimdata and the JavaScript target

Thank you for your excellent library! I'm trying to use NimData with the JavaScript backend (compiling via nim js ...), but I get a compile error in one of the dependencies.

Any way to avoid this?

Provide a basic `show` method

It may look something like Spark's show:

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

Mean and other operations over columns

I just spent a couple of hours trying to do this and there appears to be no way of achieving this :\

I basically have something like this:

1,34,13,2.5,129
1,56,34,34.2,19

And I want to take the mean (and median) of each column (and get a single row with this data for each column)
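
Not an official answer, but with the current row-based API one hedged workaround is a separate map/mean pipeline per column over a cached DataFrame (the file name and column names below are made up for illustration):

import nimdata

const schema = [
  intCol("c1"), intCol("c2"), intCol("c3"),
  floatCol("c4"), intCol("c5")
]
let df = DF.fromFile("data.csv")
           .map(schemaParser(schema, ','))
           .cache()

# One small pipeline per column; cache() ensures the file is parsed only once.
echo (c1: df.map(x => x.c1).mean(),
      c2: df.map(x => x.c2).mean(),
      c4: df.map(x => x.c4).mean())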

[feature] Handle empty sequences

If an empty sequence gets passed when creating a data frame, unexpected results can occur. It would be nice if the code handled this well so the client didn't have to do an empty sequence check on their side.

count: 2
min: 1
max: 17
mean: 9.0
stdev: 8.0
-------------------------------------------
count: 0
min: 9223372036854775807
max: -9223372036854775808
mean: nan
.../nimdata_example_2.nim(36) nimdata_example_2
.../nimdata_example_2.nim(31) analyze
.../.nimble/pkgs/nimdata-0.1.0/nimdata.nim(697) stdev
Error: unhandled exception: stdev requires at least two data points [ValueError]
Error: execution of an external program failed: '.../nimdata_example_2 '

Sample program

import strutils, times
import nimdata
import nimdata/utils

type
  Foo = object
    fname: string
    bar_at: DateTime
    baz_at: DateTime

let input1: seq[Foo] = @[
  Foo(fname: "sally", bar_at: initDateTime(1, mJan, 2019, 8, 14, 23),
      baz_at: initDateTime(18, mJan, 2019, 8, 14, 23))
  ,Foo(fname: "wendy", bar_at: initDateTime(4, mJan, 2019, 10, 09, 14),
       baz_at: initDateTime(6, mJan, 2019, 8, 00, 45))
]

let input2: seq[Foo] = @[]

proc analyze(xs: seq[Foo]) =
  let df = DF.fromSeq(xs)
             .cache()

  echo "count: ", df.count()

  let df1 = df.map(x => inDays(x.baz_at - x.bar_at))
              .cache()
  echo "min: ", df1.min()
  echo "max: ", df1.max()
  echo "mean: ", df1.mean()
  echo "stdev: ", df1.stdev()

when isMainModule:
  analyze(input1)
  echo "-------------------------------------------"
  analyze(input2)

BTW, I was pleasantly surprised that the code handled my custom type without me having to do anything extra in generating a DataFrame. Really nice use of Nim's generics!

Nim version check fails on 0.19.0

The Nim version check when NimMinor >= 18 and NimPatch > 0
in https://github.com/bluenote10/NimData/blob/master/src/nimdata/schema_parser.nim#L176-L179
doesn't work correctly on 0.19.0 (since NimPatch is 0 again), so it falls back to the old times code, which causes this compilation error:

../../../.nimble/pkgs/nimdata-0.1.0/nimdata/schema_parser.nim(179, 27)
Error: type mismatch: got <int literal(0)> but expected 'Time = object'

A better solution would be to use:
when (NimMajor, NimMinor, NimPatch) > (0, 18, 0)

Note that this same version check also appears in some other locations throughout the codebase and would also cause compilation failures in different code paths.
