
datamancer's Introduction

Datamancer is part of the SciNim ecosystem (scinim: the core types and functions of the SciNim ecosystem).

datamancer's People

Contributors

angelezquerra, angelezquerraatkeysight, graveflo, hugogranstrom, ire4ever1190, pietroppeter, quimt, ringabout, vindaar


datamancer's Issues

Parsing CSV files without header is problematic

When trying to parse a CSV file that has no header, we still attempt to read a header, resulting in problems. Using colNames doesn't help (although it does assign the correct column names).

Maybe we need an option that lets the caller state explicitly that the file has no header.
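
A minimal reproduction sketch of the behaviour described above (the file name and column names are made up for illustration):

import datamancer

# a headerless two-column file
writeFile("noheader.csv", "1,a\n2,b\n3,c\n")
# colNames assigns the desired names, but per the report the parser still
# attempts to read a header, resulting in problems
let df = readCsv("noheader.csv", colNames = @["id", "label"])
echo df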

Hash collisions in large DFs

Our initial assumption that simple stdlib hashing (nowadays murmur hash) is good enough becomes problematic once we deal with discrete classes on the order of 100,000 elements.

Initially this was never an intended use case, but playing around I did find it useful. The following demonstrates the problem:

import datamancer
import std/[tables, sets, hashes]
# Note: `hashColumn` is an internal datamancer routine; depending on the
# version it may need to be imported from the relevant submodule directly.

proc hashtest(num: int) =
  # build num * (num - 1) distinct "i/j" strings
  var ids = newSeqOfCap[string](num * num)
  for i in 0 ..< num:
    for j in 0 ..< num:
      if i != j:
        ids.add $i & "/" & $j
  let c = toColumn ids
  # hash the whole column the same way the DataFrame internals do
  var s = newSeq[Hash](c.len)
  s.hashColumn(c, finish = true)
  # map each hash back to the first string that produced it and report clashes
  var aset = initTable[int, string]()
  for i, x in s:
    if x notin aset:
      aset[x] = ids[i]
    else:
      echo "Already contained! ", x
      echo "with value ", aset[x]
      echo "trying to insert ", ids[i]
      echo "Compute hash manually ", hash(ids[i]), "  vs  ", hash(aset[x])
  echo "length ", s.len
  echo "set length ", s.toHashSet().card
  echo "ids to hash set ", ids.toHashSet().card
hashtest(400)

If we run this, we get 4 hash collisions.

The underlying problem is that we use hashes internally for uniqueness but never take actual value equality into account. That is simply because the hashing approach exists precisely to work around handling multiple types.

We'll replace it with a Value-based approach that is a bit smarter than our initial implementation.
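
As a hedged illustration of why keying on the actual values fixes this (this is not datamancer's implementation, just the general idea): a table keyed by the value itself resolves hash collisions via equality, so distinct values are never merged.

import std/tables

# count occurrences keyed by the value itself; the table's internal hashing
# may still collide, but equality on the key keeps distinct values apart
var counts = initTable[string, int]()
for id in ["0/1", "1/0", "0/1"]:
  counts.mgetOrPut(id, 0).inc
echo counts   # duplicates merged by equality, never by hash alone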

--panics:on causes dataframe test failure

I need --panics:on because I'm using CPS, but when I enable it, a test fails:

Traceback (most recent call last)
/home/adavidoff/git/datamancer/tests/testDf.nim(1508) testDf
/home/adavidoff/git/datamancer/src/datamancer/column.nim(1048) toNativeColumn
/home/adavidoff/nims/lib/std/assertions.nim(46) failedAssertImpl
/home/adavidoff/nims/lib/std/assertions.nim(36) raiseAssert
/home/adavidoff/nims/lib/system/fatal.nim(49) sysFatal
/home/adavidoff/nims/lib/system/fatal.nim(37) sysFatal
Error: unhandled exception: /home/adavidoff/git/datamancer/src/datamancer/column.nim(1048, 9) `cValue[i].kind == vKind` Column contains actual multiple datatypes! VString and VInt! [AssertionDefect]

I'm using v0.3.11 at commit 6d63164.

Nim Compiler Version 1.9.3 [Linux: amd64]
Compiled at 2023-04-01
Copyright (c) 2006-2023 by Andreas Rumpf

git hash: 1c7fd717206c79be400f81a05eee771823b880ca
active boot switches: -d:release

Lifting operations out of loops not working on accented quotes

Initially I didn't even want to support lifting operations that use accented quotes. Now, though, I'm not entirely sure what my reasoning was. I suppose detecting whether something needs to be lifted is simpler if it's an explicit col access.

If we cannot support it, it would be great to at least raise a CT error when such usage is seen.

edit: right, there's even still a comment about exactly this issue, oops.
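
For context, a hedged illustration of the two spellings involved (the column name and formulas are made up; both forms are standard datamancer formula syntax, but whether lifting applies is exactly what this issue is about):

import datamancer

let df = toDf({"x": @[1, 2, 3]})
# explicit access: the formula macro sees idx()/col() calls directly,
# which is what the lifting detection currently keys on
echo df.mutate(f{int: "doubled" ~ idx("x") * 2})
# accented-quote access: same meaning; this is the spelling for which lifting
# (e.g. of a reduction such as max(`x`)) reportedly does not trigger
echo df.mutate(f{int: "doubled" ~ `x` * 2})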

parsing CSV file with 2 empty newlines at end fails

If the file ends with two newlines (i.e. one fully empty line), our CSV parser chokes.

It's a bit tricky: either we check for this in the first line-counting pass (which slows it down), or we shrink the final tensor afterwards. Need to think about the most elegant solution.

To reproduce, take a working CSV file, add two newlines at the end and run.
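
A minimal reproduction sketch (file name made up for illustration):

import datamancer

# a valid two-row file followed by one fully empty line (two trailing newlines)
writeFile("trailing.csv", "a,b\n1,2\n3,4\n\n")
echo readCsv("trailing.csv")   # reported to choke on the empty trailing line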

datamancer breaks nim CI

@Vindaar
how come nimble search datamancer doesn't list this?
also:
https://github.com/nim-lang/Nim/pull/18384/checks?check_run_id=2980455174

PASS: [7/35] fusion c                                                     (20.13 sec)
  FAIL: [8/35] ggplotnim c                                                  (59.81 sec)
  Test "ggplotnim" in category "nimble-packages"
  Failure: reBuildFailed
  nim c -d:noCairo -r tests/tests.nim
  Hint: used config file '/Users/runner/work/Nim/Nim/config/nim.cfg' [Conf]
  Hint: used config file '/Users/runner/work/Nim/Nim/config/config.nims' [Conf]
  Hint: used config file '/Users/runner/work/Nim/Nim/pkgstemp/ggplotnim/tests/config.nims' [Conf]
  .........................................................................................................................
  Compiling draw as a dummy proc
  .......
  /Users/runner/.nimble/pkgs/nimblas-0.2.2/nimblas/private/common.nim(50, 7) Hint: Using BLAS library with name: libblas.dylib [User]
  ........................................................................................................................
  /Users/runner/.nimble/pkgs/nimlapack-0.2.0/nimlapack.nim(19, 7) Hint: Using LAPACK library with name: liblapack.dylib [User]
  ........................................................................
  /Users/runner/.nimble/pkgs/datamancer-0.1.7/datamancer/dataframe.nim(1435, 29) template/generic instantiation of `{}` from here
  /Users/runner/.nimble/pkgs/datamancer-0.1.7/datamancer/formula.nim(818, 12) Warning: What kind? nnkIdent in node false [User]
  /Users/runner/.nimble/pkgs/datamancer-0.1.7/datamancer/dataframe.nim(1435, 29) template/generic instantiation of `{}` from here
  /Users/runner/.nimble/pkgs/datamancer-0.1.7/datamancer/formula.nim(818, 12) Warning: What kind? nnkBracketExpr in node df[localCol][idx] [User]
  FormulaNode(name: "(== isNull(df[localCol][idx]).toBool false)",
              colName: "(== isNull(df[localCol][idx]).toBool false)",
              kind: fkVector, resType: toColKind(type(bool)), fnV: proc (
      df: DataFrame): Column =
    let localColT = df[localCol, Value]
    var res = newTensor[bool](df.len)
    for idx in 0 ..< df.len:
      res[idx] = isNull(localColT[idx]).toBool == false
    result = toColumn res)
  .....
  /Users/runner/work/Nim/Nim/pkgstemp/ggplotnim/src/ggplotnim/ggplot_types.nim(509, 3) Error: not all cases are covered; missing: {fkNone}
  
  PASS: [9/35] httpauth c                                                   ( 7.15 sec)

readCsv does not consider quotes for values as expected

Hi, first of all thank you for your wonderful work on Datamancer, it's a really nice library. Here is a glitch (at least for me) that I've found.
I'm testing on Datamancer 0.3.17 / Windows 10.
If a CSV field is surrounded by double quotes, I would expect that:

  1. The double quotes are "consumed", i.e. the extracted string no longer contains them
  2. They allow ignoring characters that would otherwise act as separators, line breaks, etc. within the quoted field, as described in the documentation

This is what happens with std/parsecsv, but apparently it does not happen with datamancer's readCsv for values (while it's fine for column names).
I've enclosed a very small snippet to show the concept for point 1. One drawback (maybe it's intended behaviour, but it would be nice to give the user the option to choose parsecsv-like behaviour) is that digit-only columns (like "Two" and "Four" in the example) cannot be converted to float or int without first modifying the strings. While CSV files like this may appear exotic, they are not: if someone exports a DB table to a CSV file and is unsure about the end user's regional settings or about multi-line fields (e.g. "Note"), enclosing every field in double quotes is a robust approach.

import std/[parsecsv,strutils], datamancer
var p: CsvParser
let content = """"One","Two","Three","Four"
"a1","2","a3","4"
"b10","20","b30","40"
"c100","200","c300","400"
"""
writeFile("temp.csv", content)
p.open("temp.csv")
#p.readHeaderRow()
while p.readRow():
  echo(join(p.row,","))
p.close()

#[double quotes are consumed when using std/parsecsv; echo displays this:

One,Two,Three,Four
a1,2,a3,4
b10,20,b30,40
c100,200,c300,400

]#


let csvfile = "temp.csv"
let df = readCsv(csvfile, quote = '"') #no better result with '\"'
echo(df.pretty())

#[double quotes are not consumed as expected; df.pretty() shows this for the values, not only for digit-only strings:

DataFrame with 4 columns and 3 rows:
     Idx          One          Two        Three         Four
  dtype:       string       string       string       string
       0         "a1"          "2"         "a3"          "4"
       1        "b10"         "20"        "b30"         "40"
       2       "c100"        "200"       "c300"        "400"
]#

# Furthermore, if a quoted field contains a newline (\n), parsing will fail.

Error: ambiguous call

Hi @Vindaar , I came across the following error:

/home/luis/Documents/OneDrive/Coding/Nim/Programs/allocator/testingmancer.nim(5, 18) Error: ambiguous call; both io.readCsv(fname: string, sep: char, header: string, skipLines: int, maxLines: int, toSkip: set[char], colNames: seq[string], skipInitialSpace: bool, quote: char, maxGuesses: int, lineBreak: char, eat: char) [proc declared in /home/luis/.nimble/pkgs/datamancer-0.3.11/datamancer/io.nim(615, 6)] and io_csv.read_csv(csvPath: string, skipHeader: bool, separator: char, quote: char) [proc declared in /home/luis/.nimble/pkgs/arraymancer-0.7.19/arraymancer/io/io_csv.nim(52, 6)] match for: (string)

While attempting to read the csv file:

date,investpercent,expenses,savings,low,high,yrate
2021-04-20,40.00,0.00,0.00,0.00,0.00,0.00
2021-05-23,40.00,0.00,0.00,0, 0, 0

Using the code:

import datamancer
import arraymancer


let df1 = readCsv("finances.csv")
echo df1

With nim version:

Nim Compiler Version 1.6.10 [Linux: amd64]
Compiled at 2022-11-21
Copyright (c) 2006-2021 by Andreas Rumpf

git hash: f1519259f85cbdf2d5ff617c6a5534fcd2ff6942
active boot switches: -d:release

Tried with read.csv in R and it reads correctly.

  • What am I doing wrong?
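
For what it's worth, a hedged workaround sketch (this is generic Nim overload disambiguation, not a documented datamancer recipe; it assumes qualified access through the re-exporting datamancer module resolves as usual):

import datamancer
import arraymancer

# qualify the call so the compiler knows which readCsv is meant;
# alternatively: import arraymancer except read_csv
let df1 = datamancer.readCsv("finances.csv")
echo df1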

rename docs and order

I have a header line that starts with '#' so I want to do:

var df = readCsv(tsv, sep='\t').rename(f{"mode" <- "#mode"})

This works, but the docs say to use ~, which does not work.

Another issue is that it changes the column order by adding the renamed column at the end (as expected given the use of an OrderedTable), but I would expect it to keep the column's position and only change its name.

I know it's early days, but just wanted to flag this as I saw it.

Thanks for the dataframe lib!

js target support?

Currently attempts to import datamancer when compiling to javascript fail. This is due to the way the package relies on arraymancer:


.nimble/pkgs2/arraymancer-0.7.27-7af6e290b723aead93067e1a52b0b369dd49cfbf/arraymancer/tensor/init_cpu.nim(235, 18) template/generic instantiation of `randomTensorCpu` from here
.nimble/pkgs2/arraymancer-0.7.27-7af6e290b723aead93067e1a52b0b369dd49cfbf/arraymancer/tensor/init_cpu.nim(208, 18) template/generic instantiation of `allocCpuStorage` from here
.nimble/pkgs2/arraymancer-0.7.27-7af6e290b723aead93067e1a52b0b369dd49cfbf/arraymancer/laser/tensor/datatypes.nim(93, 29) template/generic instantiation of `finalizer` from here
.nimble/pkgs2/arraymancer-0.7.27-7af6e290b723aead93067e1a52b0b369dd49cfbf/arraymancer/laser/tensor/datatypes.nim(71, 23) Error: attempting to call undeclared routine: 'deallocShared'

It's a shame that the DataFrame is not available for js. The API is nice and should be available independent of platform-specific concepts like memory management.

Constant column, object columns and native columns need better interop

Currently we are a bit too quick to convert constant columns into object columns. In many cases extending to a native column would certainly be enough, e.g. stacking two constant int columns on top of one another should yield an int column, not an object column.

The whole column assignment procedures need a thorough set of tests + possibly some rewrite (there's duplication in there etc).

Type deduction causes formula CT error

Reported by KosKosynsky on the #science channel (added steps for a minimal repro):

import datamancer, strutils

var flota = toDf({"x" : @["a", "b"], "y" : @["1", "2"], "z" : @[1, 3]})
let flota_cols = flota.getKeys()
for i_col in flota_cols:
  if (flota[i_col].nativeColKind == colString):
    flota = flota.mutate(f{string -> string: i_col ~ strip(idx(i_col))})

echo flota

which fails in the determineTypeFromProc helper, which tries to get type information from the default value of the last argument to strip (we shouldn't even have to look at that, but in any case it shouldn't fail!).

The issue is that we assume that the last field of a default nnkIdentDefs is a symbol, but it's not in this case. The offending node is:

IdentDefs
  Sym "leading"
  Empty
  Ident "true" 

from the params sequence:

  IdentDefs
    Sym "s"
    Sym "string"
    Empty
  IdentDefs
    Sym "leading"
    Empty
    Ident "true"
  IdentDefs
    Sym "trailing"
    Empty
    Ident "true"
  IdentDefs
    Sym "chars"
    BracketExpr
      Sym "set"
      Sym "char"
    Ident "Whitespace"

Instead of

        typ = pArg[pArg.len - 1].getType # use the default values type

we should check if the node is a type first before calling getType.

Question: what to do for idents?

Just find a way to fix it... :)
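
A minimal, self-contained sketch of that guard idea (the macro and demo proc names are made up; this is not the actual determineTypeFromProc code, just an illustration of checking the node kind before calling getType):

import std/macros

macro defaultOfLastParam(p: typed): untyped =
  ## look at the last parameter's default value and only call `getType`
  ## when the node can actually carry type information
  let impl = p.getImpl                # proc definition behind the symbol
  let params = impl.params            # nnkFormalParams
  let pArg = params[params.len - 1]   # last nnkIdentDefs
  let defNode = pArg[pArg.len - 1]    # its default value (may be Empty or an Ident)
  if defNode.kind notin {nnkEmpty, nnkIdent}:
    result = newLit(defNode.getType.repr)     # safe: the node is typed
  else:
    result = newLit("no usable type info")    # the open question for plain idents

proc demo(s: string, leading = true): string = s

echo defaultOfLastParam(demo)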
