
Comments (6)

Vindaar commented on May 20, 2024

Yes, I understand that.

On the one hand, the documentation is clearly still lacking; on the other, everything related to object columns is not the most intuitive. The name Value is older than the term "object column" (which I adopted for familiarity with pandas & numpy), and %~ comes from its similarity to the % operator for Nim's JsonNode.
Neither is great, but I'm not sure about better ways. I'm all ears for proposals. toValue/toObject etc. could of course be added, but they are more verbose.

For the specific use cases of comparisons, I suppose I could add overloads to == for Value with native types. That at least would hide some of the behind the scenes stuff. I didn't want to jump headlong into using converters etc. though, because I might regret it later.

It's a good idea for an additional section in the data wrangling tutorial for sure though! I've opened an issue for that: SciNim/getting-started#37

from datamancer.

smurgle commented on May 20, 2024

Apparently there are other situations where the unexpected 0 shows up, not only in the last column.
For instance, if I load a CSV like this into df:
w,x,y,z
1,10,0.1,100
2,ERR,inf,200
NaN,N/A,0.3,300
4,40,0.4,400

echo df.pretty()

Dataframe with 4 columns and 4 rows:
       Idx         w         x         y         z
    dtype:    object    object    object       int
         0         1        10       0.1       100
         1         2         0       inf       200
         2       NaN       N/A       0.3       300
         3         4        40       0.4       400

i.e. ERR gets replaced with 0 (or rather, it keeps the value the int was initialized with, if I got your point correctly), while NaN and N/A are "just fine".
So the issue to address is possibly a bit broader, and not limited to missing values in the last column. If some values, for whatever reason, are not mapped (ERR is a "special" case for the parser, I guess, possibly making something fail?), they are left at 0. I ran into this issue while trying to understand how to do data cleaning with datamancer.
I know how to replace inf or ERR with NaN in pandas (ERR may be uncommon, but inf and NaN are not).
I would do something like this to remap ERR to NaN in the 'x' column:

df = df.replace("ERR", "NaN")
df["x"] = df["x"].astype(float)  # NaN is a float in Python, unless the nullable Int64 dtype (capital "I") is used

I still have to figure out IF and HOW I can do the same in datamancer.
Thank you.


Vindaar commented on May 20, 2024

So, I've addressed the things that are broken in my opinion.

What I mean is: explicit occurrences of NaN and Inf should be parsed correctly into NaN or Inf. Missing values are parsed as NaN (note: this implies that an integer column with missing values will be turned into a float column!).

However, the cases of ERR and N/A that you bring up are irrelevant in my opinion. These are just human ways to indicate something, so they are kept as string values in the corresponding entries.

Note that the handling of ERR in particular was indeed buggy, because it starts with E (and E can indicate an exponent in a float). That bug is fixed, so your DF is now parsed as:

Dataframe with 4 columns and 4 rows:
       Idx         w         x         y         z
    dtype:     float    object     float       int
         0         1        10       0.1       100
         1         2       ERR       inf       200
         2       nan       N/A       0.3       300
         3         4        40       0.4       400

I think this is reasonable.
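
As a rough illustration of the rule described above (plain integers stay int, NaN/inf or missing values force a float column, anything non-numeric forces an object column), here is a minimal Python sketch. It is not datamancer's actual parser; the names parse_cell and infer_column are made up for this example, and Python's own int()/float() stand in for the real parsing logic.

```python
import math


def parse_cell(text):
    """Parse one CSV cell: int if possible, else float, else the raw string."""
    s = text.strip()
    if s == "":
        return math.nan  # a missing value is treated as NaN
    try:
        return int(s)
    except ValueError:
        pass
    try:
        # float() accepts "NaN" and "inf", but rejects "ERR" and "N/A"
        return float(s)
    except ValueError:
        return s  # keep human-readable markers as strings


def infer_column(cells):
    """Return (dtype_name, parsed_values) for one column of raw strings."""
    values = [parse_cell(c) for c in cells]
    if any(isinstance(v, str) for v in values):
        return "object", values  # mixed content: keep each value as parsed
    if any(isinstance(v, float) for v in values):
        return "float", [float(v) for v in values]  # NaN/inf force float
    return "int", values


# The CSV from earlier in the thread, column by column.
columns = {
    "w": ["1", "2", "NaN", "4"],
    "x": ["10", "ERR", "N/A", "40"],
    "y": ["0.1", "inf", "0.3", "0.4"],
    "z": ["100", "200", "300", "400"],
}
dtypes = {name: infer_column(cells)[0] for name, cells in columns.items()}
# dtypes == {"w": "float", "x": "object", "y": "float", "z": "int"}
```

The resulting dtypes match the dtype row of the table above; the single NaN in w is what turns an otherwise integer column into a float column.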

To be honest, I'm not the biggest fan of having to convert the first column here into a float column, but it is what it is.

And sure, data cleaning is a very common task. But keep in mind that there is a reason that in the R community there is a whole package just for this, tidyr:
https://tidyr.tidyverse.org/

In any case, with the DF as it stands now, you can apply rules to the object column to clean it up as you see fit.


smurgle commented on May 20, 2024

Hello Vindaar, that's excellent (it's more than reasonable). The new layout you show is exactly what I expected, and nearly the same as what Python/pandas would have produced (I'm not an R guy... shame on me).
NaN and inf imply float in pandas too (pandas introduced the Int64 "nullable integer" dtype at some point, but even pandas' read_csv doesn't use it by default, and I understand that using it could risk breaking some code or causing weird behavior).
ERR and N/A are definitely expected to force the whole DF column (called a Series in pandas, rather than a Tensor) to the object datatype (basically a string); pandas does the same. Here I have one newbie question on datamancer: then what would be the most idiomatic / efficient way to replace (mutate in place if it makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from str to int or float? Thank you. You are doing a great job.


Vindaar commented on May 20, 2024

> then what would be the most idiomatic / efficient way to replace (mutate in place if it makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from str to int or float?

What I would personally do here is the following:

df = df.mutate(f{Value -> int: "x" ~
  (if `x` == %~ "ERR" or `x` == %~ "N/A":
    0
  else:
    `x`.toInt)})

An explanation:

  • mutate takes a formula that either adds a new column or overwrites an existing one
  • we give it information about the type Value -> int, meaning input columns are read as Value (that's what an object column actually is) and the output shall be int
  • we overwrite the x column, hence "x" is left of the ~
  • ~ indicates we have a formula that creates a new full column
  • then comes the actual body of the formula. Important note here: because we have a more complex expression, we need the parentheses around the if. Otherwise the compiler will complain
  • in the body we simply refer to the column we access using backticks
  • because it's a formula with ~, the code will run for each element of the column, hence the backticked x is each element of the column x
  • we compare using the %~ operator, which is the (maybe a bit weird) operator to convert a regular type into a Value
  • if it's either of our "bad" values we return 0
  • else we return the existing value, but take out the existing integer from the Value by using toInt
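
For readers coming from Python, the per-element logic of the formula above corresponds roughly to the plain-Python sketch below. It only mirrors the branching; it does not use datamancer, and the names BAD_VALUES and clean_cell are made up for this example.

```python
# Markers that should be replaced by 0, as in the formula above.
BAD_VALUES = {"ERR", "N/A"}


def clean_cell(value):
    """Return 0 for a bad marker, otherwise the value converted to int."""
    if value in BAD_VALUES:
        return 0
    return int(value)


# An object-like column: integers mixed with string markers.
x = [10, "ERR", "N/A", 40]

# Apply the rule to every element, like a formula using ~ does.
cleaned = [clean_cell(v) for v in x]
# cleaned == [10, 0, 0, 40]
```

The list comprehension plays the role of ~ (one result per row), and int(value) plays the role of toInt.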

See the documentation of the Value here:
https://scinim.github.io/Datamancer/value.html

There are other ways to do this of course, but this would be the most "idiomatic" if you will. Given the use case, though, doing it manually could be more intuitive: get the tensor using df["x", Value], apply map_inline or something similar, and overwrite the column via df["x"] = df["x", Value].map_inline(...).


smurgle commented on May 20, 2024

Thank you Vindaar. Illuminating. I tried some initial experiments in the same direction, but I struggled with the type to use with object (Value... I see... not string) and I would never have thought to use ~ or %~ that way. Explained examples like this are a goldmine for datamancer newcomers. To me, this one deserves its own chapter in the datamancer data wrangling tutorial (maybe it's there and I just missed it), because replacement (even if in cases like this "filtering out" could often be more appropriate) and data type conversion are routine operations when "cleaning" data.
