Comments (6)
Yes, I understand that.
The documentation is on the one hand clearly still lacking and on the other everything related to object
columns is not the most intuitive. The name Value
is older than the object
column (which I adopted for familiarity with pandas & numpy). And %~
comes from a similarity to the %
operator for Nim's JsonNode
.
Neither is great. I'm not sure about better ways though. I'm all ears for proposals. toValue/toObject
etc. could of course be added, but are more verbose.
For the specific use cases of comparisons, I suppose I could add overloads to ==
for Value
with native types. That at least would hide some of the behind the scenes stuff. I didn't want to jump headlong into using converters
etc. though, because I might regret it later.
It's a good idea for an additional section in the data wrangling tutorial for sure though! I've opened an issue:
for that.
from datamancer.
Apparently there are other situations where this happens... I mean the unexpected 0. Not only in the last column.
For instance if I load on df a CSV like this:
w,x,y,z
1,10,0.1,100
2,ERR,inf,200
NaN,N/A,0.3,300
4,40,0.4,400
echo df.pretty()
Dataframe with 4 columns and 4 rows:
Idx w x y z
dtype: object object object int
0 1 10 0.1 100
1 2 0 inf 200
2 NaN N/A 0.3 300
3 4 40 0.4 400
i.e. ERR gets replaced with 0 (or better, it stays to init'ed int value if I got your point correctly), while NaN or N/A are "just fine".
So possibly the issue to address is a bit broader, and not related to just the last column missing values. If some values, for whatever reason, are not mapped (ERR is "special" case for the parser I guess... possibly making something to fail?), are kept to... 0. I faced this issue because I was trying to understand how to perform "datacleaning" with datamancer.
I know how to "replace" inf or ERR with NaN in pandas... (ERR maybe is uncommon, but 'inf' or 'NaN' are not).
I would do something like this to remap ERR on NaN on 'x' column:
df = df.replace("ERR","NaN")
df["x"] = df["x"].astype(float) #NaN is float type for Python, unless using Int64 (with "I" capital case)
I have still to figure it out IF and HOW I can do the same on datamancer.
Thank you.
from datamancer.
So, I've addressed the things that are broken in my opinion.
What I mean is: explicit appearing of NaN
and Inf
should be parsed correctly into NaN
or Inf
. Missing values are parsed as NaN
(note: this implies an integer column with missing values, will be turned into a float column!).
However, the case of ERR
and N/A
you bring up, are irrelevant in my opinion. These are just human ways to indicate something. They are turned into string values of the corresponding entry.
Note that indeed, the value of ERR
in particular was buggy, due to starting with E
(as E
can indicate an exponent in a float). That bug is fixed. So your given DF is now turned into:
Dataframe with 4 columns and 4 rows:
Idx w x y z
dtype: float object float int
0 1 10 0.1 100
1 2 ERR inf 200
2 nan N/A 0.3 300
3 4 40 0.4 400
I think this is reasonable.
To be honest, I'm not the biggest fan of having to convert the first column here into a float column, but it is what it is.
And sure, data cleaning is a very common task. But keep in mind that there is a reason that in the R community there is a whole package just for this, tidyr:
https://tidyr.tidyverse.org/
In any case, with the DF as it stands now, you can apply rules to the object
column to clean it up as you see fit.
from datamancer.
Hello Vindaar, that's excellent (it's more than reasonable). The new layout you show is exactly the one I expected and quite the same Python/Pandas (I'm not an R guy... shame on me) would have produced.
NaN and INF implies float in Pandas too (Pandas introduced Int64 "nullable integer" at certain point... but also Pandas' read_csv doesn't dare to use it as default and I understood that using it could pose risk of breaking some code / causing weird stuff).
ERR and N/A are definitely expected to force the whole DF column (in Pandas is called Series instead of Tensor) to be an object datatype (basically a string), Pandas does make the same. Here I have one newby question on datamancer: then what would be the most idiomatic / efficient way to replace (mutate in place if it makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from str to int or float? Thank you. You are doing a great job.
from datamancer.
then what would be the most idiomatic / efficient way to replace (mutate in place if it makes sense) e.g. 'N/A' and 'ERR' with '0' and then turn the whole DF column from str to int or float?
What I would personally do here is the following:
df = df.mutate(f{Value -> int: "x" ~
(if `x` == %~ "ERR" or `x` == %~ "N/A":
0
else:
`x`.toInt)})
An explanation:
mutate
takes a formula that either adds a new column or overwrites an existing one- we give it information about the type
Value -> int
, meaning input columns are read asValue
(that's what anobject
column actually is) and the output shall beint
- we overwrite the
x
column, hence"x"
is left of the~
~
indicates we have a formula that creates a new full column- then comes the actual body of the formula. Important note here: because we have a more complex expression, we need the parenthesis around the
if
. Otherwise the compiler will complain - in the body we simply refer to the column we access using back ticks
- because it's a formula with
~
, it means the code will run for each element of the column, hence the back tickedx
is each element of the columnx
- we compare using the
%~
operator, which is the (maybe a bit weird) operator to convert a regular type into aValue
- if it's either of our "bad" values we return 0
- else we return the existing value, but take out the existing integer from the
Value
by usingtoInt
See the documentation of the Value
here:
https://scinim.github.io/Datamancer/value.html
There are other ways to do this of course, but this would be the most "idiomatic" if you will. Given the use case though, doing it manually by getting the tensor using df["x", Value]
and using map_inline
or something and overwriting manually via df["x"] = df["x", Value].map_inline(...)
or similar could be more intuitive however.
from datamancer.
Thank you Vindaar. Illuminating. I tried some initial experiments in the same direction, but I struggled with the type to use with object (Value... I see... not string) and I've would never tought to use ~ or %~ that way. Explained examples like this are a goldmine for datamancer newcomers. This one to me is worth of a further chapter in datamancer data wrangling tutorial (maybe is there and I've just missed), even because replacement (even if in casis like this, often "filtering out" could be more appropriate) and data type conversion are routinary operations when "cleaning" data.
from datamancer.
Related Issues (12)
- datamancer breaks nim CI HOT 4
- Constant column, object columns and native columns need better interop HOT 1
- Hash collisions in large DFs
- Parsing CSV files without header is problematic
- Type deduction causes formula CT error HOT 2
- Lifting operations out of loops not working on accented quotes
- Error: ambiguous call HOT 1
- --panics:on causes dataframe test failure HOT 1
- readCsv does not consider quotes for values as expected HOT 9
- rename docs and order HOT 5
- parsing CSV file with 2 empty newlines at end fails HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datamancer.