Comments (4)
@jinglinpeng Sorry for the delayed response. I also ran into this bug yesterday evening, and I am going to fix it next week. Thanks!
from dataprep.
I have located the two causes of this bug:
In /eda/missing/compute.py, line 130, in the missing_impact_1vn() function:
for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))
    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]
The check is_numerical(df0[col].dtype) returns False when a column has been cast to "object" (its original type was int64), so range is incorrectly left as None.
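The effect of a dtype-only check can be reproduced with plain pandas. This is a minimal sketch, not dataprep's code: pandas' is_numeric_dtype stands in for is_numerical (an assumption), and the small frame and the ranges dict are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for df0; the values are made up.
df0 = pd.DataFrame({"PassengerId": [1, 2, 3], "Fare": [7.25, 71.28, 8.05]})
df0["PassengerId"] = df0["PassengerId"].astype("object")

ranges = {}
for col in df0.columns:
    value_range = None
    # is_numeric_dtype is an assumed stand-in for dataprep's is_numerical.
    if is_numeric_dtype(df0[col].dtype):
        value_range = (df0[col].min(), df0[col].max())
    ranges[col] = value_range

print(ranges)
```

The object column never enters the numeric branch, so its range stays None even though every value in it is an integer.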
In /eda/missing/compute.py, line 34, in the histogram() function:
if is_numerical(srs.dtype):
    if range is not None:
        minimum, maximum = range
    else:
        minimum, maximum = srs.min(axis=0), srs.max(axis=0)
        minimum, maximum = dask.compute(minimum, maximum)
    assert (
        bins is not None
    ), "num_bins cannot be None if calculating numerical histograms"
    counts, edges = da.histogram(
        srs.to_dask_array(), bins, range=[minimum, maximum]
    )
    centers = (edges[:-1] + edges[1:]) / 2
    if not return_edges:
        return counts, centers
    return counts, centers, edges
elif is_categorical(srs.dtype):
    value_counts = srs.value_counts()
    counts = value_counts.to_dask_array()
The check if is_numerical(srs.dtype) here also returns False, so the column is wrongly treated as categorical data. If we fix those two if statements, there will no longer be a key error in render.py.
However, there is an interesting interaction between is_numerical() and to_dask(). Without to_dask(), is_numerical() can detect whether an object-dtype column contains numbers. But if we use dask.from_pandas() to convert our pandas DataFrame to a dask DataFrame, is_numerical() fails.
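One plausible explanation, offered as an assumption and not as a confirmed reading of dataprep's is_numerical(), is the difference between checking a dtype and inspecting the values. The sketch below uses only pandas to show that the two kinds of check disagree on an object column holding integers.

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_numeric_dtype

# An int64 column cast to object, as in the issue.
ids = pd.Series([1, 2, 3], name="PassengerId").astype("object")

# Dtype-level check: looks only at the declared dtype.
dtype_says_numeric = is_numeric_dtype(ids.dtype)

# Value-level check: inspects the stored Python objects.
values_look_like = infer_dtype(ids)

print(dtype_says_numeric, values_look_like)
```

A value-level check needs the data in memory. A lazy dask series exposes its declared dtype without computing, which may be why the result changes after the conversion.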
In addition, printing dtypes gives the same result for the object column in both the dask DataFrame and the pandas DataFrame.
from dataprep.
When we do df['PassengerId'] = df['PassengerId'].astype("object"), we change its type from int64 to object. In our code this cast feature, PassengerId, is treated as categorical data, and the corresponding itmdt is generated in compute.py. However, when we pass this itmdt into render.py:
if is_categorical(df["x"].dtype):
    radius = 0.99
    x_range = FactorRange(*df["x"].unique())
else:
    radius = df["x"][1] - df["x"][0]
    x_range = Range1d(df["x"].min() - radius, df["x"].max() + radius)
the check is_categorical(df["x"].dtype) actually returns False. If we print out the dtypes, we can see:
PassengerId      int64
Survived       float64
Pclass         float64
Name            object
Sex             object
SibSp          float64
Parch          float64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
So PassengerId has somehow been restored to int64. As a result, the categorical feature PassengerId is treated as a numerical feature again, and radius = df["x"][1] - df["x"][0] raises a key error, because the itmdt was built for a categorical feature but is used here as a numerical one.
We trace back to compute.py, in the function missing_impact_1vn:
for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))
    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]
(hists,) = dd.compute(hists)
Here hists['PassengerId'][0][1].dtype is int64, which is wrong. However, if we remove dd.compute(hists), the dask function similar to .collect() in Spark, then hists['PassengerId'][0][1].dtype is object, which is correct.
I have double-checked this with the histogram() function: it does not call dd.compute, so centers.dtype (centers is part of the itmdt) is object.
You can try the following code to see how dd.compute() restores an object column (cast from int64) back to int64:
import dask.dataframe as dd
import pandas as pd
# to_dask and is_pandas_categorical are dataprep helpers; import them from the dataprep source tree.

df = pd.read_csv('/Users/zhixuanchi/dataprep/dataprep/eda/missing/titanic/train.csv')
df['PassengerId'] = df['PassengerId'].astype(object)
df = to_dask(df)
result = []
value_counts = df['PassengerId'].value_counts()
counts = value_counts.to_dask_array()
# print(counts.dtype)
if is_pandas_categorical(value_counts.index.dtype):
    centers = value_counts.index.astype("str").to_dask_array()
else:
    centers = value_counts.index.to_dask_array()
print(dd.compute(centers)[0].dtype)  # dtype after computing
print(centers.dtype)  # dtype declared on the lazy array
result.append((counts, centers))
from dataprep.
@dylanzxc Good catch! Can we cast the type to the original type before dd.compute?
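One way to read this suggestion is to remember the dtype declared on the lazy array and cast the computed result back to it. The sketch below is only an illustration under that assumption: restore_dtype is a hypothetical helper, not part of dataprep or dask, and a plain NumPy array stands in for the computed result so that the example runs without dask.

```python
import numpy as np
import pandas as pd


def restore_dtype(computed, declared_dtype):
    """Cast a computed array back to the dtype declared before computing.

    Hypothetical helper for illustration only.
    """
    arr = np.asarray(computed)
    if arr.dtype != declared_dtype:
        arr = arr.astype(declared_dtype)
    return arr


# dtype the column had after astype(object), recorded before computing
declared = pd.Series([1, 2, 3], dtype="object").dtype

# stand-in for a computed result that came back as int64
computed = np.array([1, 2, 3], dtype="int64")

restored = restore_dtype(computed, declared)
print(restored.dtype)
```

With dask, the declared dtype would be read from the lazy array (for example centers.dtype) and the cast applied to the output of dd.compute, so that render.py sees the same dtype that compute.py used when it built the itmdt.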
from dataprep.