I tried the Titanic training data, which can be downloaded from <a href="https://www.ka

eda.plot_missing: error when change column type about dataprep HOT 4 CLOSED

jinglinpeng commented on June 29, 2024

eda.plot_missing: error when change column type

Comments (4)

Waterpine commented on June 29, 2024

@jinglinpeng I am sorry for my delayed response. I also ran into this bug yesterday evening, and I am going to fix it next week. Thanks!

dylanzxc commented on June 29, 2024

I have located the two causes to this bug:

In /eda/missing/compute.py, line 130, in the missing_impact_1vn() function:

for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))

    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]

is_numerical(df0[col].dtype) returns False when a column is cast to "object" (its original type was int64), so range is wrongly left as None.
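The effect of the cast on a dtype-based check can be reproduced with pandas alone. This is a minimal sketch, not dataprep's code: `is_numeric_dtype` from pandas stands in for dataprep's `is_numerical()`, and `value_range` is a hypothetical helper that mirrors the `range` logic in the loop above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small stand-in for the Titanic PassengerId column.
ids = pd.Series([1, 2, 3, 4], name="PassengerId")
ids_as_object = ids.astype(object)


def value_range(srs):
    """Hypothetical helper: return (min, max) only for numeric dtypes, else None."""
    if is_numeric_dtype(srs.dtype):
        return (srs.min(), srs.max())
    return None


# The values are identical; only the dtype flag differs.
print(ids.dtype, value_range(ids))
print(ids_as_object.dtype, value_range(ids_as_object))
```

The object-typed copy yields `None` even though every value is still an integer, which matches the symptom described above.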

In /eda/missing/compute.py, line 34, in the histogram() function:

if is_numerical(srs.dtype):
    if range is not None:
        minimum, maximum = range
    else:
        minimum, maximum = srs.min(axis=0), srs.max(axis=0)
    minimum, maximum = dask.compute(minimum, maximum)

    assert (
        bins is not None
    ), "num_bins cannot be None if calculating numerical histograms"

    counts, edges = da.histogram(
        srs.to_dask_array(), bins, range=[minimum, maximum]
    )
    centers = (edges[:-1] + edges[1:]) / 2

    if not return_edges:
        return counts, centers
    return counts, centers, edges

elif is_categorical(srs.dtype):
    value_counts = srs.value_counts()
    counts = value_counts.to_dask_array()

The is_numerical(srs.dtype) check here also returns False, so the column is wrongly treated as categorical data. If we fix these two if statements, there will no longer be a KeyError in render.py.
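To make the branch selection concrete, here is a simplified, eager sketch of the same dispatch using NumPy and pandas instead of dask. `sketch_histogram` is a hypothetical name, `is_numeric_dtype` again stands in for `is_numerical()`, and the fallback branch is a simplification of the `is_categorical` branch.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_histogram(srs, bins=4):
    """Hypothetical, eager stand-in for the dtype dispatch in histogram()."""
    if is_numeric_dtype(srs.dtype):
        # Numerical branch: fixed number of bins, centers are bin midpoints.
        counts, edges = np.histogram(srs.to_numpy(), bins=bins)
        centers = (edges[:-1] + edges[1:]) / 2
        return "numerical", counts, centers
    # Fallback branch: one bar per distinct value.
    value_counts = srs.value_counts()
    return "categorical", value_counts.to_numpy(), value_counts.index.to_numpy()


ids = pd.Series(range(1, 9), name="PassengerId")
kind_int, counts_int, _ = sketch_histogram(ids)
kind_obj, counts_obj, _ = sketch_histogram(ids.astype(object))
print(kind_int, len(counts_int))
print(kind_obj, len(counts_obj))
```

With the object cast, the same eight values produce eight per-value bars instead of four bins, so downstream code receives a differently shaped result.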

However, there is an interesting interaction between is_numerical() and to_dask().

Without to_dask(), is_numerical() can detect whether an object-dtype column contains numbers. But once we use dd.from_pandas() to convert our pd.df to a dd.df, is_numerical() fails.
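How is_numerical() inspects an object column is not shown in this thread. Purely as an illustration, pandas offers `infer_dtype`, which looks at the stored values rather than the dtype flag; the sketch below assumes eager pandas Series and says nothing about how dask behaves.

```python
import pandas as pd
from pandas.api.types import infer_dtype, is_numeric_dtype

# Two object-dtype columns: one holds integers, the other holds strings.
ids = pd.Series([1, 2, 3]).astype(object)
names = pd.Series(["Braund", "Cumings", "Heikkinen"], dtype=object)

for srs in (ids, names):
    # The dtype flag is object for both; infer_dtype inspects the values.
    print(srs.dtype, is_numeric_dtype(srs.dtype), infer_dtype(srs))
```

A value-based check like this needs the data in memory, which may be one reason a lazy dask column cannot be inspected the same way.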

In addition, printing dtypes gives the same result for the object column with dd.df and pd.df.

dylanzxc commented on June 29, 2024

When we run df['PassengerId'] = df['PassengerId'].astype("object"), the column's type changes from int64 to object. Our code then treats the cast PassengerId feature as categorical data and generates the corresponding itmdt in compute.py. However, when we pass this itmdt into render.py:

if is_categorical(df["x"].dtype):
    radius = 0.99
    x_range = FactorRange(*df["x"].unique())
else:
    radius = df["x"][1] - df["x"][0]
    x_range = Range1d(df["x"].min() - radius, df["x"].max() + radius)

is_categorical(df["x"].dtype) actually returns False. Printing the dtypes shows:

PassengerId int64
Survived float64
Pclass float64
Name object
Sex object
SibSp float64
Parch float64
Ticket object
Fare float64
Cabin object
Embarked object

So PassengerId has somehow been restored to int64. As a result, the categorical feature PassengerId is treated as numerical again, and radius = df["x"][1] - df["x"][0] raises a KeyError, because the itmdt was built for a categorical feature but is used here as a numerical one.

Tracing back to compute.py, in the missing_impact_1vn function:

for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))

    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]
(hists,) = dd.compute(hists)

hists['PassengerId'][0][1].dtype is int64, which is wrong. However, if we remove dd.compute(hists) (the dask function similar to .collect() in Spark), hists['PassengerId'][0][1].dtype is object, which is correct.

I double-checked this against the histogram() function: it contains no dd.compute call, so centers.dtype (centers is part of the itmdt) is object.

You can use the following code to see how dd.compute() restores a column cast from int64 to object back to int64:

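import pandas as pd
import dask.dataframe as dd
# to_dask and is_pandas_categorical are dataprep helpers; import them from the dataprep package.

df = pd.read_csv('/Users/zhixuanchi/dataprep/dataprep/eda/missing/titanic/train.csv')
# Cast the int64 column to object before converting to a dask DataFrame.
df['PassengerId'] = df['PassengerId'].astype(object)
df = to_dask(df)

result = []
value_counts = df['PassengerId'].value_counts()
counts = value_counts.to_dask_array()
# print(counts.dtype)

if is_pandas_categorical(value_counts.index.dtype):
    centers = value_counts.index.astype("str").to_dask_array()
else:
    centers = value_counts.index.to_dask_array()
    print(dd.compute(centers)[0].dtype)  # dtype after computing
    print(centers.dtype)                 # dtype reported by the lazy array
result.append((counts, centers))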

The result is like this:

jinglinpeng commented on June 29, 2024

@dylanzxc Good catch! Can we cast the type to the original type before dd.compute?
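One possible reading of this suggestion is to remember the dtype the lazy array reports and re-apply it to the materialized values. The sketch below shows only the re-cast step with NumPy; `cast_to_declared` is a hypothetical helper, the arrays are made-up stand-ins for `centers`, and whether the cast belongs before or after dd.compute is not settled in this thread.

```python
import numpy as np


def cast_to_declared(computed, declared_dtype):
    """Hypothetical helper: re-cast a computed array to the dtype declared before compute."""
    declared = np.dtype(declared_dtype)
    if computed.dtype == declared:
        return computed
    return computed.astype(declared)


declared_dtype = np.dtype(object)  # what the lazy array reported
computed = np.array([3, 1, 2])     # stand-in for what came back with an integer dtype
centers = cast_to_declared(computed, declared_dtype)
print(computed.dtype, "->", centers.dtype)
```

With the dtype restored to object, a dtype-based check in the rendering step would see the same type that the compute step used when it built the intermediate.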
