
Comments (4)

Waterpine avatar Waterpine commented on June 29, 2024

@jinglinpeng Sorry for the delayed response. I also ran into this bug yesterday evening, and I am going to fix it next week. Thanks!

from dataprep.

dylanzxc avatar dylanzxc commented on June 29, 2024

I have located the two causes of this bug:

In /eda/missing/compute.py, line 130, in the missing_impact_1vn() function:

for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))

    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]

is_numerical(df0[col].dtype) returns False when a column has been cast to "object" (its original type was int64), so range is left as None, which is wrong for this column.

In /eda/missing/compute.py, line 34, in the histogram() function:

if is_numerical(srs.dtype):
    if range is not None:
        minimum, maximum = range
    else:
        minimum, maximum = srs.min(axis=0), srs.max(axis=0)
    minimum, maximum = dask.compute(minimum, maximum)

    assert (
        bins is not None
    ), "num_bins cannot be None if calculating numerical histograms"

    counts, edges = da.histogram(
        srs.to_dask_array(), bins, range=[minimum, maximum]
    )
    centers = (edges[:-1] + edges[1:]) / 2

    if not return_edges:
        return counts, centers
    return counts, centers, edges

elif is_categorical(srs.dtype):
    value_counts = srs.value_counts()
    counts = value_counts.to_dask_array()

The is_numerical(srs.dtype) check here also returns False, so the column is wrongly treated as categorical data. If we fix these two if statements, there will no longer be a key error in render.py.
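The first cause can be reproduced without dataprep. The sketch below is only an illustration: it uses pandas' is_numeric_dtype as a stand-in for dataprep's is_numerical (which may behave differently), and pick_range is a hypothetical helper that mirrors the range selection quoted above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def pick_range(series):
    """Hypothetical helper: choose a histogram range using a dtype-only check."""
    if is_numeric_dtype(series.dtype):
        return (series.min(), series.max())
    # An object-dtype column lands here even when every value is an int.
    return None


ids = pd.Series([1, 2, 3, 4])           # int64 column
as_int = pick_range(ids)                 # a (min, max) tuple
as_obj = pick_range(ids.astype(object))  # None, because the dtype is object

print(as_int, as_obj)
```

Under these assumptions, the same values yield a usable range or None depending only on the declared dtype.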

However, there is something interesting about is_numerical() and to_dask().
image

image

Without to_dask(), is_numerical() can detect whether an object-dtype column contains numbers. But if we use dask.from_pandas() to convert our pandas DataFrame (pd.df) to a dask DataFrame (dd.df), is_numerical() fails.

In addition, printing dtypes gives the same result for object columns in both the dask DataFrame and the pandas DataFrame:
image
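Since the declared dtype is object in both cases, one way to tell the columns apart is to inspect the values rather than the dtype. The following is a minimal sketch, not dataprep's implementation: holds_numbers and NUMERIC_KINDS are hypothetical names, and it relies on pandas.api.types.infer_dtype, which needs materialized values (with dask this would mean computing at least a sample first).

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Labels returned by infer_dtype that we choose to treat as numeric (an assumption).
NUMERIC_KINDS = {"integer", "floating", "mixed-integer-float"}


def holds_numbers(series):
    """Hypothetical helper: True if the values look numeric, whatever the dtype says."""
    return infer_dtype(series, skipna=True) in NUMERIC_KINDS


ids = pd.Series([1, 2, 3], dtype=object)          # ints stored as object
names = pd.Series(["a", "b", "c"], dtype=object)  # real strings

print(holds_numbers(ids), holds_numbers(names))
```

Whether an explicitly cast object column should be treated as numerical at all is a separate design decision; the sketch only shows that the information is recoverable from the values.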


dylanzxc avatar dylanzxc commented on June 29, 2024

When we do df['PassengerId'] = df['PassengerId'].astype("object"), we change its type from int64 to object. In our code, the cast PassengerId feature is treated as categorical data, and compute.py generates the corresponding itmdt. However, when we pass this itmdt into render.py:

if is_categorical(df["x"].dtype):
    radius = 0.99
    x_range = FactorRange(*df["x"].unique())
else:
    radius = df["x"][1] - df["x"][0]
    x_range = Range1d(df["x"].min() - radius, df["x"].max() + radius)

is_categorical(df["x"].dtype) actually returns False. If we print out the dtypes, we can see:

PassengerId    int64
Survived       float64
Pclass         float64
Name           object
Sex            object
SibSp          float64
Parch          float64
Ticket         object
Fare           float64
Cabin          object
Embarked       object

So PassengerId has somehow been restored to int64. As a result, the categorical feature PassengerId is treated as numerical again, and radius = df["x"][1] - df["x"][0] raises a key error, because the itmdt was built for a categorical feature but is used as a numerical one here.

Tracing back to compute.py, in the missing_impact_1vn function:

for col in cols:
    range = None  # pylint: disable=redefined-builtin
    if is_numerical(df0[col].dtype):
        range = (df0[col].min(axis=0), df0[col].max(axis=0))

    hists[col] = [
        histogram(df[col], bins=bins, return_edges=True, range=range)
        for df in [df0, df1]
    ]
(hists,) = dd.compute(hists)


hists['PassengerId'][0][1].dtype is int64, which is wrong. However, if we remove the dd.compute(hists) call (the dask function similar to .collect() in Spark), hists['PassengerId'][0][1].dtype is object, which is correct.
image
image

I have double-checked this against the histogram() function. It has no dd.compute call, so centers.dtype (centers is part of the itmdt) is object.
image

You can try the following code to see how dd.compute() restores an object column (cast from int64) back to int64:

import pandas as pd
import dask.dataframe as dd
# to_dask and is_pandas_categorical are dataprep helpers; import them from the dataprep source tree.

df = pd.read_csv('/Users/zhixuanchi/dataprep/dataprep/eda/missing/titanic/train.csv')
df['PassengerId'] = df['PassengerId'].astype(object)  # int64 -> object
df = to_dask(df)

result = []
value_counts = df['PassengerId'].value_counts()
counts = value_counts.to_dask_array()
# print(counts.dtype)

if is_pandas_categorical(value_counts.index.dtype):
    centers = value_counts.index.astype("str").to_dask_array()
else:
    centers = value_counts.index.to_dask_array()
    print(dd.compute(centers)[0].dtype)  # dtype after computing
    print(centers.dtype)                 # dtype of the lazy dask array
result.append((counts, centers))

The result is like this:
image


jinglinpeng avatar jinglinpeng commented on June 29, 2024

@dylanzxc Good catch! Can we cast the type to the original type before dd.compute?
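One possible reading of this suggestion is to record the dtype that the lazy array reports and reapply it to the computed result. The sketch below is an assumption about how that could look, not the fix adopted in dataprep: restore_dtype is a hypothetical helper, and a plain NumPy array stands in for what dd.compute returned.

```python
import numpy as np


def restore_dtype(values, expected_dtype):
    """Hypothetical helper: cast computed values back to the dtype recorded earlier."""
    values = np.asarray(values)
    if values.dtype != expected_dtype:
        values = values.astype(expected_dtype)
    return values


expected = np.dtype(object)      # dtype reported by the lazy array, e.g. centers.dtype
computed = np.array([3, 1, 2])   # stand-in for the integer array returned after compute
centers = restore_dtype(computed, expected)

print(centers.dtype)
```

With the dtype restored, a downstream dtype check such as the one in render.py would again see an object column.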

