Comments (6)
cool, this I think can be handled easily at the decorator level. Will do this weekend
from pandera.
@mastersplinter this should do it: https://github.com/cosmicBboy/pandera/pull/29
The error message leaves a lot to be desired, among other things the fact that it's handled multiple times via Python's error chaining feature.
I'd support including the `dataframe.name` in the error message if it's not None, though it would be nice if the error message were a bit cleaner so it would be clear where it's coming from (the offending dataframe is in the last traceback):
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df) <<< this dataframe
At least the way I use dataframes, I often don't name them... is this common practice? In any case, I think two things would help:
- figuring out how to raise only a single SchemaError (I think I have an idea on how to do this)
- including the `dataframe.name` in the error message if it's set.
Edit: I assumed that dataframes had an optional `name` attribute, but it turns out they don't. I'd prefer not to rely on a user-provided custom attribute... otherwise I'm unaware of a way of getting the variable name of the dataframe object.
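A quick check confirms this: pandas DataFrames have no `name` attribute (attribute access falls back to column lookup), while each Series does carry one. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# DataFrame objects have no `name` attribute...
print(hasattr(df, "name"))  # False

# ...but each Series (column) does
print(df["a"].name)  # a
```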
I was, however, able to refactor the error-handling logic to produce a cleaner error message:
In [1]: import pandas as pd
   ...: import numpy as np
   ...: from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int
   ...:
   ...: schema = DataFrameSchema(
   ...:     {
   ...:         "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
   ...:     })
   ...:
   ...: df = pd.DataFrame({
   ...:     "a": [1, 2, np.nan]
   ...: })
   ...:
   ...: schema.validate(df)
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
12 })
13
---> 14 schema.validate(df)
~/git/pandera/pandera/pandera.py in validate(self, dataframe)
180 return self.schema.validate(dataframe)
181 except SchemaError as e:
--> 182 raise SchemaError(str(e)) from None
183
184
SchemaError: non-nullable series 'a' contains null values: {2: nan}
From this it's obvious that line 14 contains the offending dataframe, which I think gets to the spirit of this issue. The downside is that it obscures the part of the traceback showing where the error originally occurred in the source.
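The cleaner message comes from the `raise ... from None` pattern visible in the traceback above, which suppresses Python's exception chaining. A minimal standalone sketch (the `SchemaError` class here is a stand-in, not pandera's):

```python
class SchemaError(Exception):
    pass

def validate():
    try:
        # inner validation raises the original error
        raise SchemaError("non-nullable series 'a' contains null values: {2: nan}")
    except SchemaError as e:
        # `from None` suppresses the "During handling of the above exception..."
        # chained traceback, so only one error is shown to the user
        raise SchemaError(str(e)) from None

try:
    validate()
except SchemaError as e:
    print(e)
    print(e.__suppress_context__)  # True: chaining was suppressed
```

The trade-off is exactly the one noted above: suppressing the chain also hides the inner frames where the failure was first detected.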
After hacking around a little bit and reading through this pandas issue, it seems like including the dataframe name in the error message is not the way to go, and explicitly providing some sort of identifier to the `validate` call seems like quite a burden on the user, especially since the last traceback message does point to the invalidated dataframe:
SchemaError Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)
/panderadev/pandera/pandera.py in validate(self, dataframe)
176 if not isinstance(dataframe, pd.DataFrame):
177 raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178 return self.schema.validate(dataframe)
179
180
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
393 return s.validate(data)
394 except SchemaError as x:
--> 395 raise SchemaError([None] + x.autos, [e] + x.errors)
396 except BaseException as x:
397 message = "%r.validate(%r) raised %r" % (s, data, x)
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
I do agree it's kinda sucky 'cause it's like searching for a needle in a small haystack :/
It also seems like `schema` takes a lot of trouble catching `SchemaError`s (as it should), and to me the best solution would be to produce a clear error message, which would mean using `schema` only at the `Check` level (or decoupling pandera from `schema` altogether).
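For illustration, the difference could look something like this: instead of letting a wrapper library re-wrap and accumulate errors, the check itself raises a single `SchemaError` with the full context. This is a hypothetical sketch, not pandera's actual implementation:

```python
class SchemaError(Exception):
    pass

def check_nullable(series_name, values):
    # NaN is the only float that is not equal to itself
    nulls = {i: v for i, v in enumerate(values) if v != v}
    if nulls:
        raise SchemaError(
            "non-nullable series %r contains null values: %s" % (series_name, nulls))

check_nullable("a", [1.0, 2.0])  # passes silently
try:
    check_nullable("a", [1.0, 2.0, float("nan")])
except SchemaError as e:
    print(e)  # non-nullable series 'a' contains null values: {2: nan}
```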
@mastersplinter it was easier to remove the `schema` dependency than I thought :)
after https://github.com/cosmicBboy/pandera/pull/27, the error message (executing the code in the issue description) is much easier to read:
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
12 })
13
---> 14 schema.validate(df)
~/git/pandera/pandera/pandera.py in validate(self, dataframe)
172 if self.index is not None:
173 schema_elements += [self.index]
--> 174 assert all(s(dataframe) for s in schema_elements)
175 if self.transformer is not None:
176 dataframe = self.transformer(dataframe)
~/git/pandera/pandera/pandera.py in <genexpr>(.0)
172 if self.index is not None:
173 schema_elements += [self.index]
--> 174 assert all(s(dataframe) for s in schema_elements)
175 if self.transformer is not None:
176 dataframe = self.transformer(dataframe)
~/git/pandera/pandera/pandera.py in __call__(self, df)
332 raise RuntimeError(
333 "need to `set_name` of column before calling it.")
--> 334 return super(Column, self).__call__(df[self._name])
335
336 def __repr__(self):
~/git/pandera/pandera/pandera.py in __call__(self, series)
222 raise SchemaError(
223 "non-nullable series contains null values: %s" %
--> 224 series[nulls].head(N_FAILURE_CASES).to_dict()
225 )
226
SchemaError: non-nullable series contains null values: {2: nan}
@cosmicBboy this is great. Your changes enable faster debugging when using `schema.validate`.
An issue still remains when using the `check_input` and `check_output` decorators on a function.
Take this example:
import pandas as pd
import numpy as np
from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int
from pandera import check_input, check_output

# Define input and output schemas
schema_in = DataFrameSchema(
    {
        "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
    })
schema_out = DataFrameSchema(
    {
        "a": Column(PandasDtype.Float, Check(lambda x: x > 0))
    })

# This dataframe will fail schema_in, but pass schema_out
df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

@check_input(schema_in, 0)
@check_output(schema_out)
def some_function(df):
    return df

function_return = some_function(df)
With your latest changes from #27 and mine from #25, the error message doesn't help the user isolate which schema check/decorator raised the issue:
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-24-de765c22ff69> in <module>
----> 1 function_return = some_function(df)
/panderadev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
376 args = list(args)
377 if isinstance(obj_getter, int):
--> 378 args[obj_getter] = schema.validate(args[obj_getter])
379 elif isinstance(obj_getter, str):
380 if obj_getter in kwargs:
/panderadev/pandera/pandera.py in validate(self, dataframe)
171 if self.index is not None:
172 schema_elements += [self.index]
--> 173 assert all(s(dataframe) for s in schema_elements)
174 if self.transformer is not None:
175 dataframe = self.transformer(dataframe)
/panderadev/pandera/pandera.py in <genexpr>(.0)
171 if self.index is not None:
172 schema_elements += [self.index]
--> 173 assert all(s(dataframe) for s in schema_elements)
174 if self.transformer is not None:
175 dataframe = self.transformer(dataframe)
/panderadev/pandera/pandera.py in __call__(self, df)
339 raise RuntimeError(
340 "need to `set_name` of column before calling it.")
--> 341 return super(Column, self).__call__(df[self._name])
342
343 def __repr__(self):
/panderadev/pandera/pandera.py in __call__(self, series)
225 "non-nullable series contains null values: %s" %
226 (series.name, self._pandas_dtype.value, series.dtype,
--> 227 series[nulls].head(N_FAILURE_CASES).to_dict()))
228 else:
229 raise SchemaError(
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
It would help if the traceback included some way of identifying which decorator triggered the schema failure.
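One possible direction (a hypothetical sketch, not pandera's actual API) is for the decorator to re-raise the failure tagged with the decorator and the decorated function's name, so the final traceback message identifies which check fired:

```python
import functools

class SchemaError(Exception):
    pass

def check_input(schema, obj_getter=0):
    """Hypothetical decorator that tags schema failures with their origin."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                schema.validate(args[obj_getter])
            except SchemaError as e:
                # re-raise with the decorator name and function name attached
                raise SchemaError(
                    "check_input on %r failed: %s" % (fn.__name__, e)) from e
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# toy schema whose validate always fails, to demonstrate the tagged message
class FailingSchema:
    def validate(self, df):
        raise SchemaError("non-nullable series contains null values: {2: nan}")

@check_input(FailingSchema())
def some_function(df):
    return df

try:
    some_function(None)
except SchemaError as e:
    print(e)  # check_input on 'some_function' failed: non-nullable series ...
```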