
Comments (6)

cosmicBboy commented on May 14, 2024

Cool, I think this can be handled easily at the decorator level. Will do this weekend.

from pandera.

cosmicBboy commented on May 14, 2024

@mastersplinter this should do it: https://github.com/cosmicBboy/pandera/pull/29


cosmicBboy commented on May 14, 2024

The error message leaves a lot to be desired, not least because the error is re-raised multiple times via Python's exception chaining feature.

I'd support including the dataframe.name in the error message if it's not None, though it would be nice if the error message were a bit cleaner, so that it's clear where the error is coming from (the offending dataframe is in the last traceback):

<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df) <<< this dataframe

At least the way I use dataframes, I often don't name them... is this common practice? In any case, I think two things would help:

  • figuring out how to raise only a single Error (I think I have an idea on how to do this)
  • including the dataframe.name in the error message if it's set.

Edit

I assumed that dataframes had an optional name attribute, but it turns out they don't. I'd prefer not to rely on a user-provided custom attribute, and otherwise I'm not aware of any way to get the variable name of the dataframe object.

I was, however, able to refactor the error-handling logic to produce a cleaner error message:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int
   ...:
   ...: schema = DataFrameSchema(
   ...:     {
   ...:         "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
   ...:     })
   ...:
   ...: df = pd.DataFrame({
   ...:     "a": [1, 2, np.nan]
   ...: })
   ...:
   ...: schema.validate(df)
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
     12 })
     13
---> 14 schema.validate(df)

~/git/pandera/pandera/pandera.py in validate(self, dataframe)
    180             return self.schema.validate(dataframe)
    181         except SchemaError as e:
--> 182             raise SchemaError(str(e)) from None
    183
    184

SchemaError: non-nullable series 'a' contains null values: {2: nan}

From this it's obvious that line 14 contains the offending dataframe, which I think gets to the spirit of this issue. The downside is that it hides the part of the traceback that shows where the error originally occurred in the source.
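To illustrate the trade-off, here is a minimal, standard-library-only sketch of what `raise ... from None` does. The `SchemaError` class and the `inner_validate`, `validate_chained`, `validate_unchained`, and `capture` helpers are hypothetical stand-ins, not pandera's code, and a plain list with `None` stands in for a series with a null value.

```python
class SchemaError(Exception):
    """Stand-in for pandera's SchemaError (hypothetical, for illustration)."""


def inner_validate(values):
    # Raise at the point where the problem is actually detected.
    nulls = {i: v for i, v in enumerate(values) if v is None}
    if nulls:
        raise SchemaError(
            "non-nullable series 'a' contains null values: %s" % nulls)
    return values


def validate_chained(values):
    try:
        return inner_validate(values)
    except SchemaError as e:
        # Re-raising inside `except` keeps the original error as __context__,
        # so Python prints both tracebacks.
        raise SchemaError(str(e))


def validate_unchained(values):
    try:
        return inner_validate(values)
    except SchemaError as e:
        # `from None` sets __suppress_context__, so only this error is
        # printed and the frames inside inner_validate are hidden.
        raise SchemaError(str(e)) from None


def capture(fn, values):
    # Helper that returns the raised error instead of propagating it.
    try:
        fn(values)
    except SchemaError as e:
        return e


chained = capture(validate_chained, [1, 2, None])
unchained = capture(validate_unchained, [1, 2, None])
print(chained.__suppress_context__)    # False
print(unchained.__suppress_context__)  # True
```

Both versions produce the same message. The only difference is whether the interpreter also prints the original traceback, which is exactly the information that gets lost with `from None`.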


cosmicBboy commented on May 14, 2024

After hacking around a little bit and reading through this pandas issue, it seems like including the dataframe name in the error message is not the way to go. Explicitly providing some sort of identifier to the validate call seems like quite a burden on the user, especially since the last traceback message does point to the invalid dataframe:

SchemaError                               Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)

/panderadev/pandera/pandera.py in validate(self, dataframe)
    176         if not isinstance(dataframe, pd.DataFrame):
    177             raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178         return self.schema.validate(dataframe)
    179 
    180 

/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    393                 return s.validate(data)
    394             except SchemaError as x:
--> 395                 raise SchemaError([None] + x.autos, [e] + x.errors)
    396             except BaseException as x:
    397                 message = "%r.validate(%r) raised %r" % (s, data, x)

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

I do agree it's kinda sucky 'cause it's like searching for a needle in a small haystack :/

It also seems like the schema library goes to a lot of trouble catching SchemaErrors (as it should), so to me the best way to produce a clear error message would be to use schema only at the Check level (or to decouple pandera from schema altogether).


cosmicBboy commented on May 14, 2024

@mastersplinter it was easier to remove the schema dependency than I thought :)

after https://github.com/cosmicBboy/pandera/pull/27, the error message (executing the code in the issue description) is much easier to read:

---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
     12 })
     13
---> 14 schema.validate(df)

~/git/pandera/pandera/pandera.py in validate(self, dataframe)
    172         if self.index is not None:
    173             schema_elements += [self.index]
--> 174         assert all(s(dataframe) for s in schema_elements)
    175         if self.transformer is not None:
    176             dataframe = self.transformer(dataframe)

~/git/pandera/pandera/pandera.py in <genexpr>(.0)
    172         if self.index is not None:
    173             schema_elements += [self.index]
--> 174         assert all(s(dataframe) for s in schema_elements)
    175         if self.transformer is not None:
    176             dataframe = self.transformer(dataframe)

~/git/pandera/pandera/pandera.py in __call__(self, df)
    332             raise RuntimeError(
    333                 "need to `set_name` of column before calling it.")
--> 334         return super(Column, self).__call__(df[self._name])
    335
    336     def __repr__(self):

~/git/pandera/pandera/pandera.py in __call__(self, series)
    222                 raise SchemaError(
    223                     "non-nullable series contains null values: %s" %
--> 224                     series[nulls].head(N_FAILURE_CASES).to_dict()
    225                 )
    226

SchemaError: non-nullable series contains null values: {2: nan}


mastersplinter commented on May 14, 2024

@cosmicBboy this is great. Your changes enable faster debugging when using schema.validate.

An issue still remains when using the check_input and check_output decorators on a function.

Take this example:

import pandas as pd
import numpy as np
from pandera import (
    Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int,
    check_input, check_output)

# Define input and output schemas
schema_in = DataFrameSchema(
    {
        "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
    })
schema_out = DataFrameSchema(
    {
        "a": Column(PandasDtype.Float, Check(lambda x: x > 0))
    })

# This dataframe will fail schema_in, but pass schema_out
df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

@check_input(schema_in, 0)
@check_output(schema_out)
def some_function(df):
    return df

function_return = some_function(df)

With your latest changes from #27 and mine from #25, the error message doesn't help the user isolate which schema check/decorator raised the issue:

---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
<ipython-input-24-de765c22ff69> in <module>
----> 1 function_return = some_function(df)

/panderadev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
    376         args = list(args)
    377         if isinstance(obj_getter, int):
--> 378             args[obj_getter] = schema.validate(args[obj_getter])
    379         elif isinstance(obj_getter, str):
    380             if obj_getter in kwargs:

/panderadev/pandera/pandera.py in validate(self, dataframe)
    171         if self.index is not None:
    172             schema_elements += [self.index]
--> 173         assert all(s(dataframe) for s in schema_elements)
    174         if self.transformer is not None:
    175             dataframe = self.transformer(dataframe)

/panderadev/pandera/pandera.py in <genexpr>(.0)
    171         if self.index is not None:
    172             schema_elements += [self.index]
--> 173         assert all(s(dataframe) for s in schema_elements)
    174         if self.transformer is not None:
    175             dataframe = self.transformer(dataframe)

/panderadev/pandera/pandera.py in __call__(self, df)
    339             raise RuntimeError(
    340                 "need to `set_name` of column before calling it.")
--> 341         return super(Column, self).__call__(df[self._name])
    342 
    343     def __repr__(self):

/panderadev/pandera/pandera.py in __call__(self, series)
    225                         "non-nullable series contains null values: %s" %
    226                         (series.name, self._pandas_dtype.value, series.dtype,
--> 227                          series[nulls].head(N_FAILURE_CASES).to_dict()))
    228                 else:
    229                     raise SchemaError(

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

It would help if the traceback gave some way of finding the decorator that triggered the schema failure.
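One possible shape for this, as a rough sketch only: each decorator catches the schema error and re-raises it with a prefix naming the decorator and the wrapped function. Everything here is a hypothetical stand-in written with the standard library (`SchemaError`, `ToySchema`, `check_input_with_context`, `check_output_with_context`). It is not pandera's implementation, and plain lists stand in for dataframes.

```python
import functools


class SchemaError(Exception):
    """Stand-in for pandera's SchemaError (hypothetical, for illustration)."""


class ToySchema:
    """Toy schema: every value must be an instance of `dtype`."""

    def __init__(self, dtype):
        self.dtype = dtype

    def validate(self, values):
        for i, v in enumerate(values):
            if not isinstance(v, self.dtype):
                raise SchemaError("expected %s at index %d, got %r"
                                  % (self.dtype.__name__, i, v))
        return values


def check_input_with_context(schema, arg_index=0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                schema.validate(args[arg_index])
            except SchemaError as e:
                # Prefix the message with the decorator and function name.
                raise SchemaError(
                    "check_input on %s() (argument %d) failed: %s"
                    % (fn.__name__, arg_index, e)) from None
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def check_output_with_context(schema):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            try:
                schema.validate(out)
            except SchemaError as e:
                raise SchemaError("check_output on %s() failed: %s"
                                  % (fn.__name__, e)) from None
            return out
        return wrapper
    return decorator


@check_input_with_context(ToySchema(int))
@check_output_with_context(ToySchema(float))
def some_function(values):
    return values


try:
    some_function([1, 2, float("nan")])
except SchemaError as e:
    message = str(e)
    print(message)
```

With this approach the final `SchemaError` line itself says whether the input or the output check failed, so the user doesn't have to infer it from the `_wrapper` frames.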

