Generally I deal with relatively long chains of transformations on a number of data frames at once.
This is a simple contrived example:
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, check_output

# 2 rows x 10 columns of standard-normal draws
df = pd.DataFrame(np.random.normal(size=20).reshape((2, 10)))

schema = DataFrameSchema(
    columns={
        "x": Column(pa.Float, nullable=False),
        "y": Column(pa.Float, nullable=False),
    }
)

def to_long(df):
    # transpose to 10 rows x 2 columns; the transposed columns are the
    # original row labels 0 and 1, which get renamed to "x" and "y"
    return df.transpose().rename(columns=dict(zip(df.columns, ["x", "y"])))

@check_output(schema)
def square_root_values(df):
    # negative draws come out as NaN under ** 0.5, violating nullable=False
    return df.apply(lambda row: row ** 0.5)

def sum_by_row(df):
    return df.sum(axis="columns")
Typically I expect some data frames to violate some of my assumptions in intermediate transformations. Debugging such violations with the following code does not let me explore the transformed data frame as it was when the validation error occurred.
try:
    df.pipe(to_long).pipe(square_root_values).pipe(sum_by_row)
except pa.errors.SchemaError as e:
    import pdb
    pdb.set_trace()
Ideally, instead of the try/except block, which gives me the initial data frame, I'd have the ability to debug the data frame as it is in the scope where the error occurred.
The API could look like this:
@check_output(schema, debug=True)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)
A possible alternative would be an on-validation-error callback, which could be used in a more flexible manner: the variables in the current scope would be passed to said callback function.
@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)
Cool, I like the on_error callback idea, as it provides flexibility to the user, with a special convenience "debug" value for quick debugging purposes.
Solution 1: On Error Callback
# the callback function should have signature (df, failure_cases):
# df is the input to `schema.validate` and failure_cases is
# a tidy dataframe of error cases per column.
def function_that_deals_with_error(df, failure_cases):
    # do stuff
    ...

@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# and for convenience
@check_output(schema, on_error="debug")
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# where the "debug" option delegates to a function like:
def debug_schema(df, failure_cases):
    try:
        import ipdb as pdb
    except ImportError:
        import pdb
    pdb.set_trace()
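For illustration, a minimal sketch of how a decorator could wire this together. This is hypothetical, not pandera's implementation: check_output_with_callback is an invented name, and it assumes the raised SchemaError carries the failure cases (as in Solution 2 below).

import functools

def check_output_with_callback(schema, on_error=None):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            try:
                return schema.validate(out)
            except pa.errors.SchemaError as err:
                if callable(on_error):
                    # hand the offending dataframe and the tidy
                    # failure-case dataframe to the user's callback
                    on_error(out, err.failure_cases)
                elif on_error == "debug":
                    # convenience path: drop into the debugger
                    debug_schema(out, err.failure_cases)
                raise
        return wrapper
    return decorator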
Also, I think the schema.{validate, __call__} methods should have this argument as well; basically, check_input/check_output would pass it into the validate call:

schema.validate(df, on_error=...)
Solution 2: Inject Error Context Variables into the SchemaError Object
Another approach that I've considered in the past is to inject the offending dataframe and failure cases into the SchemaError object, such that they can be accessed like so:
try:
    schema.validate(df)
except SchemaError as e:
    df, failure_cases = e.get_error_context()
Seems like these two solutions aren't mutually exclusive: (1) gives finer-grained control over where the runtime looks for errors, and (2) would catch the first SchemaError in an arbitrary code block.
Had some free time and got this on master now :) #210. Should be available in the next release.
Not sure how useful this is really; the general pdb/ipdb workflow may be enough.
Closing this issue, the feature doesn't seem that useful.
Firstly, thank you for this wonderful work!
I think this feature could very well be useful, especially in long method chains. The usual pdb workflow would not capture the changes made to the dataframes up until the point where the error occurred; instead, one would get the input dataframe. I think a sensible approach would be to add a flag to check_input and check_output, where this issue arises most clearly.
Thanks @iyedg! Glad you're finding this package helpful.
I'll re-open this issue and let's see if we can't get a solution proposal going. Flags in check_input and check_output seem reasonable, but just to get a better understanding of your use case, do you have an example of how you're integrating pandera checks into your workflow?
Hey @iyedg, solution 2 above should now be implemented as part of lazy validation: https://pandera.readthedocs.io/en/v0.4.0/lazy_validation.html
With validate(..., lazy=False), which is the default, you can catch the SchemaError exception and access the data attribute to get the dataframe/series that was passed into the schema.validate call.
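For example, a sketch revisiting the pipeline from the first comment:

# Catch the (non-lazy) SchemaError and drop into the debugger with the
# dataframe as it was when validation failed, rather than the
# pipeline's original input.
try:
    df.pipe(to_long).pipe(square_root_values).pipe(sum_by_row)
except pa.errors.SchemaError as err:
    failed_df = err.data  # the dataframe passed into schema.validate
    import pdb
    pdb.set_trace()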
Let me know if this doesn't serve your use case and you still need the on-error callback solution.
@cosmicBboy Thank you so much for your time and effort! This is exactly what I needed.
But I couldn't help but notice that the lazy keyword is not available in check_input and check_output, and I was wondering if that is intentional. As far as I understand, the idea behind those decorators is that they mirror the schema.validate signature; is that wrong?
Again, thank you for your time and effort.
Ah, this was not intentional and I overlooked those functions... I can get to this in the next few weeks, but if you have the capacity/willingness you can make a PR to make this change :)
Btw, even without lazy validation you can still catch SchemaError, which is raised when lazy=False (SchemaErrors is raised on lazy validation), and access err.data and err.failure_cases.
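For instance, a minimal sketch of both variants, where bad_df stands in for any dataframe that fails the schema:

# Non-lazy (default): raises SchemaError on the first failure.
try:
    schema.validate(bad_df)
except pa.errors.SchemaError as err:
    print(err.data)           # the dataframe passed into validate
    print(err.failure_cases)  # tidy dataframe of the offending values

# Lazy: collects all failures into a single SchemaErrors exception.
try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # every failure across columns and checks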
Closing this issue for now; may revisit the on_error callback if the use case comes up again.