
Comments (12)

iyedg commented on May 15, 2024

Generally I deal with relatively long chains of transformation on a number of data frames at once.

This is a simple contrived example:

import numpy as np
import pandas as pd

import pandera as pa
from pandera import Column, DataFrameSchema, check_output

df = pd.DataFrame(np.random.normal(size=20).reshape((2, 10)))

schema = DataFrameSchema(
    columns={
        "x": Column(pa.Float, nullable=False),
        "y": Column(pa.Float, nullable=False),
    }
)


def to_long(df):
    return df.transpose().rename(columns=dict(zip(df.columns, ["x", "y"])))


@check_output(schema)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)


def sum_by_row(df):
    return df.sum(axis="columns")

Typically I expect some data frames to violate my assumptions in intermediate transformations. Debugging such violations with the following code does not let me explore the transformed data frame as it was at the point where the validation error occurred.

try:
    df.pipe(to_long).pipe(square_root_values).pipe(sum_by_row)
except pa.errors.SchemaError as e:
    import pdb

    pdb.set_trace()

Ideally, instead of the try .. except block, which only gives me the initial data frame, I'd be able to debug the data frame as it exists in the scope where the error occurred.

The API could look like this:

@check_output(schema, debug=True)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

A possible alternative is a callback invoked on validation error. That would be more flexible, since the variables in the current scope could be passed to the callback function.

@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

from pandera.

cosmicBboy commented on May 15, 2024

Cool, I like the on_error callback idea, as it provides flexibility to the user, with a special convenience "debug" value for quick debugging purposes.

Solution 1: On Error Callback

# callback function should have signature (df, failure_cases)
# df is the input to `schema.validate` and failure cases is
# a tidy dataframe of error cases per column.
def function_that_deals_with_error(df, failure_cases):
    # do stuff
    ...

@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# and for convenience
@check_output(schema, on_error="debug")
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# where the "debug" option delegates to a function like:
def debug_schema(df, failure_cases):
    # prefer ipdb when it is installed, otherwise fall back to the stdlib debugger
    try:
        import ipdb as pdb
    except ImportError:
        import pdb
    pdb.set_trace()

I think the schema.{validate, __call__} methods should have this argument as well; check_input/check_output would simply pass it through to the validate call.

schema.validate(df, on_error=...)

Solution 2: Inject Error Context Variables into SchemaError Object

Another approach that I've considered in the past is to inject the offending dataframe and failure cases into the SchemaError object, such that they can be accessed like so:

try:
    schema.validate(df)
except pa.errors.SchemaError as e:
    df, failure_cases = e.get_error_context()

These two solutions don't seem mutually exclusive: (1) gives finer-grained control over where the runtime looks for errors, while (2) would catch the first SchemaError in an arbitrary code block.

cosmicBboy commented on May 15, 2024

had some free time and got this on master now :) see #210. It should be available in the next release

cosmicBboy commented on May 15, 2024

not sure how useful this is really, general pdb/ipdb workflow may be enough

cosmicBboy commented on May 15, 2024

closing this issue, feature doesn't seem that useful.

iyedg commented on May 15, 2024

Firstly, thank you for this wonderful work!

I think this feature could very well be useful, especially in long method chains. The usual pdb workflow would not capture the changes made to the dataframes up until the point where the error occurred; instead, one would get the input dataframe. I think a sensible approach would be to add a flag to check_input and check_output, where this issue arises most clearly.

cosmicBboy commented on May 15, 2024

Thanks @iyedg! Glad you're finding this package helpful.

I'll re-open this issue and let's see if we can get a solution proposal going. Flags in check_input and check_output seem reasonable, but to get a better understanding of your use case, do you have an example of how you're integrating pandera checks into your workflow?

cosmicBboy commented on May 15, 2024

hey @iyedg solution 2 above should now be implemented as part of lazy validation: https://pandera.readthedocs.io/en/v0.4.0/lazy_validation.html

If validate(..., lazy=False), which is the default, you can catch the SchemaError exception and access the data attribute to get the dataframe/series that was passed into the schema.validate call.

Let me know if this doesn't serve your use case and you still need the on-error callback solution.

iyedg commented on May 15, 2024

@cosmicBboy Thank you so much for your time and effort! This is exactly what I needed.
But I couldn't help noticing that the lazy keyword is not available in check_input and check_output, and I was wondering whether that is intentional. As far as I understand, the idea behind those decorators is that they mirror the schema.validate signature. Is that wrong?

Again, thank you for your time and effort.

cosmicBboy commented on May 15, 2024

ah, this was not intentional, I overlooked those functions... I can get to this in the next few weeks, but if you have the capacity/willingness you can make a PR to make this change :)

cosmicBboy commented on May 15, 2024

btw even without lazy validation you can still catch SchemaError, which is raised when lazy=False (SchemaErrors is raised on lazy validation) and access err.data and err.failure_cases.

cosmicBboy commented on May 15, 2024

closing this issue for now, may revisit the on_error callback if the use case comes up again
