Generally I deal with relatively long chains of transformations on a number of data frames at once.
This is a simple contrived example:
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, check_output

# 2 rows x 10 columns of standard-normal draws
df = pd.DataFrame(np.random.normal(size=20).reshape((2, 10)))

schema = DataFrameSchema(
    columns={
        "x": Column(pa.Float, nullable=False),
        "y": Column(pa.Float, nullable=False),
    }
)

def to_long(df):
    # transpose to 10 rows x 2 columns; the transposed columns are the
    # original row labels 0 and 1, which get renamed to "x" and "y"
    return df.transpose().rename(columns=dict(zip(df.columns, ["x", "y"])))

@check_output(schema)
def square_root_values(df):
    # negative draws come out as NaN under ** 0.5, violating nullable=False
    return df.apply(lambda row: row ** 0.5)

def sum_by_row(df):
    return df.sum(axis="columns")
Typically I expect some data frames to violate some of my assumptions in intermediate transformations. Debugging such violations with the following code does not let me explore the transformed data frame as it was when the validation error occurred.
try:
    df.pipe(to_long).pipe(square_root_values).pipe(sum_by_row)
except pa.errors.SchemaError as e:
    import pdb
    pdb.set_trace()
Ideally, instead of the try/except block, which gives me the initial data frame, I'd have the ability to debug the data frame as it is in the scope where the error occurred.
The API could look like this:
@check_output(schema, debug=True)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)
A possible alternative would be an on-validation-error callback, which could be used in a more flexible manner: the variables in the current scope would be passed to said callback function.
@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)
Cool, I like the on_error callback idea, as it provides flexibility to the user, with a special convenience "debug" value for quick debugging purposes.
Solution 1: On Error Callback
# the callback function should have signature (df, failure_cases):
# df is the input to `schema.validate` and failure_cases is
# a tidy dataframe of error cases per column.
def function_that_deals_with_error(df, failure_cases):
    # do stuff
    ...

@check_output(schema, on_error=function_that_deals_with_error)
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# and for convenience
@check_output(schema, on_error="debug")
def square_root_values(df):
    return df.apply(lambda row: row ** 0.5)

# where the "debug" option delegates to a function like:
def debug_schema(df, failure_cases):
    try:
        import ipdb as pdb
    except ImportError:
        import pdb
    pdb.set_trace()
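For illustration, a minimal sketch of how a decorator could wire this together. This is hypothetical, not pandera's implementation: check_output_with_callback is an invented name, and it assumes the raised SchemaError carries the failure cases (as in Solution 2 below).

import functools

def check_output_with_callback(schema, on_error=None):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = fn(*args, **kwargs)
            try:
                return schema.validate(out)
            except pa.errors.SchemaError as err:
                if callable(on_error):
                    # hand the offending dataframe and the tidy
                    # failure-case dataframe to the user's callback
                    on_error(out, err.failure_cases)
                elif on_error == "debug":
                    # convenience path: drop into the debugger
                    debug_schema(out, err.failure_cases)
                raise
        return wrapper
    return decorator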
Also, I think the schema.{validate, __call__} methods should have this argument as well; basically, check_input/check_output would pass it into the validate call:

schema.validate(df, on_error=...)
Solution 2: Inject Error Context Variables into the SchemaError Object
Another approach that I've considered in the past is to inject the offending dataframe and failure cases into the SchemaError object, such that they can be accessed like so:
try:
    schema.validate(df)
except SchemaError as e:
    df, failure_cases = e.get_error_context()
Seems like these two solutions aren't mutually exclusive: (1) gives finer-grained control over where the runtime looks for errors, and (2) would catch the first SchemaError in an arbitrary code block.
Had some free time and got this on master now :) #210. Should be available in the next release.
Not sure how useful this is really; the general pdb/ipdb workflow may be enough.
Closing this issue, the feature doesn't seem that useful.
Firstly, thank you for this wonderful work!
I think this feature could very well be useful, especially in long method chains. The usual pdb workflow would not capture the changes made to the dataframes up until the point where the error occurred; instead, one would get the input dataframe. I think a sensible approach would be to add a flag to check_input and check_output, where this issue arises most clearly.
Thanks @iyedg! Glad you're finding this package helpful.
I'll re-open this issue and let's see if we can't get a solution proposal going. Flags in check_input and check_output seem reasonable, but just to get a better understanding of your use case, do you have an example of how you're integrating pandera checks into your workflow?
Hey @iyedg, solution 2 above should now be implemented as part of lazy validation: https://pandera.readthedocs.io/en/v0.4.0/lazy_validation.html
With validate(..., lazy=False), which is the default, you can catch the SchemaError exception and access the data attribute to get the dataframe/series that was passed into the schema.validate call.
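For example, a sketch revisiting the pipeline from the first comment:

# Catch the (non-lazy) SchemaError and drop into the debugger with the
# dataframe as it was when validation failed, rather than the
# pipeline's original input.
try:
    df.pipe(to_long).pipe(square_root_values).pipe(sum_by_row)
except pa.errors.SchemaError as err:
    failed_df = err.data  # the dataframe passed into schema.validate
    import pdb
    pdb.set_trace()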
Let me know if this doesn't serve your use case and you still need the on-error callback solution.
@cosmicBboy Thank you so much for your time and effort! This is exactly what I needed.
But I couldn't help but notice that the lazy keyword is not available in check_input and check_output, and I was wondering if that is intentional. As far as I understand, the idea behind those decorators is that they mirror the schema.validate signature; is that wrong?
Again, thank you for your time and effort.
Ah, this was not intentional and I overlooked those functions... I can get to this in the next few weeks, but if you have the capacity/willingness you can make a PR to make this change :)
Btw, even without lazy validation you can still catch SchemaError, which is raised when lazy=False (SchemaErrors is raised on lazy validation), and access err.data and err.failure_cases.
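For instance, a minimal sketch of both variants, where bad_df stands in for any dataframe that fails the schema:

# Non-lazy (default): raises SchemaError on the first failure.
try:
    schema.validate(bad_df)
except pa.errors.SchemaError as err:
    print(err.data)           # the dataframe passed into validate
    print(err.failure_cases)  # tidy dataframe of the offending values

# Lazy: collects all failures into a single SchemaErrors exception.
try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # every failure across columns and checks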
Closing this issue for now; may revisit the on_error callback if the use case comes up again.