Comments (6)
Hey @cosmicBboy, I came across this SO answer of yours, which is related to this issue.
Is having some kind of drop_invalid=<bool> kwarg on the Schema something that would be inside the scope of what pandera is meant for? If so, I'd be interested in opening a PR to give it a shot.
I'm using pandera to build "custom data frames" as an ORM of sorts to map parquet files to pandas dfs. I've achieved the drop_invalid behaviour by setting a validate method on the base class that all the custom data frames inherit from, which looks for this kwarg:
import pandas as pd
import pandera as pa


class BaseCustomDataFrame(pd.DataFrame):
    SCHEMA = None  # subclasses set this to a pandera schema

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not self.SCHEMA:
            raise SchemaNotProvidedError  # custom exception, not shown here
        self.validate()

    def validate(self, drop_invalid_rows=False):
        if drop_invalid_rows:
            fail_index = []
            try:
                self.SCHEMA.validate(self)
            except pa.errors.SchemaError as ex:
                # Row labels of the values that failed validation
                fail_index = ex.failure_cases["index"]
            self = self[~self.index.isin(fail_index)]
        else:
            self.SCHEMA.validate(self)
        return self
Moving this behaviour into the library would be useful and would remove the try/except logic that others might also need to implement.
from pandera.
I think drop_invalid makes sense!
from pandera.
The current solution for this would be:
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    invalid_data = err.data.loc[err.failure_cases["index"]]
    valid_data = err.data.loc[~err.data.index.isin(err.failure_cases["index"])]
from pandera.
Giving users more info about failure cases would be great.
I feel like it adds complexity to let the user just use .failure_cases. If I encountered this scenario, I'd probably just write what I needed in pandas, but the concept makes sense for SchemaErrors alone.
from pandera.
I also want to partition the data into valid / invalid.
The use case is to continue working with the valid data instead of dropping everything, and to push the failed cases to another place for inspection.
However, the above method still does not work very well: a lot of the time I get the index reported as None, which is a big problem for this.
Related: #1080
from pandera.
@sebastian-heinz basically, rows in the err.failure_cases dataframe where index is None mean that the error applies to the entire column.
You could ignore those rows with notna():
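try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # Keep only failure cases tied to a specific row (index is not None)
    failed_index = err.failure_cases["index"]
    failed_index = failed_index[failed_index.notna()]
    invalid_data = err.data.loc[err.data.index.isin(failed_index)]
    valid_data = err.data.loc[~err.data.index.isin(failed_index)]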
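To make the None handling easier to try out, here is a minimal sketch of the partition step as a standalone helper. It uses only pandas; the function name partition_by_failure_cases and the hand-built failure_cases frame are hypothetical stand-ins for what err.failure_cases might look like, and the only thing the helper relies on is an "index" column that is None for column-level errors.

```python
import pandas as pd


def partition_by_failure_cases(data, failure_cases):
    """Split `data` into (valid, invalid, column_level) frames.

    Hypothetical helper: assumes `failure_cases` has an "index" column that
    holds the row label of each failure, or None when the failure applies
    to a whole column rather than to one row.
    """
    row_level = failure_cases["index"].notna()
    # Row labels that failed at least one check (deduplicated)
    failed_labels = failure_cases.loc[row_level, "index"].unique()
    is_invalid = data.index.isin(failed_labels)
    return data.loc[~is_invalid], data.loc[is_invalid], failure_cases.loc[~row_level]


df = pd.DataFrame({"price": [10, -5, 30, -1], "name": ["a", "b", "c", "d"]})

# Hand-built stand-in for err.failure_cases (values are illustrative only)
failure_cases = pd.DataFrame(
    {
        "column": ["price", "price", "name"],
        "failure_case": [-5, -1, "object"],
        "index": pd.Series([1, 3, None], dtype="object"),
    }
)

valid_data, invalid_data, column_level = partition_by_failure_cases(df, failure_cases)
```

Returning the column-level failures separately, rather than silently dropping them, keeps them available for inspection, since they cannot be attributed to any particular row.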
from pandera.