Comments (4)
Speaking of modularization, I have a question:
It seems that the Check class (and possibly its subclasses) is hard to unit test because the __call__ and _vectorized_check methods take parent_schema and check_index as mandatory inputs. Is there any reason for this other than error formatting?
If there isn't, I would suggest disentangling the classes from each other.
Take the _vectorized_check function, for example:
    def _vectorized_check(self, parent_schema, check_index, check_obj):
        """Perform a vectorized check on a series.

        :param parent_schema: The schema object that is being checked and that
            was inherited from the parent class.
        :param check_index: The validator to check the series for
        :param dict check_obj: a dictionary of pd.Series to be used by
            `_check_fn` and `_vectorized_series_check`
        """
        val_result = self.fn(check_obj)
        if isinstance(val_result, pd.Series):
            if not val_result.dtype == PandasDtype.Bool.value:
                raise TypeError(
                    "validator %d: %s must return bool or Series of type "
                    "bool, found %s" %
                    (check_index, self.fn.__name__, val_result.dtype))
            if val_result.all():
                return True
            elif isinstance(check_obj, dict) or \
                    check_obj.shape[0] != val_result.shape[0] or \
                    (check_obj.index != val_result.index).all():
                raise SchemaError(
                    self.generic_error_message(parent_schema, check_index))
            else:
                raise SchemaError(self.vectorized_error_message(
                    parent_schema, check_index, check_obj[~val_result]))
        else:
            if val_result:
                return True
            raise SchemaError(
                self.generic_error_message(parent_schema, check_index))
It seems that it essentially either returns True or raises an error indicating where something went wrong. Wouldn't it be more straightforward to return either True or the pandas index values for the failures? Then the failure handling could be done in the calling schema objects.
I see another benefit: the schema objects could then easily be configured either to raise errors as they do now or to return the erroneous rows instead.
Just an idea. I don't want to pull the library apart. It's just that these are internal methods anyway, and maybe this could help disentangle the classes, allow for separate testing, and improve clarity a bit.
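To make the idea concrete, here is a rough sketch of what I have in mind. It is not pandera's actual code: StandaloneCheck, failure_index and validate are hypothetical names, and SchemaError is redefined locally only so the snippet runs on its own.

```python
import pandas as pd


class SchemaError(Exception):
    """Local stand-in for pandera's SchemaError so the sketch is self-contained."""


class StandaloneCheck:
    """Hypothetical check that knows nothing about its parent schema."""

    def __init__(self, fn):
        self.fn = fn

    def failure_index(self, series):
        """Return the index labels of failing rows (empty if everything passes)."""
        result = self.fn(series)
        if isinstance(result, pd.Series):
            if not pd.api.types.is_bool_dtype(result):
                raise TypeError(
                    "%s must return bool or Series of type bool, found %s"
                    % (self.fn.__name__, result.dtype))
            # Keep only the labels where the check evaluated to False.
            return result[~result].index
        # Scalar result: the whole series either passes or fails.
        return series.index[:0] if result else series.index


def validate(series, checks, raise_errors=True):
    """Hypothetical schema-side handling: raise, or collect the failing rows."""
    failed = series.index[:0]
    for check_index, check in enumerate(checks):
        failures = check.failure_index(series)
        if len(failures) == 0:
            continue
        if raise_errors:
            raise SchemaError(
                "validator %d failed at index %s"
                % (check_index, list(failures)))
        failed = failed.union(failures)
    return series.loc[failed]
```

With this split, the check can be unit tested with nothing but a Series, and the caller decides whether a failure becomes an exception or a returned set of rows.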
Yes, I've been meaning to clean up this part of the internal API for some time now.
As you suggest, I think we can move the error formatting and reporting functionality currently in the Check class into a utility module, which would allow Check.__call__ to not rely on the parent_schema or check_index arguments, but I'd consider this another ticket. You're welcome to create an issue (you can just reference your comment above) and we can tackle that as a separate problem.
The scope of this issue is really to get out of single-file-hell :)
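As a very rough sketch of that separation (hypothetical, not the current pandera code: the function signatures and message wording below are assumptions, only the names generic_error_message and vectorized_error_message come from the existing Check class), the utility module could expose plain functions that take already-extracted values instead of living on Check:

```python
def generic_error_message(schema_repr, check_index, check_name):
    """Build a message for a check that failed without row-level detail.

    All arguments are plain values supplied by the calling schema, so the
    check itself never needs a reference to parent_schema or check_index.
    """
    return "%s failed validator %d: %s" % (schema_repr, check_index, check_name)


def vectorized_error_message(schema_repr, check_index, check_name, failure_cases):
    """Build a message that also lists the failing values."""
    cases = ", ".join(repr(case) for case in failure_cases)
    return "%s\nfailure cases: %s" % (
        generic_error_message(schema_repr, check_index, check_name),
        cases,
    )
```

The schema object would then be the only place that knows both the check's position and how to phrase the error.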
I created such an issue, though I don't expect us to work on it very soon.
Are you actively working on the modularization right now? I am a bit afraid that the branches will drift too far apart if we aren't careful. I am currently working on some built-in checks and on the YAML loading topic. I will try to keep that work in separate files for now to avoid conflicts.
Are you actively working on the modularization right now?
Should be pushing something up by the end of this weekend
I am currently working on some built-in checks
Let's move this discussion over to #74, as there are a couple of decisions we should discuss re: API design that I'd like your and @mastersplinter's feedback on.