Comments (6)
cool, this I think can be handled easily at the decorator level. Will do this weekend
from pandera.
@mastersplinter this should do it: https://github.com/cosmicBboy/pandera/pull/29
The error message leaves a lot to be desired, among other things the fact that it's handled multiple times via Python's error chaining feature.
I'd support including the `dataframe.name` in the error message if it's not None, though it would be nice if the error message were a bit cleaner so it would be clear where it's coming from (the offending dataframe is in the last traceback):
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df) <<< this dataframe
At least the way I use dataframes, I often don't name them... is this common practice? In any case, I think two things would help:
- figuring out how to raise only a single SchemaError (I think I have an idea on how to do this)
- including the `dataframe.name` in the error message if it's set.
Edit: I assumed that dataframes had an optional `name` attribute, but it turns out they don't. I'd prefer not to rely on a user-provided custom attribute... otherwise I'm unaware of a way of getting the variable name of the dataframe object.
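A quick check confirms this: pandas DataFrames have no `name` attribute (attribute access falls back to column lookup), while each Series does carry one. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# DataFrame objects have no `name` attribute...
print(hasattr(df, "name"))  # False

# ...but each Series (column) does
print(df["a"].name)  # a
```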
I was, however, able to refactor the error-handling logic to produce a cleaner error message:
In [1]: import pandas as pd
   ...: import numpy as np
   ...: from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int
   ...:
   ...: schema = DataFrameSchema(
   ...:     {
   ...:         "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
   ...:     })
   ...:
   ...: df = pd.DataFrame({
   ...:     "a": [1, 2, np.nan]
   ...: })
   ...:
   ...: schema.validate(df)
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
12 })
13
---> 14 schema.validate(df)
~/git/pandera/pandera/pandera.py in validate(self, dataframe)
180 return self.schema.validate(dataframe)
181 except SchemaError as e:
--> 182 raise SchemaError(str(e)) from None
183
184
SchemaError: non-nullable series 'a' contains null values: {2: nan}
From this it's obvious that line 14 contains the offending dataframe, which I think gets to the spirit of this issue. The downside is that it obscures the part of the traceback showing where the error originally occurred in the source.
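The cleaner message comes from the `raise ... from None` pattern visible in the traceback above, which suppresses Python's exception chaining. A minimal standalone sketch (the `SchemaError` class here is a stand-in, not pandera's):

```python
class SchemaError(Exception):
    pass

def validate():
    try:
        # inner validation raises the original error
        raise SchemaError("non-nullable series 'a' contains null values: {2: nan}")
    except SchemaError as e:
        # `from None` suppresses the "During handling of the above exception..."
        # chained traceback, so only one error is shown to the user
        raise SchemaError(str(e)) from None

try:
    validate()
except SchemaError as e:
    print(e)
    print(e.__suppress_context__)  # True: chaining was suppressed
```

The trade-off is exactly the one noted above: suppressing the chain also hides the inner frames where the failure was first detected.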
After hacking around a little bit and reading through this pandas issue, it seems like including the dataframe name in the error message is not the way to go, and explicitly providing some sort of identifier to the `validate` call seems like quite a burden on the user, especially since the last traceback message does point to the invalidated dataframe:
SchemaError Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)
/panderadev/pandera/pandera.py in validate(self, dataframe)
176 if not isinstance(dataframe, pd.DataFrame):
177 raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178 return self.schema.validate(dataframe)
179
180
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
393 return s.validate(data)
394 except SchemaError as x:
--> 395 raise SchemaError([None] + x.autos, [e] + x.errors)
396 except BaseException as x:
397 message = "%r.validate(%r) raised %r" % (s, data, x)
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
I do agree it's kinda sucky 'cause it's like searching for a needle in a small haystack :/
It also seems like `schema` takes a lot of trouble catching `SchemaError`s (as it should), and to me the best solution would be to produce a clear error message, which would mean using `schema` only at the `Check` level (or decoupling pandera from `schema` altogether).
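For illustration, the difference could look something like this: instead of letting a wrapper library re-wrap and accumulate errors, the check itself raises a single `SchemaError` with the full context. This is a hypothetical sketch, not pandera's actual implementation:

```python
class SchemaError(Exception):
    pass

def check_nullable(series_name, values):
    # NaN is the only float that is not equal to itself
    nulls = {i: v for i, v in enumerate(values) if v != v}
    if nulls:
        raise SchemaError(
            "non-nullable series %r contains null values: %s" % (series_name, nulls))

check_nullable("a", [1.0, 2.0])  # passes silently
try:
    check_nullable("a", [1.0, 2.0, float("nan")])
except SchemaError as e:
    print(e)  # non-nullable series 'a' contains null values: {2: nan}
```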
@mastersplinter it was easier to remove the `schema` dependency than I thought :)
after https://github.com/cosmicBboy/pandera/pull/27, the error message (executing the code in the issue description) is much easier to read:
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-1-f4098e24342e> in <module>
12 })
13
---> 14 schema.validate(df)
~/git/pandera/pandera/pandera.py in validate(self, dataframe)
172 if self.index is not None:
173 schema_elements += [self.index]
--> 174 assert all(s(dataframe) for s in schema_elements)
175 if self.transformer is not None:
176 dataframe = self.transformer(dataframe)
~/git/pandera/pandera/pandera.py in <genexpr>(.0)
172 if self.index is not None:
173 schema_elements += [self.index]
--> 174 assert all(s(dataframe) for s in schema_elements)
175 if self.transformer is not None:
176 dataframe = self.transformer(dataframe)
~/git/pandera/pandera/pandera.py in __call__(self, df)
332 raise RuntimeError(
333 "need to `set_name` of column before calling it.")
--> 334 return super(Column, self).__call__(df[self._name])
335
336 def __repr__(self):
~/git/pandera/pandera/pandera.py in __call__(self, series)
222 raise SchemaError(
223 "non-nullable series contains null values: %s" %
--> 224 series[nulls].head(N_FAILURE_CASES).to_dict()
225 )
226
SchemaError: non-nullable series contains null values: {2: nan}
@cosmicBboy this is great. Your changes enable faster debugging when using `schema.validate`.
An issue still remains when using the `check_input` and `check_output` decorators on a function.
Take this example:
import pandas as pd
import numpy as np
from pandera import Column, DataFrameSchema, PandasDtype, SeriesSchema, Check, Int
from pandera import check_input, check_output

# Define input and output schemas
schema_in = DataFrameSchema(
    {
        "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
    })
schema_out = DataFrameSchema(
    {
        "a": Column(PandasDtype.Float, Check(lambda x: x > 0))
    })

# This dataframe will fail schema_in, but pass schema_out
df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

@check_input(schema_in, 0)
@check_output(schema_out)
def some_function(df):
    return df

function_return = some_function(df)
With your latest changes from #27 and mine from #25, the error message doesn't help the user isolate which schema check/decorator raised the issue:
---------------------------------------------------------------------------
SchemaError Traceback (most recent call last)
<ipython-input-24-de765c22ff69> in <module>
----> 1 function_return = some_function(df)
/panderadev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
376 args = list(args)
377 if isinstance(obj_getter, int):
--> 378 args[obj_getter] = schema.validate(args[obj_getter])
379 elif isinstance(obj_getter, str):
380 if obj_getter in kwargs:
/panderadev/pandera/pandera.py in validate(self, dataframe)
171 if self.index is not None:
172 schema_elements += [self.index]
--> 173 assert all(s(dataframe) for s in schema_elements)
174 if self.transformer is not None:
175 dataframe = self.transformer(dataframe)
/panderadev/pandera/pandera.py in <genexpr>(.0)
171 if self.index is not None:
172 schema_elements += [self.index]
--> 173 assert all(s(dataframe) for s in schema_elements)
174 if self.transformer is not None:
175 dataframe = self.transformer(dataframe)
/panderadev/pandera/pandera.py in __call__(self, df)
339 raise RuntimeError(
340 "need to `set_name` of column before calling it.")
--> 341 return super(Column, self).__call__(df[self._name])
342
343 def __repr__(self):
/panderadev/pandera/pandera.py in __call__(self, series)
225 "non-nullable series contains null values: %s" %
226 (series.name, self._pandas_dtype.value, series.dtype,
--> 227 series[nulls].head(N_FAILURE_CASES).to_dict()))
228 else:
229 raise SchemaError(
SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}
It would help if the traceback included some way of identifying which decorator triggered the schema failure.
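One possible direction (a hypothetical sketch, not pandera's actual API) is for the decorator to re-raise the failure tagged with the decorator and the decorated function's name, so the final traceback message identifies which check fired:

```python
import functools

class SchemaError(Exception):
    pass

def check_input(schema, obj_getter=0):
    """Hypothetical decorator that tags schema failures with their origin."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                schema.validate(args[obj_getter])
            except SchemaError as e:
                # re-raise with the decorator name and function name attached
                raise SchemaError(
                    "check_input on %r failed: %s" % (fn.__name__, e)) from e
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# toy schema whose validate always fails, to demonstrate the tagged message
class FailingSchema:
    def validate(self, df):
        raise SchemaError("non-nullable series contains null values: {2: nan}")

@check_input(FailingSchema())
def some_function(df):
    return df

try:
    some_function(None)
except SchemaError as e:
    print(e)  # check_input on 'some_function' failed: non-nullable series ...
```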