Comments (6)
Can `coerce` have unintended consequences? Turning None into the string `'None'` in a string column, for example?
That shouldn't be the case... please feel free to test it out, if that does happen it's a bug :)
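For what it's worth, the failure mode the question worries about does exist in plain pandas: a naive `astype(str)` stringifies None. Whether pandera's `coerce` goes through that path is exactly what's worth testing; this is only a sketch of the underlying pandas behavior, not a claim about pandera's coercion logic:

```python
import pandas as pd

# an object column holding a real string and a None
s = pd.Series(["one", None], dtype=object)

# naive string coercion in pandas stringifies missing values:
# None becomes the literal string 'None'
coerced = s.astype(str)
print(coerced.tolist())  # ['one', 'None']
```

So if a coercion implementation ever falls through to a bare `astype(str)`, the bug described above would appear.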
from pandera.
Isn't this error expected? `col_lst` is not in the dataframe passed into the schema.
Also, `.convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")` is changing the datatype of the float column:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": None},
            {"idx": 2, "col_float": None, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
)
print(df.dtypes)
idx Int64
col_float Int64
col_str string[python]
col_lst object
dtype: object
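The `Int64` for `col_float` comes from `convert_dtypes` itself, independent of pandera: when every non-missing value in a float column is integer-like (here the single `1.0`), pandas infers a nullable integer type. A minimal sketch:

```python
import pandas as pd

# float64 column where the only non-missing value is integer-like
s = pd.Series([1.0, None, None])
print(s.dtype)                   # float64

# convert_dtypes sees only whole numbers, so it infers nullable Int64
print(s.convert_dtypes().dtype)  # Int64
```

Adding a genuinely fractional value (e.g. `1.5`) keeps the column `Float64` instead.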
Yes! Sorry, that was a bad copy-pasta.
Here is the error that I think I'm struggling with:
import numpy as np
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": ["1", "2", "3"]},
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)
# works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)
# also works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .pipe(DataFrame[DFModel])
)
...
/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
31 """Collect schema error, raising exception if lazy is False.
32
33 :param reason_code: string representing reason for error.
34 :param schema_error: ``SchemaError`` object.
35 :param original_exc: original exception associated with the SchemaError.
36 """
37 if not self._lazy:
---> 38 raise schema_error from original_exc
40 # delete data of validated object from SchemaError object to prevent
41 # storing copies of the validated DataFrame/Series for every
42 # SchemaError collected.
43 del schema_error.data
SchemaError: expected series 'col_lst' to have type list[str], got float64
df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
    .pipe(DataFrame[DFModel])
)
...
/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
31 """Collect schema error, raising exception if lazy is False.
32
33 :param reason_code: string representing reason for error.
34 :param schema_error: ``SchemaError`` object.
35 :param original_exc: original exception associated with the SchemaError.
36 """
37 if not self._lazy:
---> 38 raise schema_error from original_exc
40 # delete data of validated object from SchemaError object to prevent
41 # storing copies of the validated DataFrame/Series for every
42 # SchemaError collected.
43 del schema_error.data
SchemaError: expected series 'col_float' to have type float64, got Int64
And it is related to the same type conversion that you've noticed: Nones pass validation, but NaNs don't.
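The dtype difference is visible in plain pandas, and it is presumably what the schema check is reacting to: a column built from Nones stays `object`, which can hold lists, while NaN is a float value and drags the column to `float64`:

```python
import numpy as np
import pandas as pd

# Nones are kept as Python objects, so the column stays object dtype
print(pd.Series([None, None]).dtype)      # object

# NaN is a float, so the column becomes float64
print(pd.Series([np.nan, np.nan]).dtype)  # float64
```

Against a `Series[list[str]]` field, the `object` column passes the dtype check but `float64` cannot.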
So there are two solutions:

- Set `coerce=True`:

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

    class Config:
        coerce = True

This will validate correctly.

- Convert `col_lst` to an object data type:

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .astype({"col_lst": "object"})
    .pipe(DataFrame[DFModel])
)
The issue here is that, with `coerce=False`, pandera just validates the data types as-is: whatever dtypes the columns of the incoming dataframe have are what pandera checks against the schema types.
Right, this makes sense. By this point, pandas has forced `col_lst` into a float column in order to accept a NaN.