Comments (7)
Hi @max-raphael this is somewhat of a challenging use case to fulfill with datetimes because if we have a timezone-agnostic datetime, how do we deal with coercion?
Imagine we support something like:
class MySchema(DataFrameModel):
    local_datetime: DateTime(has_tz=True)  # just checks that the datetimes have any timezone

    class Config:
        coerce = True
If we do coerce=True, what timezone should we coerce to? Solutions here would be:
- Default to UTC
- Raise an exception
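For illustration, the "has any timezone" check by itself can be sketched in plain pandas. This is only a rough sketch: `has_any_timezone` is a hypothetical helper name, not part of pandera, and it assumes the column already has a datetime dtype.

```python
import pandas as pd


def has_any_timezone(series: pd.Series) -> bool:
    """Return True if the series has a timezone-aware datetime dtype.

    Hypothetical helper (not pandera API): it only asks whether a timezone
    is attached, not which one.
    """
    return isinstance(series.dtype, pd.DatetimeTZDtype)


# A timezone-naive series and two timezone-aware variants of it.
naive = pd.Series(pd.to_datetime(["2023-01-01 12:00", "2023-06-01 08:30"]))
paris = naive.dt.tz_localize("Europe/Paris")
utc = naive.dt.tz_localize("UTC")

print(has_any_timezone(naive), has_any_timezone(paris), has_any_timezone(utc))
```

The check itself is unambiguous; the open question is only what coerce=True should do when it fails.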
from pandera.
This is similar to the problem of having a generic Number type: this can check if the data type is any of the int or float types, but when we coerce, what data type should it default to?
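A small sketch of that analogy, using pandas' own dtype helpers. `is_number` is a hypothetical name for the generic check, not an existing pandera type.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def is_number(series: pd.Series) -> bool:
    """Hypothetical generic check: accept any integer or float dtype."""
    return bool(is_integer_dtype(series) or is_float_dtype(series))


ints = pd.Series([1, 2, 3])
floats = pd.Series([1.0, 2.5])
strings = pd.Series(["1", "2"])

# Checking is well defined for each of these series...
print(is_number(ints), is_number(floats), is_number(strings))

# ...but coercing `strings` to "a number" has no single answer:
# astype("int64") and astype("float64") both satisfy the check.
```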
from pandera.
I hear you, that does pose a tricky problem. Thinking about it from my perspective as a user, I think I would prefer to have this as an option but be disallowed from coercing this field (via some Exception) due to the ambiguous nature of the data type rather than not have it accessible to me at all.
Perhaps even an Exception is too much. Pandera could still allow users to specify coerce=True and coerce other fields, and add a warning level log statement that informs the user that this field cannot be coerced due to its data type.
from pandera.
> Perhaps even an Exception is too much. We could still allow users to specify coerce=True and coerce other fields, and add a warning level log statement that informs the user that this field cannot be coerced due to its data type.
How would you feel about defaulting to UTC on coercion (if the incoming raw data is not TZ-aware) and raising a warning that the dtypes are coerced to UTC? I generally like to do something rather than nothing on coercion to prevent propagation of surprise (i.e. a non-TZ-aware dataframe after validation with coerce=True).
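The proposed behavior could look roughly like the sketch below, written against plain pandas. `coerce_to_tz_aware` and its `default_tz` parameter are hypothetical names for illustration, not pandera API, and the sketch assumes the input already has a datetime dtype.

```python
import warnings

import pandas as pd


def coerce_to_tz_aware(series: pd.Series, default_tz: str = "UTC") -> pd.Series:
    """Hypothetical coercion: leave tz-aware data alone, localize naive data.

    Timezone-naive datetimes are localized to `default_tz` and a warning is
    emitted, so the result is never silently left timezone-naive.
    """
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        # Already timezone-aware: keep whatever timezone the data carries.
        return series
    warnings.warn(
        f"Timezone-naive datetimes were localized to {default_tz} during coercion.",
        UserWarning,
        stacklevel=2,
    )
    return series.dt.tz_localize(default_tz)


naive = pd.Series(pd.to_datetime(["2023-01-01 12:00", "2023-06-01 08:30"]))
aware = naive.dt.tz_localize("Europe/Paris")

coerced = coerce_to_tz_aware(naive)      # warns, result is UTC
unchanged = coerce_to_tz_aware(aware)    # no warning, stays Europe/Paris
```

Note that this localizes naive values (it attaches UTC without shifting the wall-clock time) rather than converting between zones, which is one possible reading of "defaulting to UTC".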
from pandera.
That seems acceptable to me. I think if incoming data is not tz-aware, then that's a reasonable approach so long as Pandera logs the warning and includes it in the documentation!
from pandera.
@cosmicBboy Hi, just following up here. Are we aligned on the feature? If so, what are the next steps? Thanks again for engaging with this, I think it would be helpful to many Pandera users.
from pandera.