Comments (9)
Long story short, the primitives are there, it'll just be some work before we can realize the vision of "one DataFrameModel to rule them all" 💍.
I see this question can be broken down into two sub-problems:

1. How to create a common `DataFrameModel` interface that can validate a suite of supported dataframe types.
2. How to create a common type system, so that a single set of data types can work across different dataframe types (while still using different `DataFrameModel` classes). This is what the pandera type system was designed for, but it will take some work to make the developer experience really nice.
> I don't want to create 2 pandera DataFrameModels for each type, that seems like a really bad practice.
Agreed, but you might be surprised how challenging this is to get right 🙃.
I'd say (2) is a little easier to tackle right now. Basically we'd need to add the library-agnostic data types as supported equivalent types in the `pyspark_engine` module. This is already somewhat supported in the `pandas_engine` module, but that also needs to be cleaned up.
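To make (2) a bit more concrete, here's a rough stdlib-only sketch of what an "equivalent types" mapping could look like. All names here (`EQUIVALENT_TYPES`, `resolve_dtype`, the engine keys) are hypothetical illustrations, not pandera's actual engine API, which is considerably more involved:

```python
# Hypothetical "equivalent types" registry: each library-agnostic dtype name
# maps to an engine-specific native representation.
EQUIVALENT_TYPES = {
    "int64": {"pandas_engine": "int64", "pyspark_engine": "LongType"},
    "str": {"pandas_engine": "object", "pyspark_engine": "StringType"},
}


def resolve_dtype(agnostic_name: str, engine: str) -> str:
    """Resolve a library-agnostic dtype to an engine-native dtype name."""
    try:
        return EQUIVALENT_TYPES[agnostic_name][engine]
    except KeyError:
        raise TypeError(
            f"{agnostic_name!r} has no equivalent registered for {engine!r}"
        )
```

With a registry along these lines, a schema could declare `int64` once and have each backend translate it to its own native type.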
As for (1), that's going to take more work. Basically we'd need to create generic `DataFrameSchema` and `DataFrameModel` interfaces that support both pandas and pyspark (and e.g. polars, etc.). This would require some pretty big internal changes to the way `DataFrameModel` works (I'm not happy with its current state and will need to overhaul it), and perhaps a move to Pythonic typing conventions like `Annotated[pd.DataFrame, DataFrameModel]` instead of `pandera.typing.pandas.DataFrame[DataFrameModel]`.
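For illustration, here's how the `Annotated` convention could work mechanically: the model class travels as metadata on the annotation, where a validation decorator could discover it via `get_type_hints`. This is a stdlib-only sketch; `UserModel` and the `dict` payload are stand-ins (not pandera APIs), with `dict` used in place of `pd.DataFrame` to keep it self-contained:

```python
from typing import Annotated, get_type_hints


class UserModel:
    """Stand-in for a pandera DataFrameModel (hypothetical)."""


def load_users() -> Annotated[dict, UserModel]:
    # a plain dict stands in for a pd.DataFrame here
    return {"state": ["CA"], "price": [10]}


# A validation decorator could introspect the return annotation and pull the
# model class out of the Annotated metadata:
hints = get_type_hints(load_users, include_extras=True)
model = hints["return"].__metadata__[0]
assert model is UserModel
```

The appeal of this convention is that the annotated type stays the real runtime type (`pd.DataFrame`), with the schema attached as metadata rather than wrapped in a custom generic.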
@NeerajMalhotra-QB @jaskaransinghsidana FYI.
I guess to kick this effort off, would you mind sharing some example code of what you're doing today, @lior5654?
from pandera.
Any comments on this?
Thanks for the detailed answer! I really appreciate it.
I'll share a minimal PoC soon, but another question arises - do you recommend other libraries to do this these days? Are you aware of any libraries that currently support such concepts? @cosmicBboy
@lior5654 as far as I know there are no other efforts to create a "unified dataframe model for schema validation"... pandera is the only such effort I know of :) Happy to learn about others if any community members know of any, but would love to work with you to figure out how we can achieve this vision with pandera.
The "one DataFrameModel to rule them all" really is the goal here, but for the longest time pandera only supported pandas-compliant APIs, e.g. modin, dask, pyspark.pandas. Recent support for `pyspark.sql` was the first experiment to see if pandera can really support other dataframe libraries (the answer is yes 😀). So as not to generalize too early, @NeerajMalhotra-QB and team and I decided to essentially duplicate some of the code when we built out the pyspark-sql-native support.
Now with the efforts to support polars and ibis, I think we're in a good position to generalize the API so we can have generic `DataFrameSchema` and `DataFrameModel` base classes, which can serve as the single entrypoint for validating dataframe-like objects, delegating to the appropriate backend and type system as needed.
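As a rough illustration of that delegation idea (all names hypothetical; pandera's real backend registry works differently), a generic schema base class could dispatch on the type of the incoming dataframe-like object:

```python
# Hypothetical sketch: a generic schema base class that delegates validation
# to whichever backend is registered for the incoming dataframe type.
class GenericSchema:
    columns: list = []
    _backends: dict = {}

    @classmethod
    def register_backend(cls, df_type, backend):
        cls._backends[df_type] = backend

    @classmethod
    def validate(cls, df):
        backend = cls._backends[type(df)]  # delegate by dataframe type
        return backend.validate(cls, df)


class DictBackend:
    """Toy backend that treats a dict of lists as a 'dataframe'."""

    @staticmethod
    def validate(schema, df):
        missing = set(schema.columns) - set(df)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return df


class MySchema(GenericSchema):
    columns = ["state", "city", "price"]


GenericSchema.register_backend(dict, DictBackend)
validated = MySchema.validate({"state": ["CA"], "city": ["SF"], "price": [10]})
```

The point of the sketch is that the schema definition stays backend-agnostic; only the registered backend knows how to inspect a particular dataframe type.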
Thanks again for the reply. Before I send code samples, I have another question.
Regarding modin, dask, and pyspark.pandas, which are all pandas-API-compliant: can I use the same `DataFrameModel` to validate them all? I see that pyspark.pandas has its own Series class in pandera.
But I think my intentions are pretty clear, here's a code sample:
import pyspark.pandas as ps
import pandas as pd
import pandera as pa

# should be GENERIC, for ALL pandas-compliant APIs
# in the future - also for non-pandas-compliant APIs (only when applicable)
# currently - this is for pandas itself only
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# validate a pd.DataFrame (this will work)
Schema.validate(pd.DataFrame({"state": ["CA"], "city": ["SF"], "price": [10]}))

# validate a ps.DataFrame (pyspark.pandas has its own Series type in
# pandera.typing.pyspark, so it seems you need to define the same class
# with a different Series)
# ...
I guess for a start, supporting all pandas-compliant APIs with a single class should be easy, right?
Oh I just tested, and it seems like the pandera pandas DataFrame model works seamlessly with pyspark.pandas API, great to know!
> Oh I just tested, and it seems like the pandera pandas DataFrame model works seamlessly with pyspark.pandas API, great to know!
Yep! The pyspark.pandas integration has been around for longer.
> supporting all pandas-compliant APIs with a single class should be easy, right?

Yes, this is possible today with the backend extensions plugin. This is currently done for dask, modin, and pyspark.pandas. The challenge is making `pyspark.sql` dataframe schemas use the same schema specification as the pandas one, just with a different backend.