
Comments (9)

cosmicBboy commented on May 31, 2024

Long story short, the primitives are there, it'll just be some work before we can realize the vision of "one DataFrameModel to rule them all" 💍.

I see this question can be broken down into two sub-problems:

  1. How to create a common DataFrameModel interface that can validate a suite of supported dataframe types
  2. How to create a common type system, so that a single set of data types can work across different dataframe types (while still having to use different DataFrameModel classes). This is what the pandera type system was designed for, but it will take some work to make the developer experience really nice.

I don't want to create 2 pandera DataFrameModels for each type, that seems like a really bad practice.

Agreed, but you might be surprised how challenging this is to get right 🙃.

I'd say (2) is a little easier to tackle right now. Basically we'd need to add the library-agnostic data types as supported equivalent types in the pyspark_engine module. This is already somewhat supported in the pandas_engine module but that also needs to be cleaned up.
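To make (2) a bit more concrete, here's a rough sketch using pandera's library-agnostic dtypes, which the pandas engine already resolves to numpy/pandas dtypes (exact dtype names may vary across pandera versions). The idea is for the pyspark engine to accept the same dtypes and map them to pyspark.sql.types equivalents:

import pandera as pa

# Library-agnostic pandera dtypes (pa.String, pa.Int64, ...). Today the pandas
# engine resolves these; registering equivalent types in pyspark_engine would
# let the same specification validate pyspark dataframes.
schema = pa.DataFrameSchema(
    {
        "state": pa.Column(pa.String),
        "city": pa.Column(pa.String),
        "price": pa.Column(pa.Int64, pa.Check.in_range(5, 20)),
    }
)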

As for (1), that's going to take more work, but basically we'd need to create a generic DataFrameSchema and DataFrameModel interface that supports both pandas and pyspark (and e.g. polars, etc). This would require some pretty big internal changes to the way that DataFrameModel works (I'm not happy about its current state and will need to overhaul it), and perhaps use Pythonic typing conventions like Annotated[pd.DataFrame, DataFrameModel] instead of pandera.typing.pandas.DataFrame[DataFrameModel].
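Purely to illustrate the direction, a hypothetical sketch of that Annotated convention (not something pandera implements today):

from typing import Annotated

import pandas as pd
import pandera as pa
from pandera.typing import Series


class PriceModel(pa.DataFrameModel):
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# Hypothetical: annotate a plain pd.DataFrame with the model, instead of
# pandera.typing.pandas.DataFrame[PriceModel].
def process(df: Annotated[pd.DataFrame, PriceModel]) -> Annotated[pd.DataFrame, PriceModel]:
    return df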

@NeerajMalhotra-QB @jaskaransinghsidana FYI.

I guess to kick this effort off, would you mind sharing some example code of what you're doing today @lior5654 ?


lior5654 commented on May 31, 2024

Any comments on this?


lior5654 commented on May 31, 2024

Thanks for the detailed answer! I really appreciate it.

I'll share a minimal PoC soon, but another question arises: would you recommend other libraries for this these days? Are you aware of any libraries that currently support such concepts? @cosmicBboy


cosmicBboy commented on May 31, 2024

@lior5654 as far as I know, there aren't other efforts to create a "unified dataframe model for schema validation"... pandera is the only such effort I'm aware of :) Happy to learn about others if any community members know of similar projects, but I'd love to work with you to figure out how we can achieve this vision with pandera.

The "one DataFrameModel to rule them all" really is the goal here, but for the longest time pandera only supported pandas-compliant APIs, e.g. modin, dask, pyspark.pandas. Recent support forpyspark-sql was the first experiment to see if pandera can really support other dataframe libraries (the answer is yes 😀). So as not to generalize too early @NeerajMalhotra-QB and team and I decided to essentially duplicate some of the code when we built out the pyspark-sql-native support.

Now, with the efforts to support polars and ibis, I think we're in a good position to generalize the API so we can have generic DataFrameSchema and DataFrameModel base classes, which can serve as the single entrypoint for validating dataframe-like objects, delegating to the appropriate backend and type system as needed.


lior5654 commented on May 31, 2024

Thanks again for the reply. Before I send code samples, I have another question.
Regarding modin, dask, and pyspark.pandas, which are all pandas-API compliant: can I use the same DataFrameModel to validate them all? I see that pyspark.pandas has its own Series class in pandera.

@cosmicBboy


lior5654 commented on May 31, 2024

But I think my intentions are pretty clear; here's a code sample:

import pyspark.pandas as ps
import pandas as pd
import pandera as pa

# should be GENERIC, for ALL pandas-compliant APIs
# in the future - also for non-pandas-compliant APIs (only where applicable)
# currently - this is for pandas itself only
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})

# validate a pd.DataFrame (this will work)
# validate a ps.DataFrame (this has its own Series type, so you need to define the same class with a different Series)
# ...

I guess for a start, supporting all pandas-compliant APIs with a single class should be easy, right?


lior5654 commented on May 31, 2024

Oh I just tested it, and it seems like the pandera pandas DataFrameModel works seamlessly with the pyspark.pandas API, great to know!


cosmicBboy commented on May 31, 2024

Oh I just tested it, and it seems like the pandera pandas DataFrameModel works seamlessly with the pyspark.pandas API, great to know!

Yep! The pyspark.pandas integration has been around for longer.
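Roughly what that looks like, using the Schema from your snippet (a minimal sketch, assuming a running Spark session):

import pandas as pd
import pyspark.pandas as ps

data = {
    "state": ["CA", "NY"],
    "city": ["San Francisco", "New York"],
    "price": [8, 12],
}

# The same pandas-based Schema validates both objects, since pyspark.pandas
# follows the pandas API and pandera has a pyspark.pandas integration.
Schema.validate(pd.DataFrame(data))
Schema.validate(ps.DataFrame(data))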


cosmicBboy commented on May 31, 2024

supporting all pandas-compliant APIs with a single class should be easy, right?

Yes, this is possible today with the backend extensions plugin. This is currently done for dask, modin, and pyspark.pandas. The challenge is making pyspark.sql dataframe schemas use the same schema specification as the pandas one, just with a different backend.
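For contrast, here's a sketch of what the separate pyspark.sql model looks like today (based on the pyspark integration; API details may differ by version). The generalization work is about letting the single pandas-style Schema above delegate to this backend instead of requiring a second class:

import pandera.pyspark as pa
import pyspark.sql.types as T


# Native pyspark.sql schemas currently use their own DataFrameModel and
# annotate fields with Spark types rather than pandas/pandera types.
class PysparkSchema(pa.DataFrameModel):
    state: T.StringType() = pa.Field()
    city: T.StringType() = pa.Field()
    price: T.IntegerType() = pa.Field(ge=5, le=20)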

