Comments (6)

kykyi commented on May 14, 2024

Hey @cosmicBboy, I came across this Stack Overflow answer of yours, which is related to this issue.

Would some kind of drop_invalid=<bool> kwarg on the Schema be within the scope of what pandera is meant for? If so, I'd be interested in opening a PR to give it a shot.

I'm using pandera to build "custom data frames" as an ORM of sorts that maps parquet files to pandas DataFrames. I've achieved the drop_invalid behaviour with a validate method on the base class that all the custom data frames inherit from; it accepts a drop_invalid_rows kwarg:

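import pandas as pd
import pandera as pa


class SchemaNotProvidedError(Exception):
    """Raised when a subclass does not define SCHEMA (definition assumed)."""


class BaseCustomDataFrame(pd.DataFrame):
    SCHEMA = None  # subclasses set this to a pandera schema

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if self.SCHEMA is None:
            raise SchemaNotProvidedError

        self.validate()

    def validate(self, drop_invalid_rows=False):
        if drop_invalid_rows:
            fail_index = []
            try:
                self.SCHEMA.validate(self)
            except pa.errors.SchemaError as ex:
                fail_index = ex.failure_cases["index"]

            # rebinding self only affects the returned frame, not the instance
            self = self[~self.index.isin(fail_index)]
        else:
            self.SCHEMA.validate(self)

        return self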

Moving this behaviour into the library would be useful: it would remove the try/except logic that others may also need to implement.

from pandera.

cosmicBboy commented on May 14, 2024

I think drop_invalid makes sense!

cosmicBboy commented on May 14, 2024

the current solution for this would be:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    invalid_data = err.data.loc[err.failure_cases["index"]]
    valid_data = err.data.loc[~err.data.index.isin(err.failure_cases["index"])]

mastersplinter commented on May 14, 2024

Giving users more info about failure cases would be great

I feel like it adds complexity compared to letting the user just use .failure_cases. If I encountered this scenario I'd probably just write what I needed in pandas, but the concept makes sense for SchemaErrors alone.

sebastian-heinz commented on May 14, 2024

I also want to partition the data into valid / invalid.
The use case is to continue working with the valid data instead of dropping everything, and to push the failed cases to another place for inspection.

However, the method above still does not work very well: I often get the index reported as None, which is a big problem for this.

Related: #1080

cosmicBboy commented on May 14, 2024

@sebastian-heinz basically, rows in the err.failure_cases dataframe where index is None mean that the error applies to the entire column.

You could ignore those rows with notna():

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    failure_cases = err.failure_cases
    # keep only row-level failures; a None index marks a column-level error
    failed_index = failure_cases.loc[failure_cases["index"].notna(), "index"]
    invalid_data = err.data.loc[err.data.index.isin(failed_index)]
    valid_data = err.data.loc[~err.data.index.isin(failed_index)]
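
To make the None-index case concrete without depending on a particular pandera version, the sketch below uses plain pandas. The failure_cases frame is hand-built to mimic the shape described in this thread (an "index" column where None marks a column-level error); its contents and the helper name partition_by_failures are hypothetical.

```python
import pandas as pd


def partition_by_failures(data, failure_cases):
    """Split data into (valid, invalid, column_level_failures).

    Rows of failure_cases with a missing "index" are treated as
    column-level errors and returned separately instead of being
    matched against row labels.
    """
    index_col = failure_cases["index"]
    column_level = failure_cases[index_col.isna()]
    failed_labels = index_col.dropna().unique()
    mask = data.index.isin(failed_labels)
    return data.loc[~mask], data.loc[mask], column_level


# Hypothetical data and hand-built failure cases for illustration.
data = pd.DataFrame({"a": [1, -1, 3], "b": ["x", "y", "z"]})
failure_cases = pd.DataFrame(
    {
        "column": ["a", "b"],
        "check": ["greater_than_or_equal_to(0)", "dtype('int64')"],
        "failure_case": [-1, "object"],
        "index": [1, None],  # None: the failure applies to all of column "b"
    }
)

valid, invalid, column_level = partition_by_failures(data, failure_cases)
```

Returning the column-level failures separately keeps them visible: a wrong dtype on a whole column is not tied to any one row, so silently discarding it could let bad data through as "valid".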
