Discuss the use case for partitioning a dataframe into valid and invalid portions.

should DataFrameSchema support partitioning dataframe into valid/invalid parts? (pandera, open)

unionai-oss commented on May 14, 2024

should DataFrameSchema support partitioning dataframe into valid/invalid parts?

from pandera.

Comments (6)

kykyi commented on May 14, 2024

Hey @cosmicBboy, I came across this SO answer of yours which is related to this issue.

Would some kind of drop_invalid=<bool> kwarg on the Schema be within the scope of what pandera is meant for? If so, I'd be interested in opening a PR to give it a shot.

I'm using pandera to build "custom data frames" as an ORM of sorts that maps parquet files to pandas dfs. I've achieved the drop_invalid behaviour with a validate method on the base class that all the custom data frames inherit from; it looks for this kwarg:

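import pandas as pd
import pandera as pa


class SchemaNotProvidedError(Exception):
    """Raised when a subclass does not define SCHEMA."""


class BaseCustomDataFrame(pd.DataFrame):
    SCHEMA = None  # subclasses set this to a pandera schema

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        if self.SCHEMA is None:
            raise SchemaNotProvidedError

        self.validate()

    def validate(self, drop_invalid_rows=False):
        if not drop_invalid_rows:
            self.SCHEMA.validate(self)
            return self

        try:
            # lazy=True collects every failure instead of stopping at the first one
            self.SCHEMA.validate(self, lazy=True)
        except pa.errors.SchemaErrors as ex:
            # entries with no row index are not tied to a single row, so skip them
            fail_index = ex.failure_cases["index"].dropna()
            # rebinding `self` would not affect the caller, so return the filtered frame
            return self[~self.index.isin(fail_index)]

        return self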

Moving this behaviour into the library would be useful and would remove try/except logic that others may otherwise need to implement themselves.

cosmicBboy commented on May 14, 2024

I think drop_invalid makes sense!

cosmicBboy commented on May 14, 2024

the current solution for this would be:

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    invalid_data = err.data.loc[err.failure_cases["index"]]
    valid_data = err.data.loc[~err.data.index.isin(err.failure_cases["index"])]

mastersplinter commented on May 14, 2024

Giving users more info about failure cases would be great

I feel like it adds complexity compared to letting the user just use .failure_cases. If I encountered this scenario I'd probably just write what I needed in pandas, but the concept makes sense for SchemaErrors alone.

sebastian-heinz commented on May 14, 2024

I also want to partition the data into valid / invalid.
The use case is to continue working with the valid data instead of dropping everything, and to push the failed cases to another place for inspection.

However, the above method still does not work very well: a lot of the time I get the index reported as None, which is a big problem for this.

Related: #1080

cosmicBboy commented on May 14, 2024

@sebastian-heinz basically, rows in the err.failure_cases dataframe where index is None mean that the error applies to the entire column.

You could ignore those rows with notna():

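try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # notna() is a mask over failure_cases, so apply it there before touching err.data
    has_row = err.failure_cases["index"].notna()
    is_invalid = err.data.index.isin(err.failure_cases.loc[has_row, "index"])
    invalid_data = err.data.loc[is_invalid]
    valid_data = err.data.loc[~is_invalid]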
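
To see the effect of the None entries without running a schema, the sketch below uses plain pandas with stand-in frames. Here data plays the role of err.data and failure_cases the role of err.failure_cases; the values are made up and the failure_cases columns are reduced to the ones needed, so treat it as an illustration of the filtering step only.

```python
import pandas as pd

# Stand-in for err.data (made-up values).
data = pd.DataFrame({"price": [10.0, -1.0, 5.0, -3.0]})

# Stand-in for err.failure_cases, reduced to three columns.
# The last entry has no row index, mimicking a column-level failure.
failure_cases = pd.DataFrame(
    {
        "column": ["price", "price", "qty"],
        "check": [
            "greater_than_or_equal_to(0)",
            "greater_than_or_equal_to(0)",
            "column_in_dataframe",
        ],
        "index": pd.Series([1, 3, None], dtype="object"),
    }
)

# Split the failures: a null index means the failure is not tied to one row.
is_row_level = failure_cases["index"].notna()
row_level = failure_cases[is_row_level]
column_level = failure_cases[~is_row_level]

# Partition the data using only the row-level failures.
is_invalid = data.index.isin(row_level["index"])
invalid_data = data.loc[is_invalid]
valid_data = data.loc[~is_invalid]

print(invalid_data)
print(valid_data)
print(column_level)
```

Keeping column_level around separately means whole-column problems can still be reported even though they do not mark any individual row as invalid.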
