Comments (6)

cosmicBboy commented on May 30, 2024

Can coerce have unintended consequences? For example, could it turn None into 'None' in a string column?

That shouldn't be the case... please feel free to test it out, if that does happen it's a bug :)
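
One way to test this is to compare the missing-value mask before and after a cast. The sketch below uses plain pandas rather than pandera, and `missing_preserved` is a hypothetical helper name, so it illustrates the check itself and makes no claim about what pandera's coercion does internally.

```python
import pandas as pd


def missing_preserved(before: pd.Series, after: pd.Series) -> bool:
    """Return True if the missing-value mask is identical before and after a cast."""
    return bool((before.isna() == after.isna()).all())


col_str = pd.Series(["one", None, None])

# Casting to the nullable "string" extension dtype keeps missing values missing.
as_string = col_str.astype("string")
print(missing_preserved(col_str, as_string))
```

The same helper could be pointed at the output of a validated dataframe to see whether any missing value was turned into text.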

from pandera.

cosmicBboy commented on May 30, 2024

Isn't this error expected? col_lst is not in the dataframe passed into the schema.

cosmicBboy commented on May 30, 2024

Also,

.convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")

is changing the data type of the float column (col_float ends up as Int64):

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": None},
            {"idx": 2, "col_float": None, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
)
print(df.dtypes)
idx                   Int64
col_float             Int64
col_str      string[python]
col_lst              object
dtype: object
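
The dtype change above can be reproduced with pandas alone. A minimal sketch, assuming a pandas version where `convert_dtypes` prefers integer extension dtypes for floats that can be cast to integers without loss; the variable names are illustrative:

```python
import pandas as pd

# Every non-missing value is a whole number, so convert_dtypes picks Int64.
whole = pd.DataFrame({"col_float": [1.0, None, None]}).convert_dtypes()

# A fractional value cannot be cast to an integer without loss,
# so this column is not converted to Int64.
fractional = pd.DataFrame({"col_float": [1.5, None, None]}).convert_dtypes()

print(whole.dtypes["col_float"], fractional.dtypes["col_float"])
```

This is why a schema that expects float64 can reject a column that held only whole-number floats before the conversion.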

janrito commented on May 30, 2024

Yes! Sorry, that was a bad copy-pasta.

Here is the error that I think I'm struggling with:

import numpy as np
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)


df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": ["1", "2", "3"]},
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)

# works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)

# also works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .pipe(DataFrame[DFModel])
)

...
/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
     31 """Collect schema error, raising exception if lazy is False.
     32
     33 :param reason_code: string representing reason for error.
     34 :param schema_error: ``SchemaError`` object.
     35 :param original_exc: original exception associated with the SchemaError.
     36 """
     37 if not self._lazy:
---> 38     raise schema_error from original_exc
     40 # delete data of validated object from SchemaError object to prevent
     41 # storing copies of the validated DataFrame/Series for every
     42 # SchemaError collected.
     43 del schema_error.data

SchemaError: expected series 'col_lst' to have type list[str], got float64

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
    .pipe(DataFrame[DFModel])
)

...

/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
     31 """Collect schema error, raising exception if lazy is False.
     32
     33 :param reason_code: string representing reason for error.
     34 :param schema_error: ``SchemaError`` object.
     35 :param original_exc: original exception associated with the SchemaError.
     36 """
     37 if not self._lazy:
---> 38     raise schema_error from original_exc
     40 # delete data of validated object from SchemaError object to prevent
     41 # storing copies of the validated DataFrame/Series for every
     42 # SchemaError collected.
     43 del schema_error.data

SchemaError: expected series 'col_float' to have type float64, got Int64

And it is related to the same type conversion that you've noticed.

None values pass validation, but NaNs don't.
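
The difference between the two kinds of missing value shows up in the dtype pandas infers, which can be checked without pandera. A small sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# A column holding only None is inferred as object dtype.
all_none = pd.Series([None, None])

# A column holding only NaN is inferred as float64.
all_nan = pd.Series([np.nan, np.nan])

# Casting back to object changes the dtype; the values are still NaN.
restored = all_nan.astype("object")

print(all_none.dtype, all_nan.dtype, restored.dtype)
```

A list column stored with object dtype can hold None next to lists, whereas a float64 column cannot hold lists at all, so a dtype check against list[str] fails on it.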

cosmicBboy commented on May 30, 2024

So there are two solutions:

  1. Set coerce=True with:
class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

    class Config:
        coerce=True

This will validate correctly.

  2. Convert col_lst to an object data type.
df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .astype({"col_lst": "object"})
    .pipe(DataFrame[DFModel])
)

The issue here is that, with coerce=False, pandera validates the data types as they are: whatever dtypes the columns of the incoming dataframe have are what pandera checks against the schema types.
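
The distinction can be sketched in plain pandas. `dtype_matches` below is a hypothetical helper written for illustration; it is not pandera's implementation, and it only mimics the idea of checking a dtype either as-is or after a cast.

```python
import pandas as pd


def dtype_matches(series: pd.Series, expected: str, coerce: bool = False) -> bool:
    """Compare a column's dtype name with an expected dtype name.

    With coerce=True the column is cast to the expected dtype first.
    """
    if coerce:
        series = series.astype(expected)
    return str(series.dtype) == expected


# An Int64 column, like col_float after convert_dtypes in the thread.
col_float = pd.Series([1, None], dtype="Int64")

# Checked as-is, the Int64 dtype does not match float64.
print(dtype_matches(col_float, "float64"))

# Cast first, the column becomes float64 and the check passes.
print(dtype_matches(col_float, "float64", coerce=True))
```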

janrito commented on May 30, 2024

Right, this makes sense: by this point, pandas has already forced col_lst into a float column in order to hold NaN.

Can coerce have unintended consequences? For example, could it turn None into 'None' in a string column?
