Comments (6)
Can `coerce` have unintended consequences? Turning None into the string `'None'` in a string column, for example?
That shouldn't be the case... please feel free to test it out, if that does happen it's a bug :)
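For what it's worth, the failure mode the question worries about does exist in plain pandas: a naive `astype(str)` stringifies None. Whether pandera's `coerce` goes through that path is exactly what's worth testing; this is only a sketch of the underlying pandas behavior, not a claim about pandera's coercion logic:

```python
import pandas as pd

# an object column holding a real string and a None
s = pd.Series(["one", None], dtype=object)

# naive string coercion in pandas stringifies missing values:
# None becomes the literal string 'None'
coerced = s.astype(str)
print(coerced.tolist())  # ['one', 'None']
```

So if a coercion implementation ever falls through to a bare `astype(str)`, the bug described above would appear.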
from pandera.
Isn't this error expected? `col_lst` is not in the dataframe passed into the schema.
Also, `.convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")` is changing the datatype of the float column:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": None},
            {"idx": 2, "col_float": None, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
)
print(df.dtypes)
idx Int64
col_float Int64
col_str string[python]
col_lst object
dtype: object
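The `Int64` for `col_float` comes from `convert_dtypes` itself, independent of pandera: when every non-missing value in a float column is integer-like (here the single `1.0`), pandas infers a nullable integer type. A minimal sketch:

```python
import pandas as pd

# float64 column where the only non-missing value is integer-like
s = pd.Series([1.0, None, None])
print(s.dtype)                   # float64

# convert_dtypes sees only whole numbers, so it infers nullable Int64
print(s.convert_dtypes().dtype)  # Int64
```

Adding a genuinely fractional value (e.g. `1.5`) keeps the column `Float64` instead.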
Yes! Sorry, that was a bad copy-pasta.
Here is the error that I think I'm struggling with:
import numpy as np
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Index, Series

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

df = (
    pd.DataFrame(
        [
            {"idx": 1, "col_float": 1.0, "col_str": "one", "col_lst": ["1", "2", "3"]},
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)
# works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .pipe(DataFrame[DFModel])
)
# also works!

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .pipe(DataFrame[DFModel])
)
...
/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
31 """Collect schema error, raising exception if lazy is False.
32
33 :param reason_code: string representing reason for error.
34 :param schema_error: ``SchemaError`` object.
35 :param original_exc: original exception associated with the SchemaError.
36 """
37 if not self._lazy:
---> 38 raise schema_error from original_exc
40 # delete data of validated object from SchemaError object to prevent
41 # storing copies of the validated DataFrame/Series for every
42 # SchemaError collected.
43 del schema_error.data
SchemaError: expected series 'col_lst' to have type list[str], got float64
df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .convert_dtypes(infer_objects=False, dtype_backend="numpy_nullable")
    .pipe(DataFrame[DFModel])
)
...
/site-packages/pandera/error_handlers.py:38, in SchemaErrorHandler.collect_error(self, reason_code, schema_error, original_exc)
31 """Collect schema error, raising exception if lazy is False.
32
33 :param reason_code: string representing reason for error.
34 :param schema_error: ``SchemaError`` object.
35 :param original_exc: original exception associated with the SchemaError.
36 """
37 if not self._lazy:
---> 38 raise schema_error from original_exc
40 # delete data of validated object from SchemaError object to prevent
41 # storing copies of the validated DataFrame/Series for every
42 # SchemaError collected.
43 del schema_error.data
SchemaError: expected series 'col_float' to have type float64, got Int64
And it is related to the same type conversion that you've noticed: Nones pass validation, but NaNs don't.
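The dtype difference is visible in plain pandas, and it is presumably what the schema check is reacting to: a column built from Nones stays `object`, which can hold lists, while NaN is a float value and drags the column to `float64`:

```python
import numpy as np
import pandas as pd

# Nones are kept as Python objects, so the column stays object dtype
print(pd.Series([None, None]).dtype)      # object

# NaN is a float, so the column becomes float64
print(pd.Series([np.nan, np.nan]).dtype)  # float64
```

Against a `Series[list[str]]` field, the `object` column passes the dtype check but `float64` cannot.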
So there are two solutions:

- Set `coerce=True`:

class DFModel(pa.DataFrameModel):
    idx: Index[int]
    col_float: Series[float] = pa.Field(nullable=True)
    col_str: Series[str] = pa.Field(nullable=True)
    col_lst: Series[list[str]] = pa.Field(nullable=True)

    class Config:
        coerce = True

This will validate correctly.

- Convert `col_lst` to an object data type:

df = (
    pd.DataFrame(
        [
            {"idx": 2, "col_float": 1.0, "col_str": None, "col_lst": None},
            {"idx": 3, "col_float": None, "col_str": None, "col_lst": None},
        ]
    )
    .fillna(np.nan)
    .astype({"col_lst": "object"})
    .pipe(DataFrame[DFModel])
)
The issue here is that, with `coerce=False`, pandera just validates the data types as-is: whatever dtypes the columns of the incoming dataframe have are what pandera checks against the schema types.
Right, this makes sense. By this point, pandas has forced `col_lst` into a float column in order to accept a NaN.