
pandera's People

Contributors

a-recknagel, andriig13, antonl, baskervilski, chr1st1ank, cosmicbboy, cristianmatache, dependabot[bot], derinwalters, ferhah, filipeo2-mck, fleimgruber, gordonhart, honno, kvnkho, kykyi, m1so, manel-ab, mastersplinter, mattb1989, nathanjmcdougall, ng-henry, nickcrews, plague006, ralbertazzi, robertcraigie, smackesey, tfwillems, the-matt-morris, tpvasconcelos


pandera's Issues

add strict=True to DataFrameSchema

If strict=True, all columns in the dataframe must have a corresponding Column in the DataFrameSchema. If strict=False, raise a UserWarning listing the columns in the dataframe that aren't being validated.
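
A sketch of the intended behavior, assuming the proposed strict keyword (not yet part of the API):

import pandas as pd
from pandera import Column, DataFrameSchema, PandasDtype

schema = DataFrameSchema({"a": Column(PandasDtype.Int)}, strict=True)

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# strict=True: validation fails because "b" has no corresponding Column;
# strict=False would instead emit a UserWarning flagging "b" as unvalidated
schema.validate(df)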

add support for dataframe schema transformations: add_column, remove_column

these should be methods that correspond with pandas dataframe operations.

For example, if the user adds a column to a dataframe, also support changing the corresponding schema to account for that change:

import pandas as pd
from pandera import Column, DataFrameSchema, PandasDtype

df = pd.DataFrame({"a": [1, 2, 3]})

schema = DataFrameSchema([Column("a", PandasDtype.Int)])
df = schema.validate(df)

# add a column to the dataframe
df["b"] = ["x", "y", "z"]

# add the corresponding column to the dataframe schema
schema = schema.add_column(Column("b", PandasDtype.String))
df = schema.validate(df)

# or reflect changes to an existing column
df["a"] = df["a"].astype(float)
schema = schema.change_column(Column("a", PandasDtype.Float))
df = schema.validate(df)

# same with removing columns
df = df.drop("a", axis=1)
schema = schema.remove_column("a")
df = schema.validate(df)

modularize pandera code

what started as a small ~200 LOC project is now... larger.

should modularize schemas, checks, hypotheses, errors, and decorators into their own modules.
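
One possible layout (hypothetical, for discussion):

pandera/
    __init__.py
    schemas.py      # DataFrameSchema, SeriesSchema, Column, Index
    checks.py       # Check
    hypotheses.py   # Hypothesis
    errors.py       # SchemaError and other exceptions
    decorators.py   # check_input, check_output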

support DataFrameSchema transformations

This feature supports the use case where I want to declare schema transformations as methods on DataFrameSchema, so that I can express the schema changes I expect at each step of a pipeline of dataframe transformations.

Schema Transformations

import pandas as pd
from pandera import Check, Column, DataFrameSchema, Int

schema1 = DataFrameSchema({
    "col1": Column(Int, Check(lambda s: s >= 0)),
    "col2": Column(Int, Check(lambda s: s >= 0)),
})

schema2 = schema1.add_columns({
    "col3": Column(Int, Check(lambda s: s >= 0))
})

schema3 = schema2.remove_columns(["col1"])

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1, 2, 3],
})
df = schema1.validate(df)
df["col3"] = df["col1"] + df["col2"]
df = schema2.validate(df)
del df["col1"]
df = schema3.validate(df)

The add_columns and remove_columns methods should return a deep copy of the dataframe schema so that they don't mutate the original schema.columns dict.
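
A minimal sketch of those copy semantics, assuming columns is a plain dict attribute on the schema (method bodies are illustrative, not the actual implementation):

import copy

def add_columns(self, extra_columns):
    """Return a new schema with extra_columns added, leaving the original untouched."""
    new_schema = copy.deepcopy(self)
    new_schema.columns.update(extra_columns)
    return new_schema

def remove_columns(self, column_names):
    """Return a new schema without the named columns."""
    new_schema = copy.deepcopy(self)
    for name in column_names:
        del new_schema.columns[name]
    return new_schema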

Edit: move SchemaPipeline to its own issue #162

update readme on release of next version

suggest replacing

**Supports:** python 2.7, 3.5, 3.6

with:

[![PyPI pyversions](https://img.shields.io/pypi/pyversions/pandera.svg)](https://pypi.python.org/pypi/pandera/)

when the new metadata is available on pypi.

The badge will show 'missing' until PyPI is updated.

make Check element_wise=False the default

Since we're working with pandas, vectorized checks should be the default setting, encouraging users to take advantage of the performance gains of vectorized operations.
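
For example, a vectorized check receives the whole Series in a single call, while an element-wise check is applied to each value individually:

from pandera import Check

# vectorized (the proposed default): the function receives the whole Series
# and returns a boolean Series in one call
vectorized_check = Check(lambda s: s > 0)

# element-wise: the function is called once per element, typically much slower
elementwise_check = Check(lambda x: x > 0, element_wise=True)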

discussion: should pandera get a logo?

Should Pandera get a logo?

Many public projects have logos to help with recognisability.

There's a free logo generator on https://hatchful.shopify.com/

To make the logos below, I used the options: Get Started -> Services -> Reliable -> pandera+pandas schema validation -> Online store or website

These are some of the better ones:
Option 1: [image]

Option 2: [image]

Option 3: [image]

ModuleNotFoundError: No module named 'pandera'

Hi Niels,

Awesome library! I pip-installed it, and tried to use it, but for some reason the library can't be found.
I get the same error whether I open python from the command line or open a Jupyter notebook. (I also tried opening a new Terminal window).

I'm using Miniconda3 with pandera I believe installed under:
/miniconda3/lib/python3.6/site-packages (0.0.3)

However, all I can find there is this folder:
pandera-0.0.3.dist-info

...with these files:

METADATA
RECORD
top_level.txt
WHEEL

Could it be that the install doesn't work properly on Miniconda?

Let me know if I need to provide more information.

dict-based definition of schema

Express the schema as a dictionary:

DataFrameSchema({
    "column1": Column(Int, Validator)
})

For MultiIndex columns, use a tuple as the key:

DataFrameSchema({
    ("c1_level0", "c1_level1"): Column(Int, Validator, nullable=True)
})

add debug mode

The user enables "debug" mode (maybe via an environment variable) in order to drop into the scope of the schema validator where the error occurs.
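
One possible shape for this (purely illustrative; the PANDERA_DEBUG variable name and the wrapper are hypothetical):

import os
import pdb
import sys

def debug_validate(schema, df):
    """Validate df, dropping into the debugger on failure when the
    (hypothetical) PANDERA_DEBUG environment variable is set."""
    try:
        return schema.validate(df)
    except Exception:
        if os.environ.get("PANDERA_DEBUG"):
            # inspect the validator's frames at the point of failure
            pdb.post_mortem(sys.exc_info()[2])
        raise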

implement Hypothesis validator

this is a special class of multi-column validator that performs a hypothesis test on the dataset.

For two-sample hypothesis tests, the minimum requirement is that a groupby argument is specified.

For a one-sample hypothesis test, if the groupby argument is supplied, then the groups argument must be a string or a one-element list specifying the group that you want to test. If groupby is None, then the hypothesis test applies to the entire column Series.

If we want to statistically assert that the average weight for men is higher than women, then we can do something like:

DataFrameSchema({
    "weight": Column(
        Float,
        checks=[
            Hypothesis(
                test="one_sided_two_sample_t_test", groupby="sex",
                groups=["men", "women"], relationship="gt",
                raise_warning=True)
        ]
    )
})

# raise_warning=True raises a warning instead of an exception, for cases
# where a failed hypothesis test shouldn't block the runtime

Need to figure out a suite of use cases (e.g. confidence interval assertions, etc.)

should DataFrameSchema support partitioning dataframe into valid/invalid parts?

Discuss the use case for partitioning a dataframe into valid and invalid portions.

Akash Gupta, in a comment on the pandera blogpost (https://disqus.com/by/disqus_jiG9N3PPd8), expressed interest in separating the dataframe into valid and invalid portions.

Not exactly sure what use case he had in mind, but wondering if this might be a good idea.

The error message of an invalid dataframe contains some information about which indices were invalid, but perhaps debugging can be made easier by providing additional data in the SchemaError raised when calling schema.validate(df).

One proposal would be to extend SchemaError to include a failure_cases attribute and expose it to the user:

class SchemaError(Exception):
    def __init__(self, message, failure_cases):
        super(SchemaError, self).__init__(message)
        # some TBD data structure containing invalid cases. Maybe by `Column` + `Index`?
        self.failure_cases = failure_cases

Then the user can catch these errors in their code:

schema = DataFrameSchema(...)
df = ...

try:
    schema.validate(df)
except SchemaError as e:
    # suppose that `failure_cases` is a dict mapping Column names to a list of indexes in
    # dataframe that didn't pass a particular `Check`, where each element
    # in the list corresponds to the `Check`.
    idx = e.failure_cases["column_name"][0]
    # access specific failure cases in the dataframe
    df.loc[idx, "column_name"]

Why does SeriesSchema not accept a series with name?

I'm having trouble with this section of SeriesSchemaBase:
https://github.com/pandera-dev/pandera/blob/ca5b39d329b3572d2889dd52a41ea97da6ef1534/pandera/pandera.py#L748-L751

It raises an error if I give it a series with a name. Here is a minimal example:

>>> import pandera
>>> import pandas as pd
>>> sample_df = pd.DataFrame({
...     "int_col": [1, 2, 3],
...     "float_col": [1.1, 2.5, 9.9]
... })
... 
... series = pd.Series(sample_df['int_col'])  # Necessary here: name=None
... schema = pandera.SeriesSchema(
...     pandas_dtype='int', checks=pandera.Check(lambda s: s > 0)
... )
... schema.validate(series)
Traceback (most recent call last):
  File "<input>", line 10, in <module>
  File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 839, in validate
    if super(SeriesSchema, self).__call__(series):
  File "C:\Users\ckrudewi\PycharmProjects\pandera\pandera\pandera.py", line 751, in __call__
    (type(self), self._name, series.name))
pandera.pandera.SchemaError: Expected <class 'pandera.pandera.SeriesSchema'> to have name 'None', found 'int_col'

I can try to suggest a change as a PR, but first I'd like to ask whether there is a particular reason for this behaviour?

add dataframe-level checks: a list of checks that have access to the entire dataframe

This feature enables dataframe-level checks that assert properties about multiple columns, for example, that the ratio of two columns is <, >, etc.

This should be expressed as a list of checks supplied to the DataFrameSchema constructor:

DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda df: (df["col1"] / df["col2"]) > 1),
        Check(lambda df: (df["col1"] * df["col3"]) < 1),
    ]
)

Note a subtlety here: if we supply element_wise=True, then the function signature should apply to a row of the dataframe, as if we were doing the following:

for _, row in dataframe.iterrows():
    (row["col1"] / row["col2"]) > 1

In pandera, the above schema would look like:

DataFrameSchema(
    columns={...},
    checks=[
        Check(lambda r: (r["col1"] / r["col2"]) > 1, element_wise=True),
        Check(lambda r: (r["col1"] * r["col3"]) < 1, element_wise=True),
    ]
)

for two sample hypothesis tests, standardize API

For a more intuitive API, two-sample Hypothesis test class method definitions should look like this:

def two_sample_hypothesis_test(
        cls, groupby, group1, group2, relationship, alpha=0.01,
        equal_var=True, nan_policy="propagate"):
    ...

which is more intuitive than specifying a list of groups that may or may not have two elements in it.
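
Usage might then look like this (a sketch mirroring the signature above, with the relationship value from the earlier example):

Hypothesis.two_sample_hypothesis_test(
    groupby="sex", group1="men", group2="women",
    relationship="gt", alpha=0.01,
    equal_var=True, nan_policy="propagate")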

Add built-in Checks for common operations

pandera offers a lot of flexibility in terms of the kinds of assertions you can make about a dataframe. However, it would be useful to have built-in checks for common operations. This issue is a proposal to define what those common operations are.

These should all use vectorized pandas operations; see the sketch after the lists below.

For dtypes that support pandas comparison operators >, >=, <, <=, ==, !=

  • greater than scalar x
  • greater than or equal to scalar x
  • equal to scalar x
  • not equal to scalar x
  • less than scalar x
  • less than or equal to scalar x
  • is in range x - y (inclusive or exclusive)
  • is sorted, ascending or descending

For dtypes that support checks for set membership

  • is in list-like/set s
  • is not in list-like/set s

For datetime dtypes

  • check for date format "<some_format>"
  • (comparison operators and range checks also apply here)

For string dtype

  • column matches "regex" pattern
  • no trailing whitespace
  • no leading whitespace
  • column contains "string"
  • column starts with "string"
  • column ends with "string"
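
Several of these could be thin factory functions over vectorized pandas operations. A sketch, assuming the existing Check API (the factory names greater_than, isin, and str_matches are hypothetical):

from pandera import Check

def greater_than(x):
    # s > x is evaluated on the whole Series at once
    return Check(lambda s: s > x)

def isin(values):
    return Check(lambda s: s.isin(values))

def str_matches(pattern):
    # pandas' .str.match applies the regex to each string element
    return Check(lambda s: s.str.match(pattern))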

PyPI docs logo not loading

Not sure why, but I get the below when looking at pypi. Have tried a few times and get the same error.

The image address on PyPI is odd: https://warehouse-camo.cmh1.psfhosted.org/8a54bccff4f1c84259a30f7d18afae06ce64b607/68747470733a2f2f6769746875622e636f6d2f636f736d696342626f792f70616e646572612f626c6f622f6d61737465722f646f63732f736f757263652f5f7374617469632f70616e646572612d62616e6e65722e737667

And it doesn't match the GitHub address for the same image, which I'd expect to be used.

I'll investigate properly at some point but time constrained at the moment.


add support for MultiIndex index/columns

the API for MultiIndex columns could be something like:

DataFrameSchema({
    ("col1_levela", "col1_levelb"): Column(...)
})

The API for MultiIndex indexes could be something like:

MultiIndex(
    Index(Int, ...),
    Index(String, ...),
)

drop python 2.7 support

Since Python 2.7 will reach end of life on January 1, 2020, we should also stop maintaining support for it.

pandera v0.1.3 will be released on June 9th, 2019.

We should plan to follow up with a v0.2.0 version bump that deprecates Python 2.7 support, after resolving a few more issues.

update dataframe schema API

Make the interface more natural to pandas users:

schema = DataFrameSchema({
    "column1": Int,
    "column2": Float,
    "column3": String
})

No need for the Column class in this case; note that Int, Float, and String are now just validators.

Or provide a list of validators

schema = DataFrameSchema({
    "column1": [Int, Nullable, Assert(lambda x: x > 1, element_wise=True)],
    ...
})

The Assert signature should be:

Assert(*callables, element_wise=True)

Make callables for each data type, e.g. Int, Float, etc., and also for Nullable.

The Validator interface should stay the same.

Schema definitions in yaml

I suggest allowing schemas to be passed as yaml files. That way it wouldn't be necessary to hardcode all the checks when using pandera; instead, they would be defined in the yaml schema.
There are two use cases I see:

  • Validating dataframes in a CI/CD pipeline. There it would be possible to have a validation step that could be re-configured without changing the actual Python code: the same code could check multiple different dataframes against their expected schemas, just with different yaml files.
  • Pandera could offer a (simple) command line tool which reads data in some supported formats, such as pickle or json, and checks it directly against a schema specified in a file.

The yaml format needs to be designed thoroughly in this case to offer optimal flexibility. I could imagine something like this:

YAML schema definition:

# General section for dataframe wide checks
dataframe:
  - min_length: 1000
# Checks per column
columns:
  column1:
    # List of checks, each one is a dictionary
    # this allows parametrization
    - type: int
    - max: 10
    - allow_null: False
  column2:
    - type: float
    - max: -1.2
  column3:
    - type: str
    - match: "^value_"
    # Allow custom functions (here with arguments)
    - custom_function: split_shape
      split_char: "_"
      expected_splits: 2      

Python code:

def split_shape(s, split_char, expected_splits):
    """Custom check function"""
    return s.str.split(split_char, expand=True).shape[1] == expected_splits

schema = DataFrameSchema.from_yaml(
    path="path_to_yaml",
    custom_functions=[split_shape],
)

validated_df = schema.validate(df)

As we probably don't want arbitrary Python code to be executed from the yaml file via the !!python syntax, I suggest we go with a mix of built-in checks and the option to add user-defined functions, as in the example above.

Infer dataframe schema?

This is an idea I want to put out for discussion.

Together with a yaml schema format (#91), one could introduce schema inference to create a "draft schema" automatically. Currently it seems like a lot of manual work to thoroughly specify a schema for a dataframe with many features. With two methods, infer_schema(df) and to_yaml(), the work could be a little easier, because the schema would then only need some additional fine-tuning.

TensorFlow data validation offers such functionality, for example. However, their implementation also shows the complexity of such a feature; it seems to be a two-step approach:

  1. Create statistics
  2. Infer the schema

But maybe pandera doesn't need to offer the same flexibility and could go with a simpler infer_schema function. What do you think?
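
A very rough sketch of what a simpler infer_schema could look like (the function is hypothetical; real inference would also need to handle nullability, checks, etc.):

import pandas as pd
from pandera import Column, DataFrameSchema

def infer_schema(df):
    """Build a draft schema from the observed dtypes of df."""
    return DataFrameSchema({
        col: Column(str(dtype)) for col, dtype in df.dtypes.items()
    })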

Wrapping a check_input causes index out of range

The following example:

from pandera import DataFrameSchema, Int, DateTime, String, Check, Column, Float, Bool, check_output, check_input
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "column1": [10.0, 20.0, 30.0],
    "column2": [1, 2, 3],
})

schema = DataFrameSchema({
    "column1": Column(Float),
    "column2": Column(Int),
})

@check_input(schema, 1)
def original_function(some_arg, df):
    return df

def wrapper(function_as_parameter, positional_arguments):
    df = function_as_parameter(positional_arguments)
    return df

def function_that_calls_original_function_via_wrapper(df):
    new_df = wrapper(
        function_as_parameter=original_function,
        positional_arguments=df)

function_that_calls_original_function_via_wrapper(df)

Results in this list index out of range error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-100-da3315e2c30b> in <module>
----> 1 function_that_calls_original_function_via_wrapper(df)

<ipython-input-99-0075cf7550ed> in function_that_calls_original_function_via_wrapper(df)
      2     new_df = wrapper(
      3             function_as_parameter=original_function,
----> 4             positional_arguments=df)

<ipython-input-98-c6fbf50f8087> in wrapper(function_as_parameter, positional_arguments)
      1 def wrapper(function_as_parameter,positional_arguments):
----> 2     df = function_as_parameter(positional_arguments)
      3     return df

/pandera_dev/pandera/pandera.py in _wrapper(fn, instance, args, kwargs)
    766                     zip(arg_spec_args, args))
    767                 args_dict[obj_getter] = schema.validate(args_dict[obj_getter])
--> 768                 args = list(args_dict.values())
    769         elif obj_getter is None:
    770             try:

IndexError: list index out of range

Dataframe null check doesn't provide df name if error occurs

Describe the bug

If a dataframe contains null values in a column, the dataframe name is not included when the errors in the series are reported. When multiple checks are being run in the same script, this makes debugging difficult, as only the column name is returned.

To Reproduce

import pandas as pd
import numpy as np
from pandera import Column, DataFrameSchema, PandasDtype, Check

schema = DataFrameSchema(
    {
        "a": Column(PandasDtype.Int, Check(lambda x: x > 0))
    })

df = pd.DataFrame({
    "a": [1, 2, np.nan]
})

schema.validate(df)

This results in an error:

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

(full error report included at the bottom of this issue).

I think this is the expected behaviour, but it makes debugging very difficult for a user. Whilst I have raised a PR which enhances the SeriesSchemaBase class error reporting by including the series/column name, I'm unsure how best to also return the dataframe name so that a user can debug.

Expected behavior:
The name of the dataframe and the column should be returned to the user to enable fast debugging of why the error occurs.

e.g. for the above code, an error like this would be very helpful:

SchemaError: in dataframe 'df', column 'a' expected to have type int64, got float64 and expected to be non-nullable but contains null values: {2: nan}

Python version: 3.6
Pandera version: 0.12

Full error message:

SchemaError                               Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    402             try:
--> 403                 if s(data):
    404                     return data

/panderadev/pandera/pandera.py in __call__(self, df)
    340                 "need to `set_name` of column before calling it.")
--> 341         return super(Column, self).__call__(df[self._name])
    342 

/panderadev/pandera/pandera.py in __call__(self, series)
    226                         "expected series '%s' to have type %s, got %s and non-nullable series contains null values: %s" %
--> 227                         (series.name, self._pandas_dtype.value, series.dtype, series[nulls].head(N_FAILURE_CASES).to_dict()))
    228                 else:

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

During handling of the above exception, another exception occurred:

SchemaError                               Traceback (most recent call last)
/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    392             try:
--> 393                 return s.validate(data)
    394             except SchemaError as x:

/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    121         for s in [self._schema(s, error=self._error, ignore_extra_keys=self._ignore_extra_keys) for s in self._args]:
--> 122             data = s.validate(data)
    123         return data

/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    405             except SchemaError as x:
--> 406                 raise SchemaError([None] + x.autos, [e] + x.errors)
    407             except BaseException as x:

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

During handling of the above exception, another exception occurred:

SchemaError                               Traceback (most recent call last)
<ipython-input-29-dc85ea979a2b> in <module>
----> 1 schema.validate(df)

/panderadev/pandera/pandera.py in validate(self, dataframe)
    176         if not isinstance(dataframe, pd.DataFrame):
    177             raise TypeError("expected dataframe, got %s" % type(dataframe))
--> 178         return self.schema.validate(dataframe)
    179 
    180 

/mycondaenv/lib/python3.6/site-packages/schema.py in validate(self, data)
    393                 return s.validate(data)
    394             except SchemaError as x:
--> 395                 raise SchemaError([None] + x.autos, [e] + x.errors)
    396             except BaseException as x:
    397                 message = "%r.validate(%r) raised %r" % (s, data, x)

SchemaError: expected series 'a' to have type int64, got float64 and non-nullable series contains null values: {2: nan}

make coerce logic compatible with nullable columns

currently, if coerce=True, the validation logic first coerces the column to the specified dtype, then applies the column checks.

This causes an issue where coercing to type str turns None into the string 'None', thus invalidating the nullable logic.
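
For reference, the pandas behavior that causes this:

import pandas as pd

s = pd.Series(["a", None])
print(s.astype(str).tolist())  # ['a', 'None'] -- the null became the string 'None'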

Add CompareColumns subclass of Check for dataframe-level multi-column checks

The CompareColumns subclass of Check is designed to work nicely with dataframe-level checks (#14).

CompareColumns class

This class enables built-in comparisons of two columns.

The proposed API for this would be something like:

DataFrameSchema(
    columns={...},
    checks=[
        Compare("col1").greater_than("col2"),
        Compare("col2").less_than_equal("col3"),
    ]
)

This will be an experimental API.
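
A rough sketch of how Compare could desugar into dataframe-level checks (illustrative only; the class and method names follow the proposal above):

from pandera import Check

class Compare:
    def __init__(self, column):
        self.column = column

    def greater_than(self, other):
        # returns a dataframe-level Check comparing the two columns
        return Check(lambda df: df[self.column] > df[other])

    def less_than_equal(self, other):
        return Check(lambda df: df[self.column] <= df[other])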

add coerce option to DataFrameSchema

this should coerce all columns to their specified dtypes before validation.

This should raise a warning when nullable Int columns are defined; these should just default to Float.
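
For reference, the pandas behavior motivating the warning:

import pandas as pd

# an integer column with a missing value is silently upcast to float
print(pd.Series([1, 2, None]).dtype)  # float64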
