Describe the bug Calling schema.example

Hypothesis examples are all the same (pandera issue, open)

tmcclintock commented on May 30, 2024

Hypothesis examples are all the same

Comments (7)

cosmicBboy commented on May 30, 2024

Okay, so it seems like generating smaller dataframes yields higher entropy results:

print(schema.example(size=5))

# generates different datasets
               column1  column2        column3 column4
0                  152        1   9.007199e+15     BBB
1  9223372036854775807        1   1.192093e-07     CCC
2  4148323564460896226       56   6.189641e+16     BBB
3                  123       83   6.103516e-05     CCC
4                32240        2  1.112537e-308     BBB

print(schema.example(size=10))

# we see this consistently
   column1  column2  column3 column4
0    31078        1      0.0     AAA
1        0        1      0.0     AAA
2        0        1      0.0     AAA
3        0        1      0.0     AAA
4        0        1      0.0     AAA
5        0        1      0.0     AAA
6        0        1      0.0     AAA
7        0        1      0.0     AAA
8        0        1      0.0     AAA
9        0        1      0.0     AAA

@tmcclintock my recommendations would be:

Generate a bunch of smaller dataframes and concat them; a size of about 5 rows seems to be the magic number.
Restrict your schemas to have only one check per column (this is pretty unreasonable though).
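
The first recommendation could be sketched roughly as follows. This is a minimal illustration, not part of pandera: `example_in_chunks` is a hypothetical helper name, and the default chunk size of 5 is only the value observed above. It assumes nothing more than that `schema.example(size=n)` returns a pandas DataFrame, as in the snippets in this thread.

```python
import pandas as pd


def example_in_chunks(schema, total_rows, chunk_size=5):
    """Build one larger example by concatenating several small ones.

    `schema` is any object with an `example(size=...)` method that returns
    a pandas DataFrame (assumed here to be a pandera DataFrameSchema).
    """
    if total_rows <= 0 or chunk_size <= 0:
        raise ValueError("total_rows and chunk_size must be positive")

    chunks = []
    remaining = total_rows
    while remaining > 0:
        # Never ask for more rows than are still needed.
        n = min(chunk_size, remaining)
        chunks.append(schema.example(size=n))
        remaining -= n

    # ignore_index=True gives the result a fresh 0..total_rows-1 index.
    return pd.concat(chunks, ignore_index=True)
```

With the schema from this thread, `example_in_chunks(schema, total_rows=20)` would call `schema.example(size=5)` four times and stack the results. Whether the combined frame is diverse enough still depends on how each small example is drawn.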

@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.
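
To make the "combine the statistics into a single element strategy" idea concrete, here is a small sketch of the merging step only. It is an assumption-laden illustration, not pandera's internals: `merge_range_checks` is a hypothetical function, and the `(name, statistics)` tuples are a plain-data stand-in for a column's checks.

```python
import math


def merge_range_checks(checks):
    """Fold range-style checks into a single (min_value, max_value) pair.

    `checks` is a list of (name, statistics) tuples, e.g.
    ("ge", {"min_value": 1}) or ("le", {"max_value": 100}).
    A bound of None means "unbounded on that side".
    """
    lo, hi = -math.inf, math.inf
    for name, stats in checks:
        if name == "ge":
            lo = max(lo, stats["min_value"])
        elif name == "le":
            hi = min(hi, stats["max_value"])
        elif name == "in_range":
            lo = max(lo, stats["min_value"])
            hi = min(hi, stats["max_value"])
        else:
            raise ValueError(f"unsupported check for this sketch: {name!r}")

    if lo > hi:
        raise ValueError("checks are contradictory: no value satisfies them all")

    return (None if lo == -math.inf else lo, None if hi == math.inf else hi)
```

The merged bounds could then be handed to one bounded strategy, for example `hypothesis.strategies.integers(min_value=lo, max_value=hi)`, so that no value has to be generated and then rejected by a filter.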

cosmicBboy commented on May 30, 2024

Looks like this is an issue with the way pandera strategies try to chain together multiple checks. For example, a schema that uses a single in_range check in place of separate ge and le checks:

from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

produces

6.100.1 0.0.0+dev0
                column1  column2        column3 column4
0                     0        1   3.402823e+38     AAA
1                     0        1   2.882304e+16     CCC
2                     0        6   2.000010e+00     BBB
3                   247       47   9.999900e-01     BBB
4                 19526       50  1.390036e+164     AAA
5                 56223       63  2.225074e-308     AAA
6                    42       15   7.357397e+15     BBB
7                    97       62   9.999900e-01     CCC
8                     0       69   3.293796e+09     AAA
9   9216616637413720064        4   1.000000e+07     AAA
10    23090105669335094       14   5.397605e-78     CCC
11                    0       50   1.192093e-07     CCC
12           1260840409       98   1.500000e+00     AAA
13                21966       68   1.100000e+00     AAA
14                23289       21   3.333333e-01     CCC
15   912854047966763290       27   6.519203e+16     BBB
16  8876389219764502267        9  5.706631e-178     CCC
17                40004       40   1.500000e+00     CCC
18                  247       77   5.742309e+16     BBB
19                47285       17   1.175494e-38     AAA

Zac-HD commented on May 30, 2024

Do you see more-diverse outputs if you actually run the test? A strategy's .example() method is often biased toward simpler values (for complicated internal reasons), and dataframes are typically 'sparse' as well, so you might get a fill value and then few or no other values.
Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect that there's a simpler fix for this specific issue somewhere in Pandera.
