
Comments (7)

cosmicBboy commented on May 30, 2024

Okay, so it seems that generating smaller dataframes yields higher-entropy (more varied) results:

print(schema.example(size=5))

# generates different datasets
               column1  column2        column3 column4
0                  152        1   9.007199e+15     BBB
1  9223372036854775807        1   1.192093e-07     CCC
2  4148323564460896226       56   6.189641e+16     BBB
3                  123       83   6.103516e-05     CCC
4                32240        2  1.112537e-308     BBB
print(schema.example(size=10))

# we see this consistently
   column1  column2  column3 column4
0    31078        1      0.0     AAA
1        0        1      0.0     AAA
2        0        1      0.0     AAA
3        0        1      0.0     AAA
4        0        1      0.0     AAA
5        0        1      0.0     AAA
6        0        1      0.0     AAA
7        0        1      0.0     AAA
8        0        1      0.0     AAA
9        0        1      0.0     AAA

@tmcclintock my recommendations would be:

  • generate a bunch of smaller dataframes and concat them; a size of about 5 rows per dataframe seems to be the magic number.
  • restrict your schemas to only one check per column (this is pretty unreasonable though).
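
The first workaround could look roughly like the sketch below. It is not part of pandera: chunk_sizes and example_in_chunks are hypothetical helper names, and the default chunk size of 5 is only the value that happened to work in the experiment above. It assumes schema is a pandera DataFrameSchema (or any object with an .example(size=...) method that returns a pandas DataFrame).

```python
def chunk_sizes(total_rows, chunk_size=5):
    """Split total_rows into chunk lengths of at most chunk_size."""
    if total_rows < 1 or chunk_size < 1:
        raise ValueError("total_rows and chunk_size must both be >= 1")
    full, rest = divmod(total_rows, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def example_in_chunks(schema, total_rows, chunk_size=5):
    """Build one large example by concatenating several small ones.

    Hypothetical helper: `schema` is assumed to expose
    `.example(size=...)`, as pandera's DataFrameSchema does.
    """
    # Imported here so chunk_sizes() stays usable without pandas installed.
    import pandas as pd

    parts = [schema.example(size=n) for n in chunk_sizes(total_rows, chunk_size)]
    # ignore_index=True renumbers the rows 0..total_rows-1 across the chunks.
    return pd.concat(parts, ignore_index=True)
```

With the schema from this thread, example_in_chunks(schema, total_rows=20) would call schema.example four times with size=5 and stack the results.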

@Zac-HD any ideas on how to address this? On the pandera side, it would make sense to collect all the schema statistics and combine them all into a single element strategy so we don't have to rely on filter, but that'll require a larger refactoring project.
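
To make the "combine the statistics" idea concrete, here is a minimal sketch in plain Python. It is not pandera's implementation: merge_bounds and the (name, value) tuples are hypothetical stand-ins for whatever statistics the checks expose. The point is only that ge(1) followed by le(100) can collapse into one pair of bounds, which could then be handed to a single strategy such as hypothesis.strategies.integers(min_value=lo, max_value=hi) with no filter step.

```python
import math


def merge_bounds(checks):
    """Collapse ge/le/in_range-style checks into one inclusive (lo, hi) pair.

    `checks` is an iterable of (name, value) tuples, a hypothetical
    stand-in for per-check statistics.
    """
    lo, hi = -math.inf, math.inf
    for name, value in checks:
        if name == "ge":
            lo = max(lo, value)
        elif name == "le":
            hi = min(hi, value)
        elif name == "in_range":
            low, high = value
            lo, hi = max(lo, low), min(hi, high)
        else:
            raise ValueError(f"unsupported check: {name!r}")
    if lo > hi:
        raise ValueError(f"checks are unsatisfiable: {lo} > {hi}")
    return lo, hi


lo, hi = merge_bounds([("ge", 1), ("le", 100)])
print(lo, hi)  # 1 100
```

Checks that cannot be expressed as bounds (arbitrary lambdas, for example) would still need a filter, so this only covers the simple numeric case.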

from pandera.

cosmicBboy commented on May 30, 2024

Looks like this is an issue with the way pandera strategies try to chain together multiple checks. For example, replacing the separate ge and le checks on column2 with a single in_range check:

from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema(
    {
        "column1": Column(int, Check.ge(0)),
        "column2": Column(int, [Check.in_range(1, 100)]),  # 👈 use a single in_range check instead of ge and le
        "column3": Column(float, Check.ge(0)),
        "column4": Column(str, Check.isin(["AAA", "BBB", "CCC"])),
    }
)

produces

6.100.1 0.0.0+dev0
                column1  column2        column3 column4
0                     0        1   3.402823e+38     AAA
1                     0        1   2.882304e+16     CCC
2                     0        6   2.000010e+00     BBB
3                   247       47   9.999900e-01     BBB
4                 19526       50  1.390036e+164     AAA
5                 56223       63  2.225074e-308     AAA
6                    42       15   7.357397e+15     BBB
7                    97       62   9.999900e-01     CCC
8                     0       69   3.293796e+09     AAA
9   9216616637413720064        4   1.000000e+07     AAA
10    23090105669335094       14   5.397605e-78     CCC
11                    0       50   1.192093e-07     CCC
12           1260840409       98   1.500000e+00     AAA
13                21966       68   1.100000e+00     AAA
14                23289       21   3.333333e-01     CCC
15   912854047966763290       27   6.519203e+16     BBB
16  8876389219764502267        9  5.706631e-178     CCC
17                40004       40   1.500000e+00     CCC
18                  247       77   5.742309e+16     BBB
19                47285       17   1.175494e-38     AAA


Zac-HD commented on May 30, 2024

  1. Do you see more diverse outputs if you actually run the test? A strategy's .example() method is often biased toward simpler values (for complicated internal reasons), and dataframes are typically 'sparse' as well, so you might get a fill value and then few or no other values.
  2. Eventually you're going to have to do that project, yeah. The filter-rewriting should be able to handle this case though, so I suspect that there's a simpler fix for this specific issue somewhere in Pandera.

