Comments (5)

filipeo2-mck avatar filipeo2-mck commented on May 30, 2024

@NeerajMalhotra-QB

from pandera.

filipeo2-mck avatar filipeo2-mck commented on May 30, 2024

I believe the two solutions are complementary, since every infrastructure is different. In some scenarios caching will be better; in others, checkpoints. I would like to hear more opinions from the Spark masters :)

NeerajMalhotra-QB avatar NeerajMalhotra-QB commented on May 30, 2024

Acknowledging the resource-intensive nature of data validations, I concur that caching could be an ideal solution.

However, before implementing this within pandera, I recommend conducting performance tests on a suitable cluster, as personal laptops might not provide accurate performance insights.

It's crucial to consider that caching could potentially create bottlenecks, particularly on single-node machines, so our solution should address this potential issue as well.

The post didn't specify the dataframe size; I assume it wasn't substantial, since the test ran on a local laptop. For real-world applications, it would be beneficial to consider optimized file storage formats (e.g., parquet).

If the dataframe was loaded from a plain CSV, could you please rerun the same test using a parquet file? This approach will allow for a more accurate comparison and evaluation.

kasperjanehag avatar kasperjanehag commented on May 30, 2024

Hey @filipeo2-mck. Your analysis and the proposed solutions offer a great starting point. The re-computation of the DAG for each validation does add overhead, and your proposals to use caching or checkpointing to mitigate it are valid.

  • Caching is indeed a quick win, but as you’ve pointed out, it comes with the risk of exhausting memory. However, it could be beneficial in scenarios where the dataframe fits comfortably in memory or when running on a larger cluster.
  • Checkpointing to disk is not as fast as caching and requires clean-up, but it guarantees that intermediate results persist, potentially reducing the processing time.
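
To make the re-computation point concrete, here is a minimal pure-Python sketch. It is a toy simulation, not Spark or pandera code: `LazyPipeline`, `run_checks`, and the checks themselves are hypothetical names made up for illustration. The `cache()` and `unpersist()` methods only mimic the idea behind PySpark's `DataFrame.cache()` and `DataFrame.unpersist()`.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not a Spark API)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._use_cache = False
        self._materialized = None
        self.compute_count = 0  # how many times the "DAG" was executed

    def cache(self):
        # Ask the pipeline to keep the result of the next action in memory.
        self._use_cache = True
        return self

    def unpersist(self):
        # Drop the stored result so the memory can be reclaimed.
        self._use_cache = False
        self._materialized = None
        return self

    def collect(self):
        # An "action": reuse the stored result if there is one, else recompute.
        if self._use_cache and self._materialized is not None:
            return self._materialized
        self.compute_count += 1
        result = [r * 2 for r in self._rows]  # pretend this is expensive
        if self._use_cache:
            self._materialized = result
        return result


def run_checks(pipeline, checks):
    """Each check triggers its own action, the way each validation does."""
    return [check(pipeline.collect()) for check in checks]


checks = [
    lambda rows: all(r >= 0 for r in rows),
    lambda rows: len(rows) > 0,
    lambda rows: all(isinstance(r, int) for r in rows),
]

uncached = LazyPipeline(range(5))
uncached_results = run_checks(uncached, checks)

cached = LazyPipeline(range(5)).cache()
cached_results = run_checks(cached, checks)
cached.unpersist()  # clean up once validation is done

print(uncached.compute_count, cached.compute_count)
```

Without caching, the pipeline is recomputed once per check; with caching it runs once and the stored result is reused, at the cost of holding that result in memory until `unpersist()` is called.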

Finally, if we add these arguments to the validate method so users can tailor the validation process, should we also provide a way to control it with an environment variable override? I.e., maybe you want to limit pandera's caching behaviour without changing the application code?
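
A rough sketch of how such an override could be resolved, assuming the environment variable wins over the argument passed in code. The variable name `EXAMPLE_VALIDATION_CACHE` and the function `resolve_cache_setting` are hypothetical and only illustrate the precedence; they are not part of pandera.

```python
import os

# Hypothetical variable name, used only for this illustration.
ENV_VAR = "EXAMPLE_VALIDATION_CACHE"

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def resolve_cache_setting(cache_arg=None, environ=None):
    """Decide whether to cache: env var first, then the argument, then off."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is not None:
        value = raw.strip().lower()
        if value in _TRUE:
            return True
        if value in _FALSE:
            return False
        raise ValueError(f"{ENV_VAR} must be a boolean-like value, got {raw!r}")
    if cache_arg is not None:
        return bool(cache_arg)
    return False  # assumed default: caching disabled


# The application asks for caching, but the operator disables it via the environment.
print(resolve_cache_setting(cache_arg=True, environ={ENV_VAR: "0"}))
```

Letting the environment take precedence means an operator can switch caching off on a memory-constrained node without touching the application code; the opposite precedence would be just as easy to implement if the argument should have the final say.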

filipeo2-mck avatar filipeo2-mck commented on May 30, 2024

Hi!

@NeerajMalhotra-QB, the current setup uses only parquet files, and I have only tested it locally 👍

@kasperjanehag, agreed, env vars will give the user enough flexibility.
