Comments (5)
from pandera.
I believe both solutions are complementary, since every available infrastructure is different. In some scenarios caching will be better; in others, checkpointing. I would like to hear more opinions from Spark masters :)
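In `pyspark.sql.DataFrame` terms, the two proposals could be sketched roughly as below. The helper names and the `checks` list of callables are illustrative only, not pandera's actual API; the `cache()`, `count()`, `unpersist()`, and `checkpoint()` calls are real Spark DataFrame methods.

```python
# Sketch: materialize a Spark DataFrame once before running many checks,
# so the DAG is not recomputed for every validation.
# `df` is assumed to be a pyspark.sql.DataFrame; only its methods are
# called here, so no pyspark import is needed in this sketch.

def validate_with_cache(df, checks):
    """Cache in memory: fast, but can exhaust memory on small clusters."""
    df = df.cache()
    try:
        df.count()  # force materialization so every check reuses cached data
        return [check(df) for check in checks]
    finally:
        df.unpersist()  # release cached blocks once validation is done

def validate_with_checkpoint(df, checks):
    """Checkpoint to disk: slower than caching, but truncates the lineage.

    Assumes spark.sparkContext.setCheckpointDir(...) was called beforehand;
    the checkpoint directory must also be cleaned up afterwards.
    """
    df = df.checkpoint()  # writes to the checkpoint dir and cuts the DAG
    return [check(df) for check in checks]
```

Because the functions only rely on the DataFrame interface, the same shape would work for either strategy behind a single flag.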
Acknowledging the resource-intensive nature of data validations, I concur that caching could be an ideal solution.
However, before implementing this within pandera, I recommend conducting performance tests on a suitable cluster, as personal laptops might not provide accurate performance insights.
It's crucial to consider that caching could potentially create bottlenecks, particularly on single-node machines, so our solution should address this potential issue as well.
While the post didn't specify the dataframe size, assuming it wasn't substantial due to testing on a local laptop, it would be beneficial to consider leveraging optimized file storage formats (e.g., parquet) for real-world applications.
If the dataframe was loaded from a plain CSV, could you please rerun the same test using a parquet file? This approach will allow for a more accurate comparison and evaluation.
Hey @filipeo2-mck . Your analysis and the proposed solutions offer a great starting point. The re-computation of the DAG for each validation indeed adds overhead, and your proposals to utilize caching or checkpointing to mitigate this are quite valid.
- Caching is indeed a quick win, but as you’ve pointed out, it comes with the risk of exhausting memory. However, it could be beneficial in scenarios where the dataframe fits comfortably in memory or when running on a larger cluster.
- Checkpointing to disk is not as fast as caching and requires clean-up, but it guarantees that intermediate results persist, potentially reducing overall processing time.
Finally, if we add these arguments to the validate method, allowing users to tailor the validation process, should we also provide a way to control this behavior with an environment variable override? E.g. maybe you want to limit pandera's caching without changing the application code?
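The override idea could look something like the following sketch. The variable name `PANDERA_CACHE_STRATEGY` and the `resolve_cache_strategy` helper are hypothetical, not an existing pandera setting; the point is only that the environment wins over whatever the application code requested.

```python
# Hypothetical sketch: let an environment variable override the
# cache/checkpoint argument that application code passed to validate().
# PANDERA_CACHE_STRATEGY is an assumed name, not a real pandera setting.
import os

_ALLOWED = {"none", "cache", "checkpoint"}

def resolve_cache_strategy(requested: str = "none") -> str:
    """Return the effective strategy, letting the environment win.

    This lets an operator disable caching on a memory-constrained
    cluster without changing or redeploying the application code.
    """
    override = os.environ.get("PANDERA_CACHE_STRATEGY", "").strip().lower()
    if override in _ALLOWED:
        return override
    return requested
```

An unset or unrecognized variable falls back to whatever the code asked for, so the override is strictly opt-in.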
Hi!
@NeerajMalhotra-QB , the current setup uses parquet files only, and I tested it locally 👍
@kasperjanehag , agree, env vars will give enough flexibility to the user.
Related Issues (20)
- Joint uniqueness unsatisfiable for data synthesis
- An element_wise Check on a datetime column of an empty DataFrame fails since 0.14
- str_length check raises a DispatchError exception if min_value or max_value are not set
- Schema inference from Dask dataframe or series
- check-type decorators do not work in parallel
- Pandera example generation seems to be much slower than building dataframes with lists
- Date type not exported
- Add built-in checks for Polars schemas support
- Keeping track of Polars DataTypes for Polars schemas support
- Series[list[TypedDict] fails in Python 3.11 but not in Python 3.12
- feature(pandas): Support string column validation for pandas 2.1.3
- Use DataFrameModel for validating multiindex columns
- DataFrameSchema <NA> column_ordered
- Implement polars LazyFrame backend and core checks
- Support conversion from DataFrameModel to PySpark StructType
- DataFrameSchema lazy check_column_values_are_unique failing if columns are not present
- jupyter-server and jupyter-events version conflicts when trying to do install -r dev/requirements-3.11.txt
- Cannot create a pydantic model with a `pandera.typing.pyspark.DataFrame` type.
- Performance issue with nulls in pandas dataframe with multi-index validation