Comments (5)
from pandera.
I believe both solutions are complementary, since every available infrastructure is different. In some scenarios caching will be better; in others, checkpointing. I would like to hear more opinions from Spark masters :)
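In `pyspark.sql.DataFrame` terms, the two proposals could be sketched roughly as below. The helper names and the `checks` list of callables are illustrative only, not pandera's actual API; the `cache()`, `count()`, `unpersist()`, and `checkpoint()` calls are real Spark DataFrame methods.

```python
# Sketch: materialize a Spark DataFrame once before running many checks,
# so the DAG is not recomputed for every validation.
# `df` is assumed to be a pyspark.sql.DataFrame; only its methods are
# called here, so no pyspark import is needed in this sketch.

def validate_with_cache(df, checks):
    """Cache in memory: fast, but can exhaust memory on small clusters."""
    df = df.cache()
    try:
        df.count()  # force materialization so every check reuses cached data
        return [check(df) for check in checks]
    finally:
        df.unpersist()  # release cached blocks once validation is done

def validate_with_checkpoint(df, checks):
    """Checkpoint to disk: slower than caching, but truncates the lineage.

    Assumes spark.sparkContext.setCheckpointDir(...) was called beforehand;
    the checkpoint directory must also be cleaned up afterwards.
    """
    df = df.checkpoint()  # writes to the checkpoint dir and cuts the DAG
    return [check(df) for check in checks]
```

Because the functions only rely on the DataFrame interface, the same shape would work for either strategy behind a single flag.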
Acknowledging the resource-intensive nature of data validations, I concur that caching could be an ideal solution.
However, before implementing this within pandera, I recommend conducting performance tests on a suitable cluster, as personal laptops might not provide accurate performance insights.
It's crucial to consider that caching could potentially create bottlenecks, particularly on single-node machines, so our solution should address this potential issue as well.
While the post didn't specify the dataframe size, assuming it wasn't substantial due to testing on a local laptop, it would be beneficial to consider leveraging optimized file storage formats (e.g., parquet) for real-world applications.
If the dataframe was loaded from a plain CSV, could you please rerun the same test using a parquet file? This approach will allow for a more accurate comparison and evaluation.
Hey @filipeo2-mck . Your analysis and the proposed solutions offer a great starting point. The re-computation of the DAG for each validation indeed adds overhead, and your proposals to utilize caching or checkpointing to mitigate this are quite valid.
- Caching is indeed a quick win, but as you’ve pointed out, it comes with the risk of exhausting memory. However, it could be beneficial in scenarios where the dataframe fits comfortably in memory or when running on a larger cluster.
- Checkpointing to disk is not as fast as caching and requires clean-up, but it guarantees that intermediate results persist, potentially reducing overall processing time.
Finally, if we add these arguments to the validate method, allowing users to tailor the validation process, should we also provide a way to control this behavior with an environment variable override? E.g. maybe you want to limit pandera's caching without changing the application code?
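The override idea could look something like the following sketch. The variable name `PANDERA_CACHE_STRATEGY` and the `resolve_cache_strategy` helper are hypothetical, not an existing pandera setting; the point is only that the environment wins over whatever the application code requested.

```python
# Hypothetical sketch: let an environment variable override the
# cache/checkpoint argument that application code passed to validate().
# PANDERA_CACHE_STRATEGY is an assumed name, not a real pandera setting.
import os

_ALLOWED = {"none", "cache", "checkpoint"}

def resolve_cache_strategy(requested: str = "none") -> str:
    """Return the effective strategy, letting the environment win.

    This lets an operator disable caching on a memory-constrained
    cluster without changing or redeploying the application code.
    """
    override = os.environ.get("PANDERA_CACHE_STRATEGY", "").strip().lower()
    if override in _ALLOWED:
        return override
    return requested
```

An unset or unrecognized variable falls back to whatever the code asked for, so the override is strictly opt-in.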
Hi!
@NeerajMalhotra-QB , the current setup uses parquet files only, and I tested it locally 👍
@kasperjanehag , agree, env vars will give enough flexibility to the user.
Related Issues (20)
- Joint uniqueness unsatisfiable for data synthesis
- An element_wise Check on a datetime column of an empty DataFrame fails since 0.14
- str_length check raises a DispatchError exception if min_value or max_value are not set
- Schema inference from Dask dataframe or series
- check-type decorators do not work in parallel
- Pandera example generation seems to be much slower than building dataframes with lists
- Date type not exported
- Add built-in checks for Polars schemas support
- Keeping track of Polars DataTypes for Polars schemas support
- Series[list[TypedDict] fails in Python 3.11 but not in Python 3.12
- feature(pandas): Support string column validation for pandas 2.1.3
- Use DataFrameModel for validating multiindex columns
- DataFrameSchema <NA> column_ordered
- Implement polars LazyFrame backend and core checks
- Support conversion from DataFrameModel to PySpark StructType
- DataFrameSchema lazy check_column_values_are_unique failing if columns are not present
- jupyter-server and jupyter-events version conflicts when trying to do install -r dev/requirements-3.11.txt
- Cannot create a pydantic model with a `pandera.typing.pyspark.DataFrame` type.
- Performance issue with nulls in pandas dataframe with multi-index validation