Comments (4)
Hi @SuleymanKaya, have you looked into the Fugue or Spark implementations, perhaps together with a distributed framework like Spark or Dask? Those are the 2 options, and both were designed specifically for datasets that are too large to fit into memory.
Hi @fdosani, thank you for your response.
I appreciate the suggestion to use distributed computing frameworks like Spark or Fugue. These are indeed powerful tools for handling large datasets. However, my use case involves working with DataComPy specifically, and I am interested in enhancing its functionality to better handle chunked data.
While distributed computing frameworks are a solution for large datasets, they might not always be the most practical or accessible solution for all users or all use cases. The feature I proposed would allow users to handle large datasets directly within DataComPy, without the need to set up and use a distributed computing framework.
I understand that this would be a significant addition to DataComPy. I believe this feature would make DataComPy more versatile and user-friendly, especially for those working with large datasets.
Thank you for considering this proposal.
@ak-gupta @NikhilJArora Any thoughts/opinions on this?
@SuleymanKaya I've been thinking about this request, and I'm a bit torn. Right now we provide the functionality you are looking for via Spark and Fugue (which can provide Dask and Spark compatibility, among others). I don't want to add unnecessary complexity to the code base.
One of the issues with chunking locally is going to be pre-sorting: if the common join items are in different chunks, you will hit issues with the report being inaccurate. If your dataset doesn't fit into memory, then you should either bump up your compute to have more memory (EC2, for instance) or just use a distributed computing framework like Spark.
Unfortunately, I don't think I'm open to considering this proposal at this point in time, because it would literally mean overhauling the entire package: it would break all the functionality which currently exists and introduce complexity which is already solved via Spark (for example). Sorry for that. I'm going to close out this issue for now. Feel free to add your thoughts if you have additional comments.