Hello DataComPy team, I am currently working with large datasets tha

Feature Request: Ability to Update Compare Object Over Multiple Chunks (datacompy issue, closed, 4 comments)

SuleymanKaya commented on September 25, 2024

Feature Request: Ability to Update Compare Object Over Multiple Chunks

from datacompy.

Comments (4)

fdosani commented on September 25, 2024

Hi @SuleymanKaya, have you looked into the Fugue or Spark implementation, perhaps together with a distributed framework like Spark or Dask? There are 2 options, the Spark implementation and the Fugue implementation.

These were designed specifically for datasets which are too large to fit into memory.

SuleymanKaya commented on September 25, 2024

Hi @fdosani, thank you for your response.

I appreciate the suggestion to use distributed computing frameworks like Spark or Fugue. These are indeed powerful tools for handling large datasets. However, my use case involves working with DataComPy specifically, and I am interested in enhancing its functionality to better handle chunked data.

While distributed computing frameworks are a solution for large datasets, they might not always be the most practical or accessible solution for all users or all use cases. The feature I proposed would allow users to handle large datasets directly within DataComPy, without the need to set up and use a distributed computing framework.

I understand that this would be a significant addition to DataComPy. I believe this feature would make DataComPy more versatile and user-friendly, especially for those working with large datasets.

Thank you for considering this proposal.

fdosani commented on September 25, 2024

@ak-gupta @NikhilJArora Any thoughts/opinions on this?

fdosani commented on September 25, 2024

@SuleymanKaya I've been thinking about this request, and I'm a bit torn. Right now we provide the functionality you are looking for via Spark and Fugue (which provides compatibility with Dask, Spark, etc.). I don't want to add unnecessary complexity to the code base.

One of the issues with chunking locally is going to be pre-sorting, because if the common join items are in different chunks you will hit issues with the report being inaccurate. If your dataset doesn't fit into memory, then you should either bump up your compute to have more memory (EC2, for instance) or just use a distributed computing framework like Spark.
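To illustrate the pre-sorting issue, here is a minimal sketch in plain Python. It is not DataComPy code, and every function name in it is hypothetical. It compares two row lists that hold the same rows in a different order: comparing chunks by position reports false mismatches, while routing rows into partitions by a hash of the join key (an assumed workaround, not something the package offers) keeps matching keys together.

```python
import zlib
from collections import defaultdict
from itertools import zip_longest


def chunks(rows, size):
    """Split a list of rows into consecutive chunks of `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def unmatched_keys(left, right, key="id"):
    """Return the join keys that appear in only one of the two row lists."""
    return {r[key] for r in left} ^ {r[key] for r in right}


def naive_chunked_unmatched(left, right, size, key="id"):
    """Compare chunk i of `left` against chunk i of `right` (by position only)."""
    result = set()
    for lchunk, rchunk in zip_longest(chunks(left, size), chunks(right, size), fillvalue=[]):
        result |= unmatched_keys(lchunk, rchunk, key)
    return result


def partition_by_key(rows, n_parts, key="id"):
    """Route each row to a partition chosen by a stable hash of its join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[zlib.crc32(str(row[key]).encode()) % n_parts].append(row)
    return parts


def partitioned_unmatched(left, right, n_parts, key="id"):
    """Compare partition p of `left` against partition p of `right`."""
    lparts = partition_by_key(left, n_parts, key)
    rparts = partition_by_key(right, n_parts, key)
    result = set()
    for p in range(n_parts):
        result |= unmatched_keys(lparts[p], rparts[p], key)
    return result


# Same six rows on both sides, but the right side is in reverse order.
left = [{"id": i, "value": i * 10} for i in range(1, 7)]
right = [{"id": i, "value": i * 10} for i in range(6, 0, -1)]

full = unmatched_keys(left, right)                       # whole-dataset comparison
naive = naive_chunked_unmatched(left, right, size=2)     # position-based chunks
parted = partitioned_unmatched(left, right, n_parts=3)   # key-based partitions

print("full:", full)
print("naive chunks:", naive)
print("key partitions:", parted)
```

The whole-dataset and key-partitioned comparisons agree that nothing is unmatched, while the position-based chunks flag ids 1, 2, 5 and 6 even though they exist on both sides. Note that the partitioning step still needs a full pass over both datasets, and each partition has to fit in memory.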

Unfortunately, I don't think I'm open to considering this proposal at this point in time, because it would literally mean overhauling the entire package, and it would break all the functionality which currently exists and introduce complexity which is already solved via Spark (for example). Sorry for that. I'm going to close out this issue for now. Feel free to add your thoughts if you have additional comments.
