Comments (4)

fdosani commented on June 16, 2024

Hi @SuleymanKaya, have you looked into the Fugue or Spark implementations, which let you run the comparison on a distributed framework like Spark or Dask? There are 2 options: the Spark implementation and the Fugue implementation.

These were designed specifically for datasets which are too large to fit into memory.

from datacompy.

SuleymanKaya commented on June 16, 2024

Hi @fdosani, thank you for your response.

I appreciate the suggestion to use distributed computing frameworks like Spark or Fugue. These are indeed powerful tools for handling large datasets. However, my use case involves working with DataComPy specifically, and I am interested in enhancing its functionality to better handle chunked data.

While distributed computing frameworks are one solution for large datasets, they might not always be the most practical or accessible option for every user or use case. The feature I proposed would allow users to handle large datasets directly within DataComPy, without the need to set up and use a distributed computing framework.

I understand that this would be a significant addition to DataComPy. I believe this feature would make DataComPy more versatile and user-friendly, especially for those working with large datasets.

Thank you for considering this proposal.

fdosani commented on June 16, 2024

@ak-gupta @NikhilJArora Any thoughts/opinions on this?

fdosani commented on June 16, 2024

@SuleymanKaya I've been thinking about this request, and I'm a bit torn. Right now we provide the functionality you are looking for via Spark and Fugue (which provides compatibility with Dask, Spark, etc.). I don't want to add unnecessary complexity to the code base.

One of the issues with chunking locally is going to be pre-sorting, because if the common join items are in different chunks you will hit issues with the report being inaccurate. If your dataset doesn't fit into memory, then you should either bump up your compute to have more memory (EC2, for instance) or just use a distributed computing framework like Spark.
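
To make the pre-sorting concern concrete, here is a minimal pure-Python sketch. It is not DataComPy code: the helpers `chunks` and `unmatched_keys`, the `id` join column, and the sample rows are all made up for illustration. It compares two identical sets of rows that are stored in a different order, first all at once, then chunk by chunk, then chunk by chunk after sorting on the join key.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows (hypothetical helper)."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch


def unmatched_keys(left, right):
    """Return join keys that appear on only one side (hypothetical helper)."""
    left_keys = {row["id"] for row in left}
    right_keys = {row["id"] for row in right}
    return left_keys ^ right_keys


# The same four records on both sides, stored in a different order.
left = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"},
        {"id": 3, "v": "c"}, {"id": 4, "v": "d"}]
right = [{"id": 3, "v": "c"}, {"id": 4, "v": "d"},
         {"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

# Comparing everything at once: every key is found on both sides.
full_result = unmatched_keys(left, right)

# Naive chunking: chunk i of `left` is compared only with chunk i of `right`,
# so keys that sit in different chunks are wrongly reported as unmatched.
naive_result = set()
for left_chunk, right_chunk in zip(chunks(left, 2), chunks(right, 2)):
    naive_result |= unmatched_keys(left_chunk, right_chunk)


def by_id(row):
    return row["id"]


# Sorting both sides on the join key first lines the chunks up again.
# Note: sorted() runs in memory here; data that really does not fit in
# memory would need an external sort, and chunks only line up this neatly
# when both sides contain the same keys.
sorted_result = set()
for left_chunk, right_chunk in zip(chunks(sorted(left, key=by_id), 2),
                                   chunks(sorted(right, key=by_id), 2)):
    sorted_result |= unmatched_keys(left_chunk, right_chunk)

print(full_result, naive_result, sorted_result)
```

In this toy case the naive chunked comparison flags all four keys as unmatched even though the two sides are identical, which is the kind of inaccurate report described above. Sorting fixes this example, but once one side has extra or missing keys the chunk boundaries drift apart again, so a general solution needs a key-aware merge rather than positional chunks.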

Unfortunately, I don't think I'm open to considering this proposal at this point in time, because it would literally mean overhauling the entire package. It would break all the functionality which currently exists and introduce complexity for a problem that is already solved via Spark (for example). Sorry for that. I'm going to close out this issue for now. Feel free to add any additional thoughts or comments.
