
Comments (5)

kaushikcfd commented on July 17, 2024

Questions that need to be answered:

  1. What should we use for distributed execution?
    • My vote is for mpi4py, proven to be quite stable over the years.
  2. Should we make the distributed package a hard dep?
    • My vote is "No"
  3. How should the current implementation in feinsum be changed?
    • Option 1. We could rewrite feinsum.tuning.OpentunerTuner to follow a server-client model, where only one rank writes to the database and generates the inputs for the search-space exploration. I am slightly worried about scalability, but in the common case evaluating a point in the search space costs us ~10 seconds, so maybe this is not a big problem.
    • Option 2. Each rank runs its own search-space exploration and, after each run, broadcasts its timing result so the other ranks can add it as an extra seed configuration. I'm not sure whether opentuner allows us to seed configurations during a run, or what the database update costs are when multiple processes perform the updates.
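
A minimal sketch of Option 2, assuming an mpi4py-style communicator (only `Get_rank` and the lowercase, pickle-based `allgather` are used). `propose` and `evaluate` are hypothetical stand-ins for the search strategy and the kernel timing, not feinsum or opentuner APIs, and `SerialComm` is a made-up single-process substitute so the sketch runs without MPI.

```python
from typing import Any, Callable, Dict, Hashable, List


class SerialComm:
    """Hypothetical single-process stand-in for mpi4py's MPI.COMM_WORLD."""

    def Get_rank(self) -> int:
        return 0

    def allgather(self, obj: Any) -> List[Any]:
        # With one process, gathering from every rank yields only our own object.
        return [obj]


def explore(comm,
            propose: Callable[[int, Dict[Hashable, float]], Hashable],
            evaluate: Callable[[Hashable], float],
            n_iters: int) -> Dict[Hashable, float]:
    """Each rank proposes and times its own points; after every evaluation,
    all ranks exchange their latest (point, timing) pair and add the other
    ranks' results to their local seed set."""
    seeds: Dict[Hashable, float] = {}  # point -> timing observed on any rank
    for _ in range(n_iters):
        point = propose(comm.Get_rank(), seeds)
        timing = evaluate(point)
        # Collective call: every rank must reach it the same number of times.
        for p, t in comm.allgather((point, timing)):
            seeds[p] = t
    return seeds


if __name__ == "__main__":
    seeds = explore(SerialComm(),
                    propose=lambda rank, seen: len(seen) + rank,  # toy: distinct ints per rank
                    evaluate=lambda p: (p - 3) ** 2,              # toy cost, minimum at 3
                    n_iters=5)
    print(min(seeds, key=seeds.get))
```

Because `allgather` is collective, the ranks advance in lockstep, so one slow evaluation stalls all of them. Under MPI one would pass `MPI.COMM_WORLD` instead of `SerialComm()` and launch with `mpiexec`.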

from feinsum.

nchristensen commented on July 17, 2024

I have looked at the mpi4py task pool, mpipool, schwimmbad, and the charm4py task pools. IMO the mpi4py task pool makes the middle two fairly redundant. Between charm4py and mpi4py, mpi4py is easier to build and install, but certain pool executors break or hang on Spectrum MPI. The charm4py pools seemed more stable, but charm4py takes more effort to build. Most of the basic task execution functionality is similar between the two so it wouldn't be too hard to add both.

A third option might be to divide up the search space in some way and then have each rank search within its subspace. This could introduce load balancing problems, however.
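
One way to sketch this third option, assuming the search space can be enumerated as a flat list of points (feinsum's parameter spaces may not be); the helper names are hypothetical. A contiguous split is the obvious choice, but if evaluation cost happens to grow along the list, a strided (round-robin) split spreads the expensive points more evenly across ranks, which is one cheap way to soften the load-balancing concern.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def contiguous_subspace(points: Sequence[T], rank: int, size: int) -> List[T]:
    """Rank r gets one contiguous chunk; chunk sizes differ by at most one."""
    n = len(points)
    start = rank * n // size
    stop = (rank + 1) * n // size
    return list(points[start:stop])


def strided_subspace(points: Sequence[T], rank: int, size: int) -> List[T]:
    """Rank r gets points r, r + size, r + 2*size, ... (round-robin)."""
    return list(points[rank::size])


if __name__ == "__main__":
    space = list(range(10))
    for r in range(3):
        print(r, contiguous_subspace(space, r, 3), strided_subspace(space, r, 3))
```

Under MPI, `rank` and `size` would come from `comm.Get_rank()` and `comm.Get_size()`, and each rank would then run its usual search restricted to its own list.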

Keeping it as a soft dependency seems fine for now.
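
If the distributed backend stays a soft dependency, one common Python pattern is to import mpi4py lazily and fall back to a single-rank run when it is missing. A minimal sketch; the helper name `get_rank_and_size` is hypothetical, not part of feinsum.

```python
def get_rank_and_size():
    """Return (rank, size) of MPI.COMM_WORLD, or (0, 1) if mpi4py is absent."""
    try:
        from mpi4py import MPI  # optional dependency, imported only on demand
    except ImportError:
        # mpi4py not installed: behave like a single-process run.
        return 0, 1
    comm = MPI.COMM_WORLD
    return comm.Get_rank(), comm.Get_size()


if __name__ == "__main__":
    rank, size = get_rank_and_size()
    print(f"rank {rank} of {size}")
```

The same script then runs unchanged with plain `python`, or under `mpiexec -n 4 python ...` when mpi4py is installed.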

kaushikcfd commented on July 17, 2024

> Most of the basic task execution functionality is similar between the two so it wouldn't be too hard to add both.

I don't think we need a task pool for any of these options. Option 1 can be done with simple MPI point-to-point communication: the server rank receives with MPI_ANY_SOURCE, and all the client ranks know that rank 0 is the server. The server rank would then loop as follows:

while True:
    Blocking RECV from any source.
    Record the result in the database and send the next point to explore to the rank that just finished.

Let me know if I'm missing something.
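
A minimal mpi4py sketch of the Option 1 loop above, under assumptions that are not from feinsum: a plain list of points stands in for opentuner's proposals, a dict stands in for the database, `evaluate` is a hypothetical timing function, and `None` serves as both a client's initial "ready" message and the server's "stop" reply. The bookkeeping lives in a small MPI-free `Dispatcher` class (also hypothetical) so it can be tested without launching ranks.

```python
from collections import deque


class Dispatcher:
    """Server-side bookkeeping only; contains no MPI calls."""

    def __init__(self, points):
        self.pending = deque(points)  # points not yet handed out (must not be None)
        self.results = {}             # point -> timing; stands in for the database

    def on_message(self, result):
        """Record a client's (point, timing) result (None means 'ready') and
        return that client's next point, or None to tell it to stop."""
        if result is not None:
            point, timing = result
            self.results[point] = timing
        return self.pending.popleft() if self.pending else None


def serve(comm, dispatcher):
    """Rank 0: answer whichever client reports first until all have stopped."""
    from mpi4py import MPI  # soft dependency, imported only when used
    status = MPI.Status()
    active = comm.Get_size() - 1  # client ranks that have not been stopped yet
    while active > 0:
        # Blocking receive from any source; status records who sent it.
        result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        nxt = dispatcher.on_message(result)
        comm.send(nxt, dest=status.Get_source())
        if nxt is None:
            active -= 1


def work(comm, evaluate):
    """Ranks >= 1: time whatever point the server sends until told to stop."""
    comm.send(None, dest=0)  # initial 'ready' message
    while True:
        point = comm.recv(source=0)
        if point is None:
            break
        comm.send((point, evaluate(point)), dest=0)


if __name__ == "__main__":
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    if comm.Get_size() < 2:
        raise SystemExit("run with at least 2 ranks, e.g. mpiexec -n 4")
    if comm.Get_rank() == 0:
        dispatcher = Dispatcher(range(20))
        serve(comm, dispatcher)
        print("best point:", min(dispatcher.results, key=dispatcher.results.get))
    else:
        work(comm, evaluate=lambda p: (p - 7) ** 2)  # toy cost, not a real kernel timing
```

Since only rank 0 touches the results between receives, there is a single writer and no concurrent database updates; the trade-off is that rank 0 does no tuning itself.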

Option 3 is also interesting; a mild concern there is that it might force us into a sub-optimal search-exploration strategy.

nchristensen commented on July 17, 2024

Ray may (or may not) be useful here https://docs.ray.io/en/latest/index.html

kaushikcfd commented on July 17, 2024

Thanks,
I looked at the example in https://docs.ray.io/en/latest/tune/index.html and that should blend well with feinsum's description of parameter spaces. Some downsides I see are:

  1. It seems like a heavyweight library that would pull in lots of dependencies.
  2. It seems like distributed tuning depends on Kubernetes infrastructure, which again is a bit heavyweight.
