
Comments (6)

AlecThomson commented on September 14, 2024

Also, the newest version of astropy now supports writing dask arrays directly to FITS:
https://docs.astropy.org/en/v4.1/whatsnew/4.1.html#whatsnew-4-1-fitsdask
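
For a quick illustration, a minimal sketch of what that looks like (the shape, chunking, and filename here are just placeholders):

import dask.array as da
from astropy.io import fits

# build a lazy dask array; nothing is computed yet
data = da.random.random((288, 512, 512), chunks=(288, 128, 128))

# astropy >= 4.1 accepts a dask array as HDU data and computes it
# when the file is written, without loading it all into memory
hdu = fits.PrimaryHDU(data=data)
hdu.writeto("cube.fits", overwrite=True)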


Cameron-Van-Eck commented on September 14, 2024

Hey Alec:

I'm getting some serious Baader-Meinhof effect right now, because I was just reading a little bit about Dask earlier this week. Now that I have an expert (you) to consult, I have some questions:

  1. How difficult/bulky is it to install? Right now we're lucky that the only difficult dependency to install is pymultinest, so I'd rather not add something similarly complicated if it can be avoided.

  2. The description talks a bunch about parallelization within cluster environments, but does it provide parallelization/multi-core within a single machine? I can't find a straight answer to this. If so, does this mean Dask could effectively replace the use of schwimmbad for parallelization in RMclean3D?

  3. How much work would it take to convert the 3D code over, in your estimation? Just find-replace np with da?

At the moment, we're not really planning on running the CIRADA/POSSUM pipelines on a true cluster/scheduler, but instead on an OpenStack VM system where we can spin up independent VMs for different jobs. If Dask can significantly improve the performance within a 16-core VM, then that could be very worthwhile to implement.

Thanks,
Cameron


AlecThomson commented on September 14, 2024

Haha spooky! I wouldn't call myself an expert by any means. In fact, this was really my first session using Dask in earnest. But, I think that also speaks to how easy it was to get going with it. On your questions:

  1. I found it very easy. Especially compared to (py)multinest! It can be pip- or conda-installed: https://jobqueue.dask.org/en/latest/install.html. I guess we would want to test on a few different environments. But, I think miniconda usually makes life pretty easy anywhere, really.

  2. Indeed! The performance gain above was from parallelising on my laptop, which is pretty beefy (8 cores/16 threads, 32 GB RAM). Dask seems to cleverly work out the most efficient chunking and parallelisation under the hood. Resource management is handled by the distributed module: in Dask, you create a Client object, which manages the resources. I didn't show it in my example, but this is what I had on my laptop:

from dask.distributed import Client

# connect to the scheduler already running locally
# (the tcp address came from the JupyterLab Dask GUI)
client = Client("tcp://127.0.0.1:57009")
client  # in a notebook, this line displays a summary of the workers

[Screenshot: Dask Client summary showing the cluster address and the workers/threads/memory in use]
So I was actually only using 3/4 of my laptop! I created this using the GUI in JupyterLab, but it's easy to configure from a script, too. For HPCs there are drop-in tools that hook into the appropriate job schedulers (e.g. PBS, Slurm, etc.). In particular, there is dask-jobqueue. There's a nice video tutorial there that shows the setup. But basically, you just end up creating a client object that gives you access to whatever resources you're after; a rough jobqueue sketch is below. I'm trying to get this running on Galaxy at the moment; I can give an update later on how that goes. But for a VM like you describe, I think it would work really easily.
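
For reference, the jobqueue pattern looks roughly like this for Slurm (the resource numbers are placeholders that would need tuning for the actual queue):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# each job submitted to the scheduler becomes one dask worker;
# the resource requests below are made-up values
cluster = SLURMCluster(cores=16, memory="64GB", walltime="01:00:00")
cluster.scale(jobs=4)  # ask Slurm for 4 worker jobs

client = Client(cluster)  # from here on, identical to the local case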

On the last part: yes, I think Dask could indeed be a better option than schwimmbad for parallelising RM-CLEAN. Although that section would probably be the one that needs the most re-work. Perhaps even reverting to the version before my parallelisation update would actually be easier!
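
As a rough sketch of that replacement (clean_spectrum here is a hypothetical stand-in for the real per-spectrum RM-CLEAN worker):

import numpy as np
from dask.distributed import Client

client = Client()  # local cluster; uses all available cores by default

# hypothetical stand-in for the function that schwimmbad's pool.map
# currently iterates over
def clean_spectrum(dirty):
    return dirty - dirty.mean()

spectra = [np.random.random(288) for _ in range(64)]
futures = client.map(clean_spectrum, spectra)  # schedule in parallel
cleaned = client.gather(futures)               # block and collect results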

  3. Given that I was able to do this in my first session ever using Dask, I think the dev overhead would be relatively low. dask.array was pretty much a find-and-replace affair for numpy (as intended by the developers of Dask). There were a few things that didn't replace exactly; np.newaxis was one example. Also, the for-loop syntax had to change slightly in order to compute in parallel. There is also the ability to use delayed where dask.array or dask.dataframe don't drop in neatly. A toy sketch of these changes is below.
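
To make those changes concrete (shapes and functions here are arbitrary, not from the actual code):

import dask
import dask.array as da
import numpy as np

# most numpy calls translate directly: np.<fn> -> da.<fn>
cube = da.random.random((288, 1024, 1024), chunks=(288, 256, 256))
moment = da.mean(cube, axis=0)  # lazy; this only builds a task graph
result = moment.compute()       # triggers the parallel computation

# one wrinkle the blind find-replace trips on: np.newaxis (indexing
# with plain None is equivalent and works on dask arrays too)
weights = da.ones(288)
weighted = cube * weights[:, None, None]

# loops that don't map onto array ops can use dask.delayed instead
@dask.delayed
def fit_plane(seed):
    rng = np.random.default_rng(seed)
    return rng.random((16, 16)).mean()

results = dask.compute(*[fit_plane(i) for i in range(8)])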


Cameron-Van-Eck commented on September 14, 2024

That sounds really promising. I'll put it on my RM-Tools to-do list to build an experimental version using Dask to see how it compares.


AlecThomson commented on September 14, 2024

Hey @Cameron-Van-Eck,

I've recently done some experimenting with Dask-ifying the 3D routines. I believe I have working versions of RM-synth and RM-CLEAN in this fork/branch.

Big caveat: this still needs testing to ensure it gives the same results as the non-parallel version. We also need to test the scaling, and ensure that we can properly handle the data volumes we want.

The biggest change I needed to make was which files were written to disk. Right now, astropy does not support writing FITS files from Dask arrays in parallel (see #11159). Also, as we know, FITS files do not support complex values. Finally, we do not want to repeat computation unnecessarily. To solve this, I've used the file format Zarr to store intermediate data. This comes at the cost of extra files being written to disk, but I think this is really unavoidable when dealing with data >> memory.
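
The round-trip through Zarr looks roughly like this (names and shapes are placeholders):

import dask.array as da

# a complex-valued FDF cube as a lazy dask array (placeholder data)
fdf = da.random.random((288, 512, 512)) + 1j * da.random.random((288, 512, 512))
fdf = fdf.rechunk((288, 128, 128))

# chunks are written out in parallel; Zarr handles complex dtypes
da.to_zarr(fdf, "FDF_dirty.zarr", overwrite=True)

# later stages re-open the store lazily instead of recomputing
fdf = da.from_zarr("FDF_dirty.zarr")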

I've also used my own little library for reading FITS to Dask arrays -- da-fits.

Another note: this has probably completely broken the 1D tools (as they are currently implemented). My conversion to Dask was rather rudimentary: I replaced all np. with da. and then fixed whatever broke. In a couple of places I needed to use xarray to handle some of the trickier array handling. However, I think the 1D tools could easily be adapted to use the same common framework (I just haven't spent any time on them yet).
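
For example, the labelled-axes pattern looks something like this (the dimension names are just illustrative):

import dask.array as da
import xarray as xr

# wrapping the dask array in xarray gives named dimensions, which
# makes broadcasting and reductions much harder to get wrong
cube = xr.DataArray(
    da.random.random((288, 512, 512), chunks=(288, 128, 128)),
    dims=("freq", "dec", "ra"),
)
moment = cube.mean(dim="freq").compute()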

I'd be keen to see this tested more, and get your thoughts on the current implementation!


Cameron-Van-Eck commented on September 14, 2024

Hi @AlecThomson. That all sounds pretty cool. I haven't had time to dig into this at all since we discussed it last December, and probably won't in the foreseeable future (not for the rest of the year, at least).

To my mind, the intermediate Zarr files are not a big deal. If we compare the current method:
break input cubes into chunks (1 read, 1 write) -> run RM synth (1 read, 1 write) -> reassemble chunks (1 read, 1 write)
with this method:
run RM synth and output to Zarr (1 read, 1 write) -> convert format to FITS (1 read, 1 write)
the new one is already more efficient. Some systems (e.g., the CANFAR VMs I'm using for the POSSUM pipeline) can be heavily IO-bound, making I/O almost a bigger bottleneck than CPU or RAM.

Out of curiosity: would it be possible to output to HDF5 instead of Zarr (in terms of supporting the parallelization)? Since HDF5 can be read by CARTA, we could then make the FITS output optional.
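
From a quick look, dask does seem to have a direct HDF5 writer, something like the sketch below, though I don't know how well it plays with the parallel writes:

import dask.array as da

x = da.random.random((288, 512, 512), chunks=(288, 128, 128))

# dask can target HDF5 directly (via h5py); HDF5 writes typically go
# through a lock, so this may be less parallel-friendly than Zarr
da.to_hdf5("cubes.hdf5", "/FDF", x)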

Breaking the 1D tools is not an insignificant problem, but I'd expect it's mostly a matter of making sure the data is Dask-compatible when it gets passed to the relevant util functions?

Sorry for not being able to test this more in the short term. This is really great; I was thinking about these kinds of improvements recently with the work I'm doing on the POSSUM pipelines. The 1D pipeline has completely broken parallelization (previously using pySpark), so I was debating trying to replace it with Dask.

Cheers,
Cameron

