
Comments (6)

AlecThomson commented on September 14, 2024

Also, the newest version of astropy now supports writing dask arrays directly to FITS:
https://docs.astropy.org/en/v4.1/whatsnew/4.1.html#whatsnew-4-1-fitsdask
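
For a quick illustration, a minimal sketch of what that looks like (the shape, chunking, and filename here are just placeholders):

import dask.array as da
from astropy.io import fits

# build a lazy dask array; nothing is computed yet
data = da.random.random((288, 512, 512), chunks=(288, 128, 128))

# astropy >= 4.1 accepts a dask array as HDU data and computes it
# when the file is written, without loading it all into memory
hdu = fits.PrimaryHDU(data=data)
hdu.writeto("cube.fits", overwrite=True)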


Cameron-Van-Eck commented on September 14, 2024

Hey Alec:

I'm getting some serious Baader-Meinhof effect right now, because I was just reading a little bit about Dask earlier this week. Now that I have an expert (you) to consult, I have some questions:

  1. How difficult/bulky is it to install? Right now we're lucky that the only difficult dependency to install is pymultinest, so I'd rather not add something similarly complicated if it can be avoided.

  2. The description talks a bunch about parallelization within cluster environments, but does it provide parallelization/multi-core within a single machine? I can't find a straight answer to this. If so, does this mean Dask could effectively replace the use of schwimmbad for parallelization in RMclean3D?

  3. How much work would it take to convert the 3D code over, in your estimation? Just find-replace np with da?

At the moment, we're not really planning on running the CIRADA/POSSUM pipelines on a true cluster/scheduler, but instead on an OpenStack VM system where we can spin up independent VMs for different jobs. If Dask can significantly improve the performance within a 16-core VM, then that could be very worthwhile to implement.

Thanks,
Cameron


AlecThomson commented on September 14, 2024

Haha spooky! I wouldn't call myself an expert by any means. In fact, this was really my first session using Dask in earnest. But, I think that also speaks to how easy it was to get going with it. On your questions:

  1. I found it very easy. Especially compared to (py)multinest! It can be pip- or conda-installed: https://jobqueue.dask.org/en/latest/install.html. I guess we would want to test on a few different environments. But, I think miniconda usually makes life pretty easy anywhere, really.

  2. Indeed! The performance gain above was from parallelising on my laptop, which is pretty beefy (8 cores/16 threads, 32 GB RAM). Dask seems to cleverly work out the most efficient chunking and parallelisation under the hood. Resource management is handled by the distributed module: in Dask, you create a Client object, which manages the resources. I didn't show it in my example, but this is what I had on my laptop:

from dask.distributed import Client

# connect to the scheduler already running locally
# (the tcp address came from the JupyterLab Dask GUI)
client = Client("tcp://127.0.0.1:57009")
client  # in a notebook, this line displays a summary of the workers

[Screenshot: Dask Client summary showing the cluster address and the workers/threads/memory in use]
So I was actually only using 3/4 of my laptop! I created this using the GUI in JupyterLab, but it's easy to configure from a script, too. For HPCs there are drop-in tools that hook into the appropriate job schedulers (e.g. PBS, Slurm, etc.). In particular, there is dask-jobqueue. There's a nice video tutorial there that shows the setup. But basically, you just end up creating a client object that gives you access to whatever resources you're after; a rough jobqueue sketch is below. I'm trying to get this running on Galaxy at the moment; I can give an update later on how that goes. But for a VM like you describe, I think it would work really easily.
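
For reference, the jobqueue pattern looks roughly like this for Slurm (the resource numbers are placeholders that would need tuning for the actual queue):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# each job submitted to the scheduler becomes one dask worker;
# the resource requests below are made-up values
cluster = SLURMCluster(cores=16, memory="64GB", walltime="01:00:00")
cluster.scale(jobs=4)  # ask Slurm for 4 worker jobs

client = Client(cluster)  # from here on, identical to the local case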

On the last part: yes, I think Dask could indeed be a better option than schwimmbad for parallelising RM-CLEAN. Although that section would probably be the one that needs the most re-work. Perhaps even reverting to the version before my parallelisation update would actually be easier!
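
As a rough sketch of that replacement (clean_spectrum here is a hypothetical stand-in for the real per-spectrum RM-CLEAN worker):

import numpy as np
from dask.distributed import Client

client = Client()  # local cluster; uses all available cores by default

# hypothetical stand-in for the function that schwimmbad's pool.map
# currently iterates over
def clean_spectrum(dirty):
    return dirty - dirty.mean()

spectra = [np.random.random(288) for _ in range(64)]
futures = client.map(clean_spectrum, spectra)  # schedule in parallel
cleaned = client.gather(futures)               # block and collect results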

  3. Given that I was able to do this in my first session ever using Dask, I think the dev overhead would be relatively low. dask.array was pretty much a find-and-replace affair for numpy (as intended by the developers of Dask). There were a few things that didn't replace exactly; np.newaxis was one example. Also, the for-loop syntax had to change slightly in order to compute in parallel. There is also the ability to use delayed where dask.array or dask.dataframe don't drop in neatly. A toy sketch of these changes is below.
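
To make those changes concrete (shapes and functions here are arbitrary, not from the actual code):

import dask
import dask.array as da
import numpy as np

# most numpy calls translate directly: np.<fn> -> da.<fn>
cube = da.random.random((288, 1024, 1024), chunks=(288, 256, 256))
moment = da.mean(cube, axis=0)  # lazy; this only builds a task graph
result = moment.compute()       # triggers the parallel computation

# one wrinkle the blind find-replace trips on: np.newaxis (indexing
# with plain None is equivalent and works on dask arrays too)
weights = da.ones(288)
weighted = cube * weights[:, None, None]

# loops that don't map onto array ops can use dask.delayed instead
@dask.delayed
def fit_plane(seed):
    rng = np.random.default_rng(seed)
    return rng.random((16, 16)).mean()

results = dask.compute(*[fit_plane(i) for i in range(8)])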


Cameron-Van-Eck commented on September 14, 2024

That sounds really promising. I'll put it on my RM-Tools to-do list to build an experimental version using Dask to see how it compares.


AlecThomson commented on September 14, 2024

Hey @Cameron-Van-Eck,

I've recently done some experimenting with Dask-ifying the 3D routines. I believe I have working versions of RM-synth and RM-CLEAN in this fork/branch.

Big caveat: this still needs testing to ensure it gives the same results as the non-parallel version. We also need to test the scaling, and ensure that we can properly handle the data volumes we want.

The biggest change I needed to make was which files were written to disk. Right now, astropy does not support writing FITS files from Dask arrays in parallel (see #11159). Also, as we know, FITS files do not support complex values. Finally, we do not want to repeat computation unnecessarily. To solve this, I've used the file format Zarr to store intermediate data. This comes at the cost of extra files being written to disk, but I think this is really unavoidable when dealing with data >> memory.
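
The round-trip through Zarr looks roughly like this (names and shapes are placeholders):

import dask.array as da

# a complex-valued FDF cube as a lazy dask array (placeholder data)
fdf = da.random.random((288, 512, 512)) + 1j * da.random.random((288, 512, 512))
fdf = fdf.rechunk((288, 128, 128))

# chunks are written out in parallel; Zarr handles complex dtypes
da.to_zarr(fdf, "FDF_dirty.zarr", overwrite=True)

# later stages re-open the store lazily instead of recomputing
fdf = da.from_zarr("FDF_dirty.zarr")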

I've also used my own little library for reading FITS to Dask arrays -- da-fits.

Another note: this has probably completely broken the 1D tools (as they are currently implemented). My conversion to Dask was rather rudimentary: I replaced all np. with da. and then fixed whatever broke. In a couple of places I needed to use xarray to handle some of the trickier array handling. However, I think the 1D tools could easily be adapted to use the same common framework (I just haven't spent any time on them yet).
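
For example, the labelled-axes pattern looks something like this (the dimension names are just illustrative):

import dask.array as da
import xarray as xr

# wrapping the dask array in xarray gives named dimensions, which
# makes broadcasting and reductions much harder to get wrong
cube = xr.DataArray(
    da.random.random((288, 512, 512), chunks=(288, 128, 128)),
    dims=("freq", "dec", "ra"),
)
moment = cube.mean(dim="freq").compute()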

I'd be keen to see this tested more, and get your thoughts on the current implementation!


Cameron-Van-Eck commented on September 14, 2024

Hi @AlecThomson. That all sounds pretty cool. I haven't had time to dig into this at all since we discussed it last December, and probably won't in the foreseeable future (not for the rest of the year, at least).

To my mind, the intermediate Zarr files are not a big deal. If we compare the current method:
break input cubes into chunks (1 read, 1 write) -> run RM synth (1 read, 1 write) -> reassemble chunks (1 read, 1 write)
with this method:
run RM synth and output to Zarr (1 read, 1 write) -> convert format to FITS (1 read, 1 write)
the new one is already more efficient. Some systems (e.g., the CANFAR VMs I'm using for the POSSUM pipeline) can be heavily IO-bound, making I/O almost a bigger bottleneck than CPU or RAM.

Out of curiosity: would it be possible to output to HDF5 instead of Zarr (in terms of supporting the parallelization)? Since HDF5 can be read by CARTA, we could then make the FITS output optional.
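
From a quick look, dask does seem to have a direct HDF5 writer, something like the sketch below, though I don't know how well it plays with the parallel writes:

import dask.array as da

x = da.random.random((288, 512, 512), chunks=(288, 128, 128))

# dask can target HDF5 directly (via h5py); HDF5 writes typically go
# through a lock, so this may be less parallel-friendly than Zarr
da.to_hdf5("cubes.hdf5", "/FDF", x)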

Breaking the 1D tools is not an insignificant problem, but I'd expect it's mostly a matter of making sure the data is Dask-compatible when it gets passed to the relevant util functions?

Sorry for not being able to test this more in the short term. This is really great; I was thinking about these kinds of improvements recently with the work I'm doing on the POSSUM pipelines. The 1D pipeline has completely broken parallelization (previously using pySpark), so I was debating trying to replace it with Dask.

Cheers,
Cameron

