Comments (6)
Also, the newest version of astropy now supports writing dask arrays directly to FITS:
https://docs.astropy.org/en/v4.1/whatsnew/4.1.html#whatsnew-4-1-fitsdask
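A minimal sketch of what that looks like (assuming astropy >= 4.1 and dask are installed; filenames are illustrative):

```python
import dask.array as da
from astropy.io import fits

# Build a lazy, chunked array -- nothing is materialised yet
cube = da.ones((4, 8, 8), chunks=(1, 8, 8))

# astropy >= 4.1 accepts a dask array as HDU data and evaluates it
# chunk by chunk when the file is written
hdu = fits.PrimaryHDU(data=cube)
hdu.writeto("cube.fits", overwrite=True)
```

The array is only computed at `writeto` time, so the full cube never has to fit in memory at once.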
from rm-tools.
Hey Alec:
I'm getting some serious Baader-Meinhof effect right now, because I was just reading a little bit about Dask earlier this week. Now that I have an expert (you) to consult, I have some questions:
- How difficult/bulky is it to install? Right now we're lucky that the only difficult dependency to install is pymultinest, so I'd rather not add something similarly complicated if it can be avoided.
- The description talks a lot about parallelization within cluster environments, but does it also provide multi-core parallelization within a single machine? I can't find a straight answer to this. If so, does that mean Dask could effectively replace the use of schwimmbad for parallelization in RMclean3D?
- How much work would it take to convert the 3D code over, in your estimation? Just find-and-replace np with da?
At the moment, we're not really planning on running the CIRADA/POSSUM pipelines on a true cluster/scheduler, but instead on an OpenStack VM system where we can spin up independent VMs for different jobs. If Dask can significantly improve the performance within a 16-core VM, then that could be very worthwhile to implement.
Thanks,
Cameron
Haha spooky! I wouldn't call myself an expert by any means. In fact, this was really my first session using Dask in earnest. But, I think that also speaks to how easy it was to get going with it. On your questions:
- I found it very easy, especially compared to (py)multinest! It can be pip- or conda-installed: https://jobqueue.dask.org/en/latest/install.html. I guess we would want to test on a few different environments, but I think miniconda usually makes life pretty easy anywhere, really.
- Indeed! The performance gain above came from parallelising on my laptop, which is pretty beefy (8 cores/16 threads, 32 GB RAM). Dask seems to cleverly handle the most efficient chunking and parallelisation under the hood. This is handled by the `distributed` module. In Dask, you create a `Client` object, which handles the resource management. I didn't show it in my example, but this is what I had on my laptop:

```python
from dask.distributed import Client

client = Client("tcp://127.0.0.1:57009")
client
```

So I was actually only using 3/4 of my laptop! I created this using the GUI in JupyterLab, but it's easy to configure from a script, too. For HPCs there are drop-in tools that hook into the appropriate job schedulers (e.g. PBS, Slurm, etc.). In particular, there is dask-jobqueue. There's a nice video tutorial there that shows the setup. But basically, you just end up creating a client object that gives you access to whatever resources you're after. I'm trying to get this running on Galaxy at the moment; I can give an update later on how that goes. But for a VM like you describe, I think it would work really easily.
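For a single machine or VM, a `LocalCluster` gives you the same `Client` interface that dask-jobqueue's scheduler-specific clusters provide on an HPC. A minimal sketch (the worker counts and memory limit here are illustrative, not a recommendation):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Cap the resources Dask may use, rather than letting it grab the
# whole machine. processes=False keeps everything in one process,
# which is handy for notebooks and tests; drop it for real workloads.
cluster = LocalCluster(processes=False, n_workers=1,
                       threads_per_worker=4, memory_limit="2GB")
client = Client(cluster)

# Any dask computation now runs on this cluster
x = da.arange(1_000_000, chunks=100_000)
total = int(x.sum().compute())

client.close()
cluster.close()
```

On an HPC, swapping `LocalCluster` for dask-jobqueue's `SLURMCluster` or `PBSCluster` is the only change; the rest of the code is untouched.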
On the last part: yes, I think this could indeed be a better replacement than schwimmbad for parallelising RM-CLEAN. Although, that section would probably be the one that needs the most rework. Perhaps even reverting to the version before my parallelisation update would actually be easier!
- Given that I was able to do this in my first session ever using Dask, I think the dev overhead would be relatively low. `dask.array` was pretty much a find-and-replace affair for `numpy` (as intended by the developers of Dask). There were a few things that didn't replace exactly; `np.newaxis` was one example. Also, the `for` loop syntax had to change slightly in order to be computed in parallel. There is also the ability to use `delayed` where `dask.array` or `dask.dataframe` don't drop in neatly.
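To illustrate the kind of translation involved (a toy sketch, not taken from the actual RM-tools code):

```python
import numpy as np
import dask.array as da

x_np = np.arange(16).reshape(4, 4)
x_da = da.arange(16, chunks=8).reshape(4, 4)

# Most expressions are a straight find-and-replace:
s_np = (x_np ** 2).sum(axis=0)
s_da = (x_da ** 2).sum(axis=0)   # lazy: builds a task graph, computes nothing

# The loop pattern changes: build all the lazy results first, then
# compute them together so Dask can parallelise across them
lazy = [x_da[i].sum() for i in range(4)]
row_sums = da.compute(*lazy)     # one parallel evaluation

assert (s_da.compute() == s_np).all()
```

Calling `.compute()` inside a loop would force each iteration to run serially; collecting the lazy results and computing them in one go is what lets Dask schedule them in parallel.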
That sounds really promising. I'll put it on my RM-Tools to-do list to build an experimental version using Dask to see how it compares.
Hey @Cameron-Van-Eck,
I've recently done some experimenting with Dask-ifying the 3D routines. I believe I have working versions of RM-synth and RM-CLEAN in this fork/branch.
Big caveat: this still needs testing to ensure it gives the same results as the non-parallel version. We also need to test the scaling, and ensure that we can properly handle the data volumes we want.
The biggest change I needed to make was which files were written to disk. Right now, astropy does not support writing FITS files from Dask arrays in parallel (see #11159). Also, as we know, FITS files do not support complex values. Finally, we do not want to repeat computation unnecessarily. To solve this, I've used the file format Zarr to store intermediate data. This comes at the cost of extra files being written to disk, but I think this is really unavoidable when dealing with data >> memory.
I've also used my own little library for reading FITS to Dask arrays -- da-fits.
Another note: this has probably completely broken the 1D tools (as they are currently implemented). My conversion to Dask was rather rudimentary: I replaced all `np.` with `da.` and then fixed whatever broke. In a couple of places I needed to use xarray to handle some of the trickier array handling. However, I think the 1D tools could easily be adapted to use the same common framework (I just haven't spent any time on them yet).
I'd be keen to see this tested more, and get your thoughts on the current implementation!
Hi @AlecThomson. That all sounds pretty cool. I haven't had time to dig into this at all since we discussed it last December, and probably won't in the foreseeable future (not for the rest of the year, at least).
To my mind, the intermediate Zarr files are not a big deal. If we compare with the current method:
break input cubes into chunks (1 read, 1 write) -> run RM synth (1 read, 1 write) -> reassemble chunks (1 read, 1 write)
then this method:
run RM synth and output to Zarr (1 read, 1 write) -> convert format to FITS (1 read, 1 write)
is already more efficient. Some systems (e.g., the CANFAR VMs I'm using for the POSSUM pipeline) can be heavily IO-bound, making that almost a bigger bottleneck than CPU or RAM.
Out of curiosity: would it be possible to output to HDF5 instead of Zarr (in terms of supporting the parallelization)? Since HDF5 can be read by CARTA, we could then make the FITS output optional.
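For what it's worth, `dask.array` does ship a chunked HDF5 writer built on h5py, so the same intermediate-file pattern should carry over. A sketch (filenames illustrative; assumes h5py is installed, and note HDF5's complex-value handling would still need checking against CARTA):

```python
import dask.array as da
import h5py

cube = da.ones((8, 16, 16), chunks=(2, 16, 16))

# Chunked write to an HDF5 dataset; HDF5 writes are not thread-safe,
# so Dask serialises them internally with a lock
da.to_hdf5("cube.h5", "/data", cube)

# Reading back lazily: wrap the h5py dataset in a dask array
f = h5py.File("cube.h5", "r")
cube_back = da.from_array(f["/data"], chunks=(2, 16, 16))
```

Because the writes are lock-serialised, HDF5 output may be slower than Zarr for the intermediate step, even though the computation feeding it still runs in parallel.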
Breaking the 1D tools is not an insignificant problem, but I'd expect it's mostly a matter of making sure the data is compatible with Dask when it gets passed to the relevant util functions?
Sorry for not being able to test this more in the short term. This is really great; I was thinking about these kinds of improvements recently with the work I'm doing on the POSSUM pipelines. The 1D pipeline has completely broken parallelization (previously using pySpark), so I was debating trying to replace it with Dask.
Cheers,
Cameron