
dask-blog's Introduction

Dask


Dask is a flexible parallel computing library for analytics. See documentation for more information.

LICENSE

New BSD. See License File.

dask-blog's People

Contributors

aterrel, avriiil, bstadlbauer, dmcg, dmilkie, freyam, genevievebuckley, itamarst, jacobtomlinson, jakirkham, jrbourbeau, jsignell, martindurant, matt711, mrocklin, ntabris, pavithraes, pentschev, phofl, quasiben, rjzamora, scharlottej13, stsievert, thewtex, tomaugspurger


dask-blog's Issues

Blogpost on loading TIF data into a Dask array

It would be handy to have a blogpost about loading image data into Dask Array, and then possibly storing it into some other format, like HDF5 or Zarr.

It might be nice to have both a trivial example here (a simple stack of images), as well as a less-trivial example (if those are common).
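The trivial example could open with the standard delayed-reader pattern. Here is a minimal sketch, where `read_tif` is a stand-in for a real reader such as `tifffile.imread` and the file names are hypothetical; only dask and numpy are assumed:

```python
import numpy as np
import dask
import dask.array as da

# Stand-in for a real reader such as tifffile.imread;
# each "file" here is just a synthetic 512x512 frame.
def read_tif(filename):
    return np.ones((512, 512), dtype=np.uint16)

filenames = [f"frame_{i:03d}.tif" for i in range(10)]  # hypothetical names

# Wrap each read in dask.delayed, then stack into one 3D Dask array.
lazy_frames = [dask.delayed(read_tif)(fn) for fn in filenames]
stack = da.stack(
    [da.from_delayed(frame, shape=(512, 512), dtype=np.uint16)
     for frame in lazy_frames]
)
print(stack.shape)  # (10, 512, 512)
```

From there the post could show writing the stack out with something like `da.to_zarr` for the storage half of the story.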

cc @jakirkham, perhaps the world's leading expert

Env build not working on M1 mac

Describe the issue:
I'm trying to contribute a blog post but cannot build the Ruby/Jekyll software environment, most likely because I'm on an M1 Mac.

Minimal Complete Verifiable Example:

conda install rb-commonmarker

or, within the project directory:

bundle install

partial output:

...
Installing nokogiri 1.13.9 (arm64-darwin)
Gem::Ext::BuildError: ERROR: Failed to build gem native extension.

    current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
/Users/rpelgrim/mambaforge/envs/dask-blog/bin/ruby -I /Users/rpelgrim/mambaforge/envs/dask-blog/lib/ruby/3.1.0 -r
./siteconf20221121-16691-jyr1nw.rb extconf.rb
creating Makefile

current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
make DESTDIR\= clean

current directory: /Users/rpelgrim/mambaforge/envs/dask-blog/share/rubygems/gems/commonmarker-0.23.6/ext/commonmarker
make DESTDIR\=
compiling arena.c
make: arm64-apple-darwin20.0.0-clang: No such file or directory
make: *** [arena.o] Error 1

make failed, exit code 2
...

Anything else we need to know?:

Environment:

  • Dask version: n/a
  • Ruby version: 3.1.2
  • Operating System: macOS
  • Install method (conda, pip, source): conda / gem

Change GPU scheduler article

This article https://blog.dask.org/2023/04/14/scheduler-environment-requirements includes statements like the following:

If you use value-add hardware on the client and workers such as GPUs you’ll need to ensure your scheduler has one

This statement can be misleading. It's very true for RAPIDS work, but generally less true for PyTorch or other GPU work. (here is a pretty typical example).

I've fielded a bunch of questions on this topic. Here is an example. I think that we should alter this blog post to talk more about how serialization works. This article is causing non-trivial confusion among general, non-RAPIDS, GPU users.

cc @jacobtomlinson @quasiben @ntabris

Mosaic image fusion blogpost - add a link to Tobias' project in the "Also see" section?

@VolkerH I saw you added a small note to the DaskFusion README that links to Tobias' project.

Should I do the same in the blogpost under the "Also see" section heading at the bottom? They each focus on slightly different things, so it might be helpful for people.

https://twitter.com/TobiasAdeJong/status/1466719789280382977

Some [feedback on the blog post](https://twitter.com/TobiasAdeJong/status/1466719789280382977) by [Tobias deJong](https://github.com/TAdeJong) points out a very similar approach that incorporates optimization of the tile positions, [see this notebook](https://github.com/TAdeJong/LEEM-analysis/blob/master/6%20-%20Stitching.ipynb).

Blogpost idea: how to choose good settings for Dask on HPC

It'd be good to have a blogpost about how to choose good settings for Dask on HPC. Users are often confused about this.

I think one reason this is particularly confusing is that settings often need to be defined in multiple locations, and people are confused about how they interact. For example, someone might submit a job to SLURM with sbatch, which then runs a python program involving Dask, and want to know how that fits together.

#116 (comment)

...you know what would ALSO be a good blogpost? How to choose good cluster settings. Eg: how your SLURM/PBS/whatever batch submission settings relate to the settings you need to put in your dask-jobqueue cluster object.

To be honest I'm still a bit confused by this, and it is something other people ask me too.

If either @jacobtomlinson or @ian-r-rose would like to help make this, that would be very useful to refer people to (hint, hint) 😄

@guillaumeeb has kindly agreed to help put this together #116 (comment)

Hi all, I saw this issue, and I agree that both ideas would make great articles. Those are questions we see a lot as HPC admin/experts.

I can try to help with the second one, on batch submission settings! Everyone is confused about it.
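One way the post could make the two layers concrete is an annotated dask-jobqueue config showing how each setting surfaces in the generated batch script. This is an illustrative sketch only (the values are made up, and key names should be checked against the current dask-jobqueue docs):

```yaml
# ~/.config/dask/jobqueue.yaml -- illustrative values only
jobqueue:
  slurm:
    queue: regular        # becomes  #SBATCH -p regular
    cores: 24             # becomes  #SBATCH --cpus-per-task=24
    memory: 100GB         # becomes  #SBATCH --mem=...
    walltime: '01:00:00'  # becomes  #SBATCH -t 01:00:00
    processes: 4          # how many Dask worker processes share each SLURM job
```

The key confusion to address: `SLURMCluster` submits its *own* jobs (one per worker group, e.g. via `cluster.scale(jobs=10)`), so these settings describe the worker jobs, not the sbatch script the user launches their client script from.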

Blog Post Suggestion: Multi TB time series dataset -> Image processing -> video

I have a use case where I collect multi-TB time-series datasets (electron microscopy videos). The image processing requires a combination of resizing (e.g. 4k -> 1k in xy, or time-averaging frames), various filtering steps, background subtraction, and image alignment. Some of these steps can be performed by considering only an individual frame; others require using neighboring frames, etc. After processing, the images are typically rendered to video and/or processed programmatically. Currently, a lot of this is done with proprietary software, which is limited in its capabilities and slow.

It would be great to create an open-source repo for analysing these sorts of datasets, underpinned by Dask. I've written a basic workflow, which with a bit of polish and some assistance, I thought could be the basis of a blog post, and would be a great spot for me to start a repo from.

The blog post would demonstrate:

  • reading image files with a custom file reader (file reader already written)
  • video writing with FFmpeg from a Dask array or list of delayed objects
  • how to handle working with individual frames vs. multiple frames, and the implications for chunk size
  • demonstrating custom user functions
  • a workflow that scales from laptop to HPC (I have access to HPC compute time)
  • it would be great to include examples of GPU-accelerated image processing steps
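The individual-frame vs. neighboring-frame distinction maps naturally onto `map_blocks`-style elementwise work vs. `map_overlap`. A small sketch with synthetic data (assumes only dask and numpy; a real pipeline would substitute the custom file reader and an FFmpeg writer):

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for a time series of frames: (time, y, x).
video = da.random.random((100, 256, 256), chunks=(10, 256, 256))

# Per-frame step (background subtraction): no data needed from
# neighbouring chunks, plain array arithmetic suffices.
background = video.mean(axis=0)
subtracted = video - background

# Neighbouring-frame step (3-frame running mean along time):
# map_overlap shares `depth` frames across chunk boundaries so the
# result has no seams at chunk edges.
def running_mean(block):
    # Same-shape 3-frame mean with edge replication at the ends.
    p = np.pad(block, ((1, 1), (0, 0), (0, 0)), mode="edge")
    return (p[:-2] + p[1:-1] + p[2:]) / 3.0

smoothed = subtracted.map_overlap(
    running_mean, depth={0: 1}, boundary="nearest"
)
```

Each frame of `smoothed` could then be streamed to an FFmpeg process in order, or written per-chunk.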

might be a bit too similar to:

Idea: `ipycytoscape` visualization option

Recently we added a new engine="ipycytoscape" option to Dask's visualize(...) functionality. It might be good to write a short blog post about it. @ian-r-rose, you added ipycytoscape support -- do you have any interest in writing such a post?

This idea came up in the September 2022 monthly community meeting. Opening an issue so we don't lose track of the idea.

Blogpost using Dask Array with ITK

We've run into some people who use ITK a bit on imaging data. A blogpost that showed a simple-yet-realistic workflow with Dask Array and ITK together would probably help onboard these groups more effectively.

In conversation with @jakirkham and Gokul we were thinking about something like the following:

  1. Read in some data #26
  2. Deskew some imaging data coming off of an advanced microscope
  3. Deconvolve that data to sharpen things up a bit
  4. Run segmentation to get out regions of interest

This workflow is far from set in stone. I suspect that things will quickly change when someone tries this on real data, but it might provide an initial guideline. Also, as @jakirkham brought up, it might be worth splitting this up into a few smaller posts.
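To make the chunk-boundary issue in steps 2-3 concrete, here is a hedged sketch with a pure-NumPy placeholder standing in for the actual ITK deconvolution call (the function body, PSF, and shapes are illustrative, not the real workflow):

```python
import numpy as np
import dask.array as da

# Placeholder for an ITK deconvolution filter call; a real version
# would convert each chunk to an itk.Image, run the filter, and
# convert back. Here we just do a toy unsharp-mask along one axis.
def deconvolve(chunk, psf):
    # `psf` is unused in this placeholder.
    blurred = (np.roll(chunk, 1, axis=0) + np.roll(chunk, -1, axis=0)) / 2
    return chunk + 0.5 * (chunk - blurred)

psf = np.ones((5, 5, 5)) / 125.0          # stand-in point spread function
volume = da.random.random((64, 128, 128), chunks=(32, 64, 64))

# Overlap hides the chunk seams that a real deconvolution kernel
# would otherwise create at block boundaries.
deconvolved = volume.map_overlap(
    deconvolve, depth=4, psf=psf, dtype=volume.dtype
)
```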

cc @thewtex for visibility. I think that he has also been thinking about these same problems of using Dask to scale out ITK workloads on large datasets.

Blogpost idea: how to generate multiscale image arrays

This PR is currently in progress, but could be merged soon (for some loose value of "soon", I don't have a good idea of when) ome/ome-zarr-py#192

When it is done, I think it might be nice to have a blogpost about how to generate a multiscale image array and save it to disk, etc.

This is something that surprisingly doesn't seem to have a single, obvious, best way to do it (see discussion ome/ome-zarr-py#215). So when there is a convenience function available, it would be good to highlight that with a blogpost.

Jacob, feel free to nudge me in a few months about this, if you like. (That may or may not work, I can't say for sure I'll be available to do more about it then, but it's worth a try)

2019-03-27-dask-cuml.md missing info

Are the examples run on datasets of 2.8 KB and 28 GB, or 2.8 MB and 28 MB?

In the "Fast Fitting" timing section:

  • example 1: the error is not stated
  • example 2: the unit of measure for the error is missing

Edit blogpost to add links to the example data

This blogpost should probably be edited to include direct links to the example data and example PSF.

I think I found the data & PSF, more details here #138 (comment)

Link to the data: https://drive.google.com/drive/folders/13mpIfqspKTIINkfoWbFsVtFF8D7jbTqJ (linked to from this earlier blogpost about image loading)

Link to the PSF: https://drive.google.com/drive/folders/13udO-h9epItG5MNWBp0VxBkKCllYBLQF (discussed here)

Details from the blogpost:

We will use image data generously provided by Gokul Upadhyayula at the Advanced Bioimaging Center at UC Berkeley and discussed in this paper (preprint), though the workloads presented here should work for any kind of imaging data, or array data generally.

Dask on HPC, what works and what doesn't

Hi All,

I'd like for a group of us to write a blogpost about using Dask on supercomputers, including why we like it today, and highlighting improvements that could be done in the near future to improve usability. My goal for this post is to show it around to various HPC groups, and to show it to my employer to motivate work in this area. I think that now is a good time for this community to have some impact by sharing its recent experience.

cc'ing some notable users today @guillaumeeb @jhamman @kmpaul @lesteve @dharhas @josephhardinee @jakirkham

To start conversation, if we were to structure the post as five reasons we use Dask on HPC and five things that could be better, what would be those five things? I think it'd be good to get a five-item list from a few people cc'ed above, then maybe we talk about those lists and I (or anyone else if interested) composes an initial draft that we can then all iterate on?

Resize images

The images in this blogpost appear way too large. The size should be reduced so they fit comfortably. This might be happening on other posts (especially other posts of mine) too.

There are two possible approaches to fix this:

  1. Use HTML syntax instead of markdown: <img src="image/file.jpg" alt="alt text" width="700"> with a maximum width of 700 pixels.
  2. Reduce the resolution of the image files and commit those with git (additional benefit: a slightly smaller repository size).

Dask-GLM doesn't converge with Dask array

After a bit of profiling, this is what I found out for Dask-GLM with Dask array:

    14339    0.139    0.000    0.814    0.000 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:430(fire_task)
    44898   19.945    0.000   19.945    0.000 {method 'acquire' of '_thread.lock' objects}
     4055    0.042    0.000   19.992    0.005 /usr/lib/python3.5/threading.py:261(wait)
    14339    0.107    0.000   20.234    0.001 /usr/lib/python3.5/queue.py:147(get)
    14339    0.018    0.000   20.253    0.001 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:140(queue_get)
      122    0.117    0.001   22.327    0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/local.py:345(get_async)
      122    0.013    0.000   22.346    0.183 /home/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py:33(get)
      122    0.004    0.000   22.733    0.186 /home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute)
        1    0.020    0.020   23.224   23.224 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:200(admm)
        1    0.000    0.000   23.267   23.267 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/utils.py:13(normalize_inputs)
        1    0.000    0.000   23.268   23.268 /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/estimators.py:65(fit)

A big portion of the time seems to be spent waiting on a thread lock. Also, looking at the callers, we see 100 compute() calls departing from admm(), which means it's not converging and only stops at max_iter, as @cicdw suggested:

/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute)                               <-     100    0.004   19.637  /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)

Running with NumPy, the algorithm converges, showing only 7 compute() calls:

/home/pentschev/.local/lib/python3.5/site-packages/dask/base.py:345(compute)                          <-       7    0.000    0.120  /home/pentschev/.local/lib/python3.5/site-packages/dask_glm/algorithms.py:197(admm)

I'm running Dask 1.1.4 and the Dask-GLM master branch, to ensure that my local changes aren't introducing any bugs. However, if I run my Dask-GLM branch and use CuPy as a backend, it also converges in 7 iterations.

To me this suggests that we have one of those very well-hidden and difficult-to-track bugs in Dask. Before I spend hours on this, any suggestions about what we could look for?

Originally posted by @pentschev in #15

Blogpost idea: choosing good chunk sizes in Dask (turn this tweetorial into a blogpost)

Ian mentioned this tweet to me today. I originally wrote it because I'd just given a tutorial, and lots of people were confused about how to choose good chunk sizes in Dask. Apparently that was helpful to a lot of people (this is supported by the twitter analytics stats, which are much higher than typical).

For better discoverability and permanence, it might be good to have these tips in blogpost format (twitter is a bit ephemeral, and searches don't often unearth content there).
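One of the core tips from the thread (aim for chunks in the ~100 MB ballpark, and rechunk when the access pattern changes) could be illustrated with a few lines like these (the numbers are illustrative):

```python
import dask.array as da

# 40,000 x 40,000 float64 array: ~12.8 GB total.
x = da.ones((40_000, 40_000), chunks=(40_000, 400))

# Each chunk is 40_000 * 400 * 8 bytes = 128 MB -- big enough to
# amortize per-task overhead, small enough that several fit
# comfortably in a worker's memory at once.
print(x.chunksize)   # (40000, 400)
print(x.numblocks)   # (1, 100)

# Rechunk when the access pattern changes, e.g. to row-wise slicing:
y = x.rechunk((400, 40_000))
```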

Investigate Dask User Survey

We now have a couple hundred responses from the Dask User Survey. We should probably analyze this data and write about it. This includes things like

  • Summarizing the results of individual questions
  • Looking at correlations between different questions (for example maybe HPC users use Dask array more while Yarn users use Dask dataframe more?)
  • Calling out surprises (like the focus on SSH support)

We should also add context about how this is being used to drive current work

Refresh github action is failing

What happened:
The "Refresh" GitHub Action has been failing for the last 8 months. This workflow is supposed to rebuild the GitHub Pages site for the Dask blog every day at 3am UTC.

https://github.com/dask/dask-blog/blob/gh-pages/.github/workflows/refresh.yml

What you expected to happen:
I expected the github action to rebuild/refresh the dask blog website instead of failing.

Minimal Complete Verifiable Example:
Here are the logs for the latest failed action: https://github.com/dask/dask-blog/runs/4403644252?check_suite_focus=true

It fails on the step "Trigger GitHub pages rebuild"

Run curl --fail --request POST \
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (22) The requested URL returned error: 401 
Error: Process completed with exit code 22.
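The 401 suggests the request is no longer authenticating. One possible shape for a fix, sketched here with the repository's built-in token and the Pages build API (illustrative only; the real step in refresh.yml may differ):

```yaml
# Illustrative sketch -- compare against the actual step in
# .github/workflows/refresh.yml before using.
- name: Trigger GitHub Pages rebuild
  run: |
    curl --fail --request POST \
      --url "https://api.github.com/repos/${{ github.repository }}/pages/builds" \
      --header "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
      --header "Accept: application/vnd.github+json"
```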

Anything else we need to know?:
The last successful refresh action ran 8 months ago. I'm guessing something about the github actions environment changed around that time.

cc @jacobtomlinson - maybe you have some suggestions? As the author of #69 you have some good background context on this.

Environment:
Using Github actions associated with this repository.

Click to expand - "Set up job" information from the github action:
Current runner version: '2.285.0'
Operating System
  Ubuntu
  20.04.3
  LTS
Virtual Environment
  Environment: ubuntu-20.04
  Version: 20211129.1
  Included Software: https://github.com/actions/virtual-environments/blob/ubuntu20/20211129.1/images/linux/Ubuntu2004-README.md
  Image Release: https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20211129.1
Virtual Environment Provisioner
  1.0.0.0-master-20211123-1
GITHUB_TOKEN Permissions
  Actions: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
Secret source: Actions
Prepare workflow directory
Prepare all required actions

Annotation machinery

We've had a fair number of questions related to things changed after adding the annotation machinery. Most of the answers are pretty similar. Also this often ties into other things people may want to do (heterogenous computing, special resource allocation, etc.). It might be helpful to create some higher visibility content on this explaining what changed, how users should update existing code, and what new things they might now be able to do with annotations
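Such a post could start from a minimal example of the public `dask.annotate` context manager. A sketch (the resource name "GPU" is arbitrary; acting on the annotation is up to the deployment, e.g. distributed worker resources):

```python
import dask
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5))

# Tasks created inside the context manager carry the annotation;
# a resource-aware scheduler or plugin can then act on it.
with dask.annotate(resources={"GPU": 1}):
    y = x.sum()

# Annotations live on the high-level graph layers:
annotated = [
    layer.annotations
    for layer in y.__dask_graph__().layers.values()
    if layer.annotations
]
print(annotated)
```

Note that the single-machine schedulers ignore annotations; they only take effect on deployments (like distributed) that know how to consume them.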

Blogpost idea: Understanding meta

I am not sure this belongs as a blogpost, but I have been including a bit about meta in some talks that I have been doing and there are some recurring questions that come up on the issue tracker that make it clear that the concept of meta is not well understood.

Basically I'd just be expanding on this comment: dask/dask#8515 (comment)

Add tags metadata to blogposts

Hi folks, it looks like we currently track tags in the markdown file for each blogpost, but we don't expose them in the HTML or the XML/RSS/Atom feeds.

We're trying to migrate our blog over to a new platform, and it would be useful to have these tags in the RSS/Atom feeds, even if they're not visible anywhere.

Does anyone have time to add this? At minimum it would probably mean looking up what approach people use by default, and then editing https://github.com/dask/dask-blog/blob/gh-pages/atom.xml or the layouts directory to include tags somewhere. (Grepping for tags yields some interesting results)

@jacobtomlinson is this easy for you or someone around you?

Should I include RAPIDS posts here?

So I just pushed out http://matthewrocklin.com/blog/work/2019/01/03/dask-array-gpus-first-steps which I think would be a great post to include here as well, except that at the end I say "come work for NVIDIA" which is a bit corporate. Should we include this post on blog.dask.org as well? Some options:

  1. Yes, as long as things like this are at the end and tasteful then yes, we invite posts with an agenda (in this case, recruitment)
  2. Yes, but please remove the "come work for NVIDIA" bit on our version of it
  3. No, it's already posted elsewhere, no need to cross post here
  4. ...

Thoughts or objections?

Idea: `msgspec` post

The idea of having Dask-adjacent posts on the Dask blog came up in the last monthly community meeting. When thinking of interesting projects in the space, msgspec came up. @jcrist, do you think a short blog post on msgspec makes sense for the Dask blog? Do you have any interest in authoring such a post?

This idea came up in the September 2022 monthly community meeting. Opening an issue so we don't lose track of the idea.

Factual correction for text in skeleton analysis blogpost

There is a factual error in my old skeleton analysis blogpost.

The blogpost shows a violin plot of the euclidean-distance measurement from skan, and says this is the skeleton branch thickness.

We can see that there are more thick blood vessel branches in the healthy lung.

That is incorrect. I misunderstood, and the error wasn't caught by Juan's otherwise excellent review (Juan is the author of the skan library).

Instead, it is correct to call this "the straight line distance from one end of the branch to the other".

Juan says:

The first thing I’ll say is that euclidean distance is not the thickness — it is the straight line distance from one end of the branch to the other

See this comment and this comment in an image.sc forum discussion for the full context. That discussion is from last year, but I only became aware of it today.

New Dask + ITK blogpost

Now that release v5.3rc03 of ITK is available (which should include this PR), it would be good to do a follow-up blogpost to this one about using Dask + ITK together.

The purpose of this would be:

  1. To tell other people that ITK images can now be serialized & used with Dask, in the hopes that they experiment with this on their own, and/or
  2. in the event that we find there are still problems that need fixing, to document the current state of work.

The first step is re-running the code from the earlier blogpost with ITK v5.3rc03 or above and seeing whether that works or not. Then we write up whatever we find.

Here are the comments specifically discussing what should be included in a followup blogpost:

Links:

What syntax do I need for links in a table of contents?

Clicking the links in the table of contents in this blogpost doesn't take you to the corresponding section of the post, but it does work if you do it here.

Did I get the syntax for this wrong? Or is jekyll somehow losing these markers when it renders things?

This is what I used as my markdown syntax:

## Contents
* [Background](#Background)
* [What we learned](#What-we-learned)
    * [From Dask users](#From-Dask-users)
    * [From other software libraries](#From-other-software-libraries)
* [Opportunities we see](#Opportunities-we-see)
* [Strategic plan](#strategic-plan)
* [Limitations](#Limitations)
* [Methods](#Methods)
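Most likely it's the syntax: kramdown (Jekyll's default markdown engine on GitHub Pages) auto-generates heading IDs that are lowercased and hyphenated, so capitalized fragments like `#Background` won't match anything (note that `#strategic-plan`, the one already-lowercase entry, is the pattern to follow). A corrected version would presumably be:

```markdown
## Contents
* [Background](#background)
* [What we learned](#what-we-learned)
    * [From Dask users](#from-dask-users)
    * [From other software libraries](#from-other-software-libraries)
* [Opportunities we see](#opportunities-we-see)
* [Strategic plan](#strategic-plan)
* [Limitations](#limitations)
* [Methods](#methods)
```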

Blogpost on cuML and Dask hyperparameter optimization

It would be useful to have a blogpost that shows using cuML and Dask-ML together for hyperparameter optimization. I imagine the gist would be something like the following:

  • cuML is pretty fast on a GPU (small comparison)
  • But we have many GPUs
  • We'd like to do hyperparameter optimization, and use all of our GPUs
  • We start with scikit-learn, find a few API gaps, and fix those (links to PRs)
  • Now things work with scikit-learn and cuML, but remember, we wanted to use multiple GPUs
  • Fortunately Dask-ML works the same way, so we swap things out, and things seem to work (unless something else breaks, like serialization)
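The swap in that last step might look like the following pseudocode sketch. It is not runnable here (it needs GPUs, RAPIDS, and dask-ml installed), and class/parameter names should be checked against the current cuML and Dask-ML docs:

```python
# Pseudocode sketch -- requires GPUs, RAPIDS (cuml), and dask-ml.
from dask_ml.model_selection import RandomizedSearchCV
from cuml.linear_model import LogisticRegression

model = LogisticRegression()           # cuML estimator, scikit-learn-style API
params = {"C": [0.1, 1.0, 10.0]}

# Dask-ML parallelizes the search across the cluster, one fit per task,
# so each fit can land on a different GPU worker.
search = RandomizedSearchCV(model, params, n_iter=3)
search.fit(X_train, y_train)           # X_train/y_train: hypothetical data
print(search.best_params_)
```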

cc @quasiben
