dask-cookbook's Introduction

Dask Cookbook

This Project Pythia Cookbook provides a comprehensive guide to understanding the basic concepts and collections of Dask as well as its integration with Xarray. Dask is a parallel computing library that allows you to scale your computations to multiple cores or even clusters, while Xarray is a library that enables working with labelled multi-dimensional arrays, with a focus on working with netCDF datasets.

Motivation

The motivation behind this repository is to provide a clear and concise resource for anyone looking to learn about the basic concepts of Dask and its integration with Xarray. By providing step-by-step tutorials, we hope to make it easy for users to understand the fundamental concepts of parallel computing and distributed data processing, as well as how to apply them in practice using Dask and Dask+Xarray.

Authors

Negin Sobhani, Brian Vanderwende, Deepak Cherian, and Ben Kirk

Note on Content Origin

This cookbook is derived from the material used in the NCAR tutorial "Using Dask on HPC systems", which was held in February 2023. The NCAR tutorial series also covers practical use cases and best practices for Dask on HPC systems in more depth. For the complete set of NCAR tutorial materials, please refer to the main NCAR tutorial content available here.

Structure

In the first chapter of this cookbook, we provide step-by-step tutorials on the basic concepts of Dask, including Dask arrays and Dask dataframes, which are powerful tools for parallel computing and distributed data processing. We explain the key differences between these Dask data structures and their counterparts in NumPy and Pandas.
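As a rough illustration of the difference between a NumPy array and a Dask array, here is a minimal sketch. It is not taken from the cookbook notebooks; the array shape and chunk size are arbitrary values chosen for the example, and it assumes `numpy` and `dask` are installed.

```python
import numpy as np
import dask.array as da

# NumPy evaluates eagerly: the full array lives in memory and
# the sum is computed as soon as the line runs.
np_arr = np.ones((1000, 1000))
np_total = np_arr.sum()

# Dask splits the same logical array into chunks (here 250 x 250,
# an arbitrary choice) and only records the work to be done.
d_arr = da.ones((1000, 1000), chunks=(250, 250))
lazy_total = d_arr.sum()  # still lazy: no chunk has been summed yet

# compute() runs the task graph, possibly in parallel, and returns
# a concrete NumPy value.
dask_total = lazy_total.compute()

print("blocks per dimension:", d_arr.numblocks)
print("NumPy total:", np_total, "Dask total:", dask_total)
```

The key point is that nothing is evaluated until `compute()` is called, which is what lets Dask schedule the per-chunk work across cores or a cluster.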

In the second chapter of this cookbook, we move on to more advanced topics, such as distributed computing and Dask+Xarray integration. We provide examples of how to use Dask+Xarray to work efficiently with large, labelled multi-dimensional datasets. Finally, we discuss some best practices for Dask+Xarray.
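To give a feel for what Dask+Xarray integration looks like, the sketch below builds a tiny labelled array in memory, chunks it with Dask, and computes a mean along one dimension. The variable name, dimension names, coordinate values, and chunk size are all made up for the example; real workflows would typically open netCDF files instead. It assumes `numpy`, `xarray`, and `dask` are installed.

```python
import numpy as np
import xarray as xr

# A small synthetic dataset (hypothetical values) with labelled dimensions.
temps = xr.DataArray(
    np.arange(24.0).reshape(4, 3, 2),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [100.0, 110.0],
    },
    name="temperature",
)

# chunk() converts the underlying NumPy data to a Dask array,
# split here into blocks of two time steps each.
chunked = temps.chunk({"time": 2})

# Operations on a chunked DataArray are lazy and keep their labels.
lazy_mean = chunked.mean(dim="time")

# compute() triggers the actual calculation and returns a
# NumPy-backed DataArray.
result = lazy_mean.compute()
print(result)
```

The code reads the same as it would for a NumPy-backed DataArray; the only additions are the `chunk()` call and the final `compute()`.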

Running the Notebooks

You can run the notebooks either on Binder or on your local machine.

Running on Binder

The simplest way to interact with a Jupyter Notebook is through Binder, which enables the execution of a Jupyter Book in the cloud. The details of how this works are not important for now. All you need to know is how to launch a Pythia Cookbook chapter via Binder. Simply move your mouse to the top right corner of the book chapter you are viewing, click on the rocket ship icon (see figure below), and be sure to select “launch Binder”. After a moment you should be presented with a notebook that you can interact with: you’ll be able to execute and even change the example programs. The code cells have no output at first, until you execute them by pressing Shift+Enter. Complete details on how to interact with a live Jupyter notebook are described in Getting Started with Jupyter.

Running on Your Own Machine

If you are interested in running this material locally on your computer, you will need to follow this workflow:

  1. Clone the https://github.com/ProjectPythia/dask-cookbook repository:

    git clone https://github.com/ProjectPythia/dask-cookbook.git

  2. Move into the dask-cookbook directory:

    cd dask-cookbook

  3. Create and activate your conda environment from the environment.yml file:

    conda env create -f environment.yml
    conda activate dask-cookbook

  4. Move into the notebooks directory and start up JupyterLab:

    cd notebooks/
    jupyter lab
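After activating the environment, a quick sanity check can confirm that the main libraries are importable before you open the notebooks. The snippet below is a minimal sketch using only the Python standard library. The helper name `missing_packages` and the package list are assumptions made for illustration; the authoritative list of dependencies is the one in environment.yml.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list for illustration; see environment.yml for
# the dependencies the cookbook actually requires.
required = ["dask", "xarray", "numpy"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

If anything is reported missing, the most likely cause is that the conda environment was not activated in the current shell.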

Acknowledgments

  • NCAR CISL/CSG Team
  • ESDS Initiative

dask-cookbook's People

Contributors

negin513, jukent, erogluorhan, jsignell, clyne

dask-cookbook's Issues

Use this cookbook for FOSS4G-NA Dask workshop?

I am going to be presenting a Dask workshop at FOSS4G-NA (Free and Open Source Software for Geospatial, North America) in late October. It’s a 3-hour slot and I am planning on covering the basic concepts of Dask and touching on the rest of the Pangeo stack. I would like to build on existing work and store the final version somewhere that is group-owned.

This cookbook seems really great and I'd love to build off of it! Would you be open to me opening some PRs? My plan would be:

  • Read through all the material and suggest updates
  • Potentially add a section on delayed/futures
  • Update some language to make it less NCAR-specific (#13)
  • Try to find a public option for how to run the tutorial without relying on Binder. EDIT: I just realized that Project Pythia has its own Binder instance, so maybe this is a non-issue

Just for context I also posted an issue about this on the Pangeo discourse

Write access?

I am planning on presenting this at FOSS4G-NA and in that process I am going to be doing some tweaks over the next few weeks and it would be great to have write access to this repo. @andersy005, @jukent, or @brian-rose is that something that you have the power to grant?

@negin513 is 👍 on this idea.
