df-parallel

This repo demonstrates how to set up CONDA environments for popular dataframe libraries and how to use them to process large tabular data files.

It compares parallel and out-of-core reading and processing of large datasets on CPU and GPU. Out-of-core processing handles data that are too large to fit into the computer's memory.

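| Dataframe Library | Parallel | Out-of-core | CPU/GPU | Evaluation |
|-------------------|----------|-------------|---------|------------|
| Pandas            | no       | no [1]      | CPU     | eager      |
| Dask              | yes      | yes         | CPU     | lazy       |
| Spark             | yes      | yes         | CPU     | lazy       |
| cuDF              | yes      | no          | GPU     | eager      |
| Dask-cuDF         | yes      | yes         | GPU     | lazy       |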

[1] Pandas can read data in chunks, but they have to be processed independently.
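
Footnote [1] can be illustrated with a short sketch. It assumes a tiny, made-up tab-separated file (the column names `category` and `value` are hypothetical and not taken from gene_info.tsv) and uses the `chunksize` argument of `pandas.read_csv`. Each chunk is summarized on its own, and the small per-chunk summaries are combined at the end. This only works for operations that can be split this way, such as sums and counts.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy file standing in for a large tab-separated dataset.
lines = ["category\tvalue"] + [f"{'a' if i % 2 == 0 else 'b'}\t{i}" for i in range(10)]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("\n".join(lines) + "\n")
    path = f.name

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into one DataFrame.
partial_sums = []
for chunk in pd.read_csv(path, sep="\t", chunksize=4):
    # Each chunk is processed independently; only its small summary is kept.
    partial_sums.append(chunk.groupby("category")["value"].sum())

# Combine the per-chunk summaries into the final result.
totals = pd.concat(partial_sums).groupby(level=0).sum()
os.remove(path)

print(totals.to_dict())
```

Operations that need to see all rows at once, such as a global sort or a join between chunks, do not fit this pattern; that is where the out-of-core libraries in the table come in.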

Running Jupyter Lab locally (CPU only)


Prerequisites: Miniconda3 (lightweight, preferred) or Anaconda3, plus Mamba

  • Install Miniconda3
  • Install Mamba: conda install mamba -n base -c conda-forge

  1. Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git
  2. Create CONDA environment
mamba env create -f df-parallel/environment.yml
  3. Activate the CONDA environment
conda activate df-parallel
  4. Launch Jupyter Lab
jupyter lab
  5. Deactivate the CONDA environment
conda deactivate

To remove the CONDA environment, run conda env remove -n df-parallel
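
After activating the environment, a quick check like the following sketch can confirm which dataframe libraries are importable. The module names are the usual import names for these libraries; whether each one is installed depends on the contents of environment.yml, which is an assumption here. `cudf` and `dask_cudf` are only expected in the GPU environment.

```python
import importlib.util

# Usual import names of the libraries compared in this repository.
# cudf and dask_cudf are only expected in the GPU environment.
LIBRARIES = ["pandas", "dask", "pyspark", "cudf", "dask_cudf"]


def available_libraries(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in available_libraries(LIBRARIES).items():
    print(f"{name:10s} {'found' if found else 'missing'}")
```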


Running Jupyter Lab on SDSC Expanse

To launch Jupyter Lab on Expanse, use the galyleo script. Specify your ACCESS account number with the --account option. If you do not have an ACCESS account and allocation on Expanse, you can apply through NSF’s ACCESS program or, for a trial allocation, contact [email protected].

  1. Clone this git repository
git clone https://github.com/sbl-sdsc/df-parallel.git

2a. Run on CPU (Pandas, Dask, and Spark dataframes):

galyleo launch --account <account_number> --partition shared --cpus 10 --memory 20 --time-limit 00:30:00 --conda-env df-parallel --conda-yml "${HOME}/df-parallel/environment.yml" --mamba

2b. Run on GPU (required for cuDF and Dask-cuDF dataframes):

galyleo launch --account <account_number> --partition gpu-shared --cpus 10 --memory 92 --gpus 1 --time-limit 00:30:00 --conda-env df-parallel-gpu --conda-yml "${HOME}/df-parallel/environment-gpu.yml" --mamba
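
The two launch commands differ only in partition, memory, the GPU count, and the environment name and file. The sketch below assembles either variant from the options shown above, without running anything. The function name `galyleo_command` and the account value `abc123` are hypothetical, and the repository is assumed to be cloned into the home directory.

```python
import shlex
from pathlib import Path


def galyleo_command(account, gpu=False, time_limit="00:30:00"):
    """Build the galyleo launch command from this README as an argument list."""
    repo = Path.home() / "df-parallel"  # assumes the clone lives in the home directory
    args = [
        "galyleo", "launch",
        "--account", account,
        "--partition", "gpu-shared" if gpu else "shared",
        "--cpus", "10",
        "--memory", "92" if gpu else "20",
    ]
    if gpu:
        args += ["--gpus", "1"]
    args += [
        "--time-limit", time_limit,
        "--conda-env", "df-parallel-gpu" if gpu else "df-parallel",
        "--conda-yml", str(repo / ("environment-gpu.yml" if gpu else "environment.yml")),
        "--mamba",
    ]
    return args


# Print the CPU variant as a shell-quoted string; "abc123" is a placeholder account.
print(shlex.join(galyleo_command("abc123")))
```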

Running the example notebooks

After Jupyter Lab has been launched, run the notebook 1-DownloadData.ipynb to create a dataset. In this notebook, specify the number of copies (ncopies) to be made from the original dataset to increase its size. By default, a single copy is created. After the dataset has been created, run the dataframe-specific notebooks. Note that the cuDF and Dask-cuDF dataframe libraries require a GPU.
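
One way to enlarge a delimited file by a factor of ncopies is to write the header once and append the data rows repeatedly. The sketch below shows that idea on a tiny, made-up file. It is not taken from 1-DownloadData.ipynb, whose actual approach may differ; the function name `replicate_file` is hypothetical, and the source file is assumed to end with a newline.

```python
import os
import shutil
import tempfile


def replicate_file(src, dst, ncopies=1):
    """Write the header of src once to dst, followed by ncopies of its data rows."""
    with open(dst, "wb") as fout:
        for i in range(ncopies):
            with open(src, "rb") as fin:
                header = fin.readline()
                if i == 0:
                    fout.write(header)
                # Stream the remaining rows so the file never has to fit in memory.
                shutil.copyfileobj(fin, fout)


# Demonstration on a small temporary file with one header line and two rows.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "small.tsv")
    dst = os.path.join(tmp, "large.tsv")
    with open(src, "w", encoding="utf-8") as f:
        f.write("id\tname\n1\tx\n2\ty\n")
    replicate_file(src, dst, ncopies=3)
    with open(dst, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(len(lines))  # 1 header line + 3 copies of 2 data rows = 7
```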

Test results (not representative)

Results were obtained on an SDSC Expanse GPU node with 10 CPU cores (Intel Xeon Gold 6248 2.5 GHz), 1 GPU (NVIDIA V100 SXM2, 32 GB), 92 GB of memory (DDR4 DRAM), and local storage (1.6 TB Samsung PM1745b NVMe PCIe SSD).

Data file sizes (gene_info.tsv as of June 2022):

  • Dataset 1: 5.4 GB (18 GB in Pandas)
  • Dataset 2: 21.4 GB (4 x Dataset 1) (62.4 GB in Pandas)
  • Dataset 3: 43.7 GB (8 x Dataset 1)
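
| Dataframe Library | Time for 5.4 GB (s) | Time for 21.4 GB (s) | Time for 43.7 GB (s) | Parallel | Out-of-core | CPU/GPU |
|-------------------|---------------------|----------------------|----------------------|----------|-------------|---------|
| Pandas            | 56.3                | 222.4                | -- [2]               | no       | no          | CPU     |
| Dask              | 15.7                | 42.1                 | 121.8                | yes      | yes         | CPU     |
| Spark             | 14.2                | 31.2                 | 56.5                 | yes      | yes         | CPU     |
| cuDF              | 3.2                 | -- [2]               | -- [2]               | yes      | no          | GPU     |
| Dask-cuDF         | 7.3                 | 11.9                 | 19.0                 | yes      | yes         | GPU     |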

[2] out of memory
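
To make the timings easier to compare, the sketch below computes each library's speedup relative to Pandas for the same dataset. The numbers are copied from the table above; the helper name `speedup_vs_pandas` is made up for this example, and out-of-memory runs are represented as None.

```python
# Timings in seconds copied from the results table; None marks an out-of-memory run.
timings = {
    "Pandas":    [56.3, 222.4, None],
    "Dask":      [15.7, 42.1, 121.8],
    "Spark":     [14.2, 31.2, 56.5],
    "cuDF":      [3.2, None, None],
    "Dask-cuDF": [7.3, 11.9, 19.0],
}
sizes_gb = [5.4, 21.4, 43.7]


def speedup_vs_pandas(library, column):
    """Return Pandas time divided by the library's time, or None if either run failed."""
    base = timings["Pandas"][column]
    t = timings[library][column]
    if base is None or t is None:
        return None
    return base / t


for lib in timings:
    row = [speedup_vs_pandas(lib, i) for i in range(len(sizes_gb))]
    print(lib, ["n/a" if s is None else f"{s:.1f}x" for s in row])
```

For the 5.4 GB dataset this works out to roughly 3.6x for Dask and 17.6x for cuDF. No speedup can be computed for the 43.7 GB dataset because Pandas ran out of memory there. As the section title says, these results come from a single node configuration and are not representative.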

Contributors

pwrose
