Using PyTorch in NYU HPC

A quick reference to access NYU's High Performance Computing Prince Cluster.

The official wiki is here. This is an unofficial document, created as a quick-start guide for first-time users with a focus on Python and PyTorch.

Get an account

You need to be affiliated with NYU and have a sponsor.

To get an account approved, follow these steps.

Log in

Once you have been approved, you can access HPC from:

  1. Within the NYU network (on campus):

Remember to replace NYUNetID with your own NetID.

Once logged in, your home directory should be /home/NYUNetID, so running pwd should print:

[NYUNetID@log-0 ~]$ pwd
/home/NYUNetID
  2. From an off-campus location:

First, log in to your VPN and then log in to the bastion host:

Then log in to the cluster:

ssh prince.hpc.nyu.edu

Using Windows

I use the MobaXterm ssh client with the following settings for the Prince Cluster:

Remote host: prince.hpc.nyu.edu
Username: NYUNetID
Port: 22

This makes it one click to open a terminal to Prince.

File Systems

You can get access to three filesystems: /home, /scratch, and /archive.

Scratch is a file system mounted on Prince and connected to the compute nodes, where files can be uploaded faster. Note that its content gets flushed every 60 days, with no backup!

[NYUNetID@log-0 ~]$ cd /scratch/NYUNetID
[NYUNetID@log-0 ~]$ pwd
/scratch/NYUNetID

/home and /scratch are separate filesystems in separate places. Depending on how often you use your files, you might want to choose the appropriate file system. I use /home for the files I won't touch often.
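Because /scratch is flushed periodically, it can help to check which files have not been touched for a while. The following is a minimal sketch, not an official NYU tool. The function name find_stale_files, the use of modification time as the criterion, and the 60-day default (taken from the note above) are all assumptions made for illustration.

```python
import os
import time


def find_stale_files(root, max_age_days=60, now=None):
    """Return sorted paths under `root` not modified in the last `max_age_days` days.

    Assumption: modification time is used as a stand-in for "at risk of being
    flushed"; the cluster's real purge criterion may differ.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(stale)
```

For example, find_stale_files("/scratch/NYUNetID") would return the files under your scratch directory that are older than 60 days, which you may then want to copy to /home or /archive.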

Loading Modules

Environment modules (the module command) allow you to load and manage multiple versions and configurations of software packages.

To see available package environments:

module avail

To load a module:

module load [package name]

For example, if you want to use TensorFlow with GPU support:

module load cudnn/8.0v6.0
module load cuda/8.0.44
module load tensorflow/python3.6/1.3.0

To check what is currently loaded:

module list

To remove all packages:

module purge

To get helpful information about the package:

module show torch/gnu/20170504

This will print something like:

--------------------------------------------------------------------------------------------------------------------------------------------------
   /share/apps/modulefiles/torch/gnu/20170504.lua:
--------------------------------------------------------------------------------------------------------------------------------------------------
whatis("Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first")
whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
...

The load(...) lines are the dependencies that are also loaded when you load the package.
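If you want the dependency list in a script, the load(...) lines can be pulled out of the module show text with a regular expression. This is a small sketch that only parses text you have already captured. The function name parse_module_dependencies is hypothetical, and the sample string is copied from the output shown above.

```python
import re


def parse_module_dependencies(show_output):
    """Extract module names from load("...") lines in `module show` output."""
    return re.findall(r'^\s*load\("([^"]+)"\)', show_output, flags=re.MULTILINE)


# Sample text copied from the torch/gnu/20170504 output above.
sample = '''whatis("Name: torch version: 20170504 compilers: gnu")
load("cmake/intel/3.7.1")
load("cuda/8.0.44")
load("cudnn/8.0v5.1")
load("magma/intel/2.2.0")
'''

print(parse_module_dependencies(sample))
# ['cmake/intel/3.7.1', 'cuda/8.0.44', 'cudnn/8.0v5.1', 'magma/intel/2.2.0']
```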

Interactive Mode: Request CPU

You can submit batch jobs on Prince to schedule work, which requires writing custom bash scripts. Batch jobs are great for longer jobs. You can also run in interactive mode, which is great for short jobs and troubleshooting.

To run in interactive mode:

[NYUNetID@log-0 ~]$ srun --pty /bin/bash

This will run the default mode: a single CPU core and 2GB memory for 1 hour.

To request more CPUs:

[NYUNetID@log-0 ~]$ srun -n4 -t2:00:00 --mem=4000 --pty /bin/bash
[NYUNetID@c26-16 ~]$ 

That will request 4 tasks (by default one CPU core each) for 2 hours with 4000 MB (about 4 GB) of memory.

To exit a request:

[NYUNetID@c26-16 ~]$ exit
[NYUNetID@log-0 ~]$

Interactive Mode: Request GPU

[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash
[NYUNetID@gpu-25 ~]$ nvidia-smi
Mon Oct 23 17:49:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:12:00.0     Off |                    0 |
| N/A   37C    P8    29W / 149W |      0MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
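The interactive requests above differ only in the flags passed to srun. As a rough illustration of how the flags map to resources, here is a sketch of a helper that assembles the command. The function build_srun_command and its parameter names are hypothetical. It uses the long option names --ntasks and --time, which are the long forms of -n and -t.

```python
import shlex


def build_srun_command(ntasks=1, time_limit=None, mem_mb=None, gpus=0, shell="/bin/bash"):
    """Build an interactive srun command as a list of arguments.

    Options left as None are omitted, so the cluster defaults apply.
    """
    cmd = ["srun", f"--ntasks={ntasks}"]
    if time_limit is not None:
        cmd.append(f"--time={time_limit}")   # e.g. "2:00:00" for two hours
    if mem_mb is not None:
        cmd.append(f"--mem={mem_mb}")        # memory in megabytes
    if gpus > 0:
        cmd.append(f"--gres=gpu:{gpus}")     # number of GPUs
    cmd += ["--pty", shell]
    return cmd


cpu_request = build_srun_command(ntasks=4, time_limit="2:00:00", mem_mb=4000)
gpu_request = build_srun_command(gpus=1)
print(shlex.join(cpu_request))  # srun --ntasks=4 --time=2:00:00 --mem=4000 --pty /bin/bash
print(shlex.join(gpu_request))  # srun --ntasks=1 --gres=gpu:1 --pty /bin/bash
```

The helper only builds the command line. Paste the printed command into a login-node shell to start the session.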

Submit a job

You can write a script that will be executed when the resources you requested become available.

A simple CPU demo:

#!/bin/bash

## 1) Job settings

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=5:00:00
#SBATCH --mem=2GB
#SBATCH --job-name=CPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=NYUNetID@nyu.edu
#SBATCH --output=slurm_%j.out

## 2) Everything from here on is going to run:

cd /scratch/NYUNetID/demos
python demo.py

Request GPU:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
#SBATCH --time=10:00:00
#SBATCH --mem=3GB
#SBATCH --job-name=GPUDemo
#SBATCH --mail-type=END
#SBATCH --mail-user=NYUNetID@nyu.edu
#SBATCH --output=slurm_%j.out

cd /scratch/NYUNetID/trainSomething
source activate ML
python train.py

Submit your job with:

sbatch myscript.s
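On success, sbatch prints a line of the form "Submitted batch job <id>". If you submit from a Python script, you may want to capture that ID for later monitoring. The sketch below shows one way to do that. The function names parse_job_id and submit are hypothetical, and submit only works on a machine where sbatch is installed.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Return the job ID from sbatch's 'Submitted batch job <id>' line, or None."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def submit(script_path):
    """Run sbatch on `script_path` and return the new job's ID."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

For example, submit("myscript.s") would return the integer job ID that also appears in the slurm_<id>.out file name.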

Monitor the job:

squeue -u $USER
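To check job states from a script instead of by eye, squeue can print a machine-friendly format: -h suppresses the header and -o sets the columns (%i job ID, %T state, %j job name). The following is a sketch under those assumptions. The function names parse_squeue and my_jobs are hypothetical, and the sample text is made up for illustration.

```python
import subprocess


def parse_squeue(output):
    """Parse lines of 'jobid|state|name' into a list of dicts."""
    jobs = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # The name comes last so that a '|' inside a job name cannot shift fields.
        job_id, state, name = line.split("|", 2)
        jobs.append({"id": job_id, "state": state, "name": name})
    return jobs


def my_jobs(user):
    """Query squeue for one user's jobs (requires Slurm to be installed)."""
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%i|%T|%j"],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


# Made-up sample of what the formatted output could look like.
sample = "12345|RUNNING|GPUDemo\n12346|PENDING|CPUDemo\n"
print(parse_squeue(sample))
```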

More info here

Transfer Files

I transfer files using MobaXterm. If you need to set up a tunnel, look here.

PyTorch

Once you are all set up with the above, you need to do a couple of things to get PyTorch:

  1. Create a virtual environment
  2. Load the appropriate modules in the environment

Creating a Virtual Environment

mkdir -p /scratch/NYUNetID/tmp/pytorch-gpu
cd /scratch/NYUNetID/tmp/pytorch-gpu
module load python3/intel/3.6.3
virtualenv --system-site-packages py3.6.3
source py3.6.3/bin/activate

After the above, your virtual environment is set up. Now you need to get PyTorch.

Installing PyTorch

Note on 5/12/20: On Prince, the GPU driver does not support CUDA 10.2. If you are running PyTorch, please use a PyTorch build for CUDA 10.1.

pip3 install torch torchvision
pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Running short PyTorch scripts

Now every time you want to use your PyTorch environment, all you need to do is:

[NYUNetID@log-0 ~]$ source py3.6.3/bin/activate        # activate the Python environment
[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash   # interactive GPU environment on HPC

[NYUNetID@gpu-25 ~]$ cd /scratch/NYUNetID/trainSomething
[NYUNetID@gpu-25 ~]$ python train.py
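Before a long run, it is worth confirming that the script actually sees the GPU you requested. The snippet below is a minimal sketch of what the top of a train.py could look like. The function name pick_device is hypothetical, and the fallback to "cpu" when torch is not importable is a choice made here so the check also runs on a login node.

```python
import os


def pick_device():
    """Return "cuda" if PyTorch is importable and sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("device:", device)
# Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
```

If this prints "cpu" inside a GPU session, the usual suspects are a PyTorch build for an unsupported CUDA version or a virtual environment that was not activated.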

Running Jupyter Notebook

Instructions are here

  • Once you have copied your run-jupyter.sbatch file:
[NYUNetID@log-0 ~]$ source py3.6.3/bin/activate        # activate the Python environment
[NYUNetID@log-0 ~]$ sbatch run-jupyter-gpu.sbatch
[NYUNetID@log-0 ~]$ cat slurm-xxxx.out

In a separate window (Ubuntu shell), type:

ssh -L NNNN:localhost:NNNN netID@prince

Open a browser at localhost:NNNN, for example:

http://localhost:8925/?token=76f100825af441457502d5d080c1776b987a2f76101460f4
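The port for the tunnel and the token for the browser both come from the slurm output file. As a convenience, the sketch below extracts them and prints the matching ssh command. It assumes the log contains a URL of the same form as the example above. The function name tunnel_from_log is hypothetical, and the default login string netID@prince is copied from the ssh line in this guide.

```python
import re


def tunnel_from_log(log_text, login="netID@prince"):
    """Find a Jupyter URL in `log_text` and derive the ssh tunnel command.

    Returns None if no URL of the form http://localhost:<port>/?token=<hex> is found.
    """
    match = re.search(r"http://localhost:(\d+)/\?token=([0-9a-f]+)", log_text)
    if match is None:
        return None
    port = int(match.group(1))
    return {
        "port": port,
        "token": match.group(2),
        "ssh": f"ssh -L {port}:localhost:{port} {login}",
        "url": match.group(0),
    }


# Example URL taken from this guide.
example = "http://localhost:8925/?token=76f100825af441457502d5d080c1776b987a2f76101460f4"
info = tunnel_from_log(example)
print(info["ssh"])  # ssh -L 8925:localhost:8925 netID@prince
```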

Contributors

cvalenzuela, gussand, juniorxsound, coblezc