argonne-lcf / gettingstarted
Collection of small examples for running on ALCF resources
The sample job scripts reference workq, but this is not a queue regular users seem to have access to. The docs instead refer to debug and prod, so perhaps the sample scripts here should be updated?
All Polaris examples need to be updated to include the filesystems directive: #PBS -l filesystems=home:grand:eagle
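One way to find the scripts that still need this update is to scan them for the directive. The following is a minimal sketch, not part of the repository: the function names and the `*.sh` search pattern are illustrative assumptions, and the regular expression only recognizes the directive when it appears on a single `#PBS -l` line.

```python
import re
from pathlib import Path
from typing import List

# Matches a PBS directive line that requests filesystems, for example
# "#PBS -l filesystems=home:grand:eagle".
_FILESYSTEMS_DIRECTIVE = re.compile(
    r"^[ \t]*#PBS[ \t]+-l[ \t]+.*\bfilesystems=\S+", re.MULTILINE
)


def has_filesystems_directive(script_text: str) -> bool:
    """Return True if the job script text contains a filesystems request."""
    return _FILESYSTEMS_DIRECTIVE.search(script_text) is not None


def scripts_missing_directive(root: str) -> List[Path]:
    """List shell scripts under `root` that lack the directive.

    The "*.sh" pattern is an assumption about how the job scripts are named.
    """
    return [
        path
        for path in sorted(Path(root).rglob("*.sh"))
        if not has_filesystems_directive(path.read_text(errors="replace"))
    ]
```

Calling `scripts_missing_directive` on a local checkout of the Polaris examples directory would return the scripts still to be edited.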
In the GPU example (https://github.com/argonne-lcf/GettingStarted/blob/master/Examples/Polaris/affinity_gpu/submit.sh), the script sets:
NRANKS_PER_NODE=8
but the example in the online docs sets it to 4 (https://www.alcf.anl.gov/support/user-guides/polaris/queueing-and-running-jobs/example-job-scripts/index.html).
Isn't 4 the correct number, since there are 4 GPUs per node?
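To make the difference between the two settings concrete, here is a small sketch of how many ranks end up on each GPU. It assumes a simple round-robin mapping (local rank modulo GPU count); the repository's affinity script may order GPUs differently, so the function name and the mapping are illustrative only.

```python
from collections import Counter


def ranks_per_gpu(nranks_per_node: int, ngpus_per_node: int) -> Counter:
    """Count how many local ranks land on each GPU index.

    Assumes round-robin assignment: local rank r uses GPU r % ngpus_per_node.
    """
    return Counter(
        local_rank % ngpus_per_node for local_rank in range(nranks_per_node)
    )
```

Under this assumption, `ranks_per_gpu(4, 4)` places one rank on each GPU, while `ranks_per_gpu(8, 4)` places two ranks on each GPU, so 8 ranks per node oversubscribes the 4 GPUs rather than failing outright.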
Hello,
I am having issues using DeepSpeed (stage 2) in a 2-node configuration with 8 A100 GPUs. I followed https://github.com/argonne-lcf/GettingStarted/tree/master/DataScience/DeepSpeed, but I am using PyTorch Lightning to drive DeepSpeed instead (https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/deepspeed.html).
I have no problems training a model on 4 GPUs on 1 node using DeepSpeed (stage=2) with PyTorch Lightning; however, when I use 2 nodes with 8 GPUs in total, the 2nd node appears to stall, and the code freezes at the final GPU rank on the first node.
I tried to see whether this was reproducible with boring_model.py, which is PyTorch Lightning's minimal script for reproducing errors, and I ended up with similar issues.
Here is the boring_model.py: https://gist.github.com/PraljakReps/d699f5d16af00e35cf4c8b8abfb09b6c
Using the trainer found in the above python script, I tried three configurations.
Depending on your PyTorch Lightning version (see below for my virtual env.), the trainer should look like this to reproduce my errors.
trainer = pl.Trainer(
    gpus=4,
    max_epochs=1,
    num_nodes=2,
    precision=16,
    strategy="deepspeed_stage_2",
    callbacks=[lr_monitor]
)
I also followed this link when running the mpiexec command: https://docs.alcf.anl.gov/polaris/data-science-workflows/frameworks/deepspeed/#:~:text=DeepSpeed.%20The%20base%20conda%20environment,cloning%20the%20base%20environment%20can
Thus, the code that I ran is the following:
NHOSTS=$(wc -l < "${PBS_NODEFILE}")
NGPU_PER_HOST=$(nvidia-smi -L | wc -l)
NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"
mpiexec \
    --verbose \
    --envall \
    -n "${NGPUS}" \
    --ppn "${NGPU_PER_HOST}" \
    --hostfile="${PBS_NODEFILE}" \
    python \
    boring_model.py
and I am still getting the issue where the second node hangs...
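For reference, the launch arithmetic in the commands above can be sketched in Python. This is only an illustration of what the shell variables compute; the function name and the host names in the test are hypothetical, and the sketch assumes the node file lists one host per line.

```python
def total_ranks(nodefile_text: str, ngpu_per_host: int) -> int:
    """Mirror NGPUS = NHOSTS * NGPU_PER_HOST from the shell snippet.

    NHOSTS is the number of non-empty lines in the node file, which matches
    `wc -l` when every host sits on its own newline-terminated line.
    """
    nhosts = sum(1 for line in nodefile_text.splitlines() if line.strip())
    return nhosts * ngpu_per_host
```

With two hosts and 4 GPUs each this gives 8, so mpiexec is asked for 8 processes at 4 per node, which matches the 8 processes described by gpus=4 and num_nodes=2 in the trainer.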
Note: I am entering a compute node interactively with the following command:
qsub -I -l select=2:ngpus=4 -l filesystems=home:eagle -l walltime=1:00:00 -q debug -A <account>
Is there a way to run DeepSpeed over multiple nodes on Polaris, using boring_model.py with PyTorch Lightning as a test case? Of course, my main goal is to conduct multi-node training for my research project, but I think successfully running boring_model.py with pytorch-lightning + deepspeed is an easy first case.
My virtual environment:
pytorch-lightning==1.9.5
torch==2.0.1
torchmetrics==1.2.0
lightning-bolts==0.7.0
deepspeed==0.12.5
python==3.8.18
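When debugging a hang like this, it may help to first confirm that all 8 processes actually start and to see which rank-related environment variables each one receives. The sketch below is a diagnostic aid only, not a fix, and nothing in it is confirmed by this issue: the PMI_* names are assumptions about what the launcher exports, and RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are the names conventionally used by torch.distributed and torchrun.

```python
import os
import socket
from typing import Mapping

# Variables worth inspecting on each rank. The PMI_* names are assumed to be
# set by the MPI launcher; the others follow torch.distributed conventions.
RANK_VARS = (
    "PMI_RANK",
    "PMI_LOCAL_RANK",
    "PMI_SIZE",
    "RANK",
    "LOCAL_RANK",
    "WORLD_SIZE",
    "MASTER_ADDR",
    "MASTER_PORT",
    "CUDA_VISIBLE_DEVICES",
)


def rank_report(env: Mapping[str, str], hostname: str) -> str:
    """Format one line listing the rank-related variables seen by a process."""
    fields = [f"{name}={env.get(name, '<unset>')}" for name in RANK_VARS]
    return f"{hostname}: " + " ".join(fields)


if __name__ == "__main__":
    # Each launched process prints its own view of the environment.
    print(rank_report(os.environ, socket.gethostname()), flush=True)
```

Saving this under a hypothetical name such as rank_report.py and launching it with the same mpiexec command in place of boring_model.py should print one line per process, which shows whether both nodes are reached and which variables are unset.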
Per discussion with the Software Committee, since there is no longer an llvm-sycl module, the oneapi module should be loaded in its place. As such, the instructions on this page need to reflect this change:
https://github.com/argonne-lcf/GettingStarted/tree/master/Examples/Polaris/affinity