archer2-hpc / archer2-docs
Repository for ARCHER2 documentation
Home Page: https://docs.archer2.ac.uk
License: Other
Show how multiple srun commands can be used in job scripts, including to place multiple calculations on a single node.
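A minimal sketch of what such an example might look like, assuming a 128-core node and a hypothetical executable `./my_calc` with input files named `input_1.dat`, `input_2.dat`, and so on. The function name `run_packed` and the `--mem=12G` per-step limit are illustrative assumptions, and the right `srun` options should be confirmed on the system itself. In a real job script the function would follow the usual `#SBATCH` header lines.

```shell
#!/bin/bash
# Start n_calcs independent calculations on one node, each using tasks_each
# MPI tasks, then block until every one of them has finished.
run_packed() {
    local n_calcs=$1 tasks_each=$2 i
    for i in $(seq 1 "$n_calcs"); do
        # --exact restricts the step to the resources it requests so that
        # several steps can share the node; --mem is a hypothetical per-step
        # memory limit. The trailing & runs each step in the background.
        srun --nodes=1 --ntasks="$tasks_each" --exact --mem=12G \
            ./my_calc "input_${i}.dat" > "calc_${i}.out" 2>&1 &
    done
    # Without wait the job script would exit at once and the background
    # steps would be killed with it.
    wait
}

# Example: 4 calculations x 32 tasks each on one 128-core node
#   run_packed 4 32
```

Wrapping the loop in a function keeps the backgrounding and the final `wait` together, which is the part most easily forgotten when writing such scripts by hand.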
There is currently a duplication of material in archer-migration/data-migration and user-guide/data-migration. This needs to be rationalised.
We need a section on getting the most out of MPI: both generic (e.g. top ten tips for MPI, pointing to further documentation) and specific to ARCHER2 and Slingshot (will need at least the TDS for this). It should also cover what functionality is and is not available in Cray MPI, and any limits that users should know about (maximum tag counts, eager message defaults, etc.).
Most of the template content in the section should be good but needs to be reviewed and possibly expanded.
This section is going to be difficult until we see what is available via the collaboration platform.
Could include information on using the cray-R environment here.
It would be useful to show the commands for extracting memory use information from Slurm in the profiling or tuning chapters.
For example, to get current memory use of a running job:
sstat --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345
Or, to get memory use of a completed job:
sacct --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345
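As a rough illustration of how that output could be post-processed, the sketch below picks out the job step with the largest MaxRSS and reports it in MiB. The function name `peak_maxrss` is hypothetical, and the sketch assumes MaxRSS values carry a K, M or G suffix (or are a bare byte count), which should be checked against real `sacct` output on the system.

```shell
#!/bin/bash
# Read "JobID|MaxRSS" lines on stdin (the layout produced by
# sacct --parsable2 --noheader --format=JobID,MaxRSS) and print the step
# with the largest MaxRSS, converted to MiB.
peak_maxrss() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                      # numeric part; suffix is ignored
            unit  = substr($2, length($2), 1)   # last character of the field
            if      (unit == "K") mib = value / 1024
            else if (unit == "M") mib = value
            else if (unit == "G") mib = value * 1024
            else                  mib = value / (1024 * 1024)  # assume bytes
            if (mib > peak) { peak = mib; step = $1 }
        }
        END { if (step != "") printf "%s %.1f MiB\n", step, peak }
    '
}

# On the system (12345 is the placeholder job ID used above):
#   sacct --parsable2 --noheader --format=JobID,MaxRSS -j 12345 | peak_maxrss
```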
Based on user feedback, I need to:
(Please feel free to assign this issue to me)
Should cover:
ARCHER is no more so some of this material is no longer relevant. Some of the information may still be of use so should be moved to other sections as required.
I intend to add subdirectories with scraped template content under research-software.
The following are relevant, with the most recent existing source:
Cirrus -> CASTEP
ARCHER -> Code_Saturne
ARCHER -> PyChemShell/ChemShell
Cirrus -> CP2K
ARCHER -> ELK
ARCHER -> FEniCS
Cirrus -> GROMACS
Cirrus -> LAMMPS
New!!! -> Met Office Unified Model
New!!! -> MITgcm
Cirrus -> NAMD
New!!! -> Nektar++
New!!! -> NEMO
ARCHER -> NWChem
ARCHER -> ONETEP
Cirrus -> OpenFOAM
Cirrus -> Quantum Espresso
Cirrus -> VASP
Add information to docs on:
The essential skills section needs to be updated to point to useful material such as Software Carpentry shell-novice.
Template material has been copied across but needs to be modified to match the Cray environment.
Review of ARCHER2 Docs and identifying issues with move to main system (7/JUL/21)
Quickstart for developers #326 Kevin
All known issues need review - https://docs.archer2.ac.uk/known-issues/index.html (GEORGE)
Add information on using High Memory nodes - https://docs.archer2.ac.uk/faq/index.html#oom-error-on-archer2 (MICHAEL B)
Check recommended IO file setting - https://docs.archer2.ac.uk/user-guide/io/ (DAVID H)
Check containers instructions stay the same https://docs.archer2.ac.uk/user-guide/containers/ (MICHAEL D)
Check python commands work - https://docs.archer2.ac.uk/user-guide/python/ (JULIEN S)
Add note on data visualisation nodes (DVN) - https://docs.archer2.ac.uk/user-guide/analysis/ (MICHAEL B)
Check apprentice2 instructions work + archer2jobload - https://docs.archer2.ac.uk/user-guide/profile/#cray-apprentice2 (WILLIAM L)
Check version numbers - https://docs.archer2.ac.uk/user-guide/tuning/ (CODE CONTACTS)
Most will need an update including to check version numbers #283 #284 - https://docs.archer2.ac.uk/software-libraries [KEVIN S]
Most will need an update including to check version numbers - https://docs.archer2.ac.uk/data-tools/index.html (JULIEN S)
Get Crystal developers to check relevant documentation - https://docs.archer2.ac.uk/other-software/crystal/ (ANDY T, TBC)
https://docs.archer2.ac.uk/faq/index.html#archer-work-data
Add year (2021) to date that ARCHER /work was decommissioned - DONE (CB)
https://docs.archer2.ac.uk/user-guide/connecting/#logging-in
Order of password and ssh key passphrase being reversed - DONE (ART)
https://docs.archer2.ac.uk/user-guide/data#work-file-systems - DONE (ART)
Update size to full /work
https://docs.archer2.ac.uk/user-guide/sw-environment - DONE (ART)
@aturner-epcc to look at this. See: #301
https://docs.archer2.ac.uk/user-guide/scheduler/#quality-of-service-qos - DONE (ART)
https://docs.archer2.ac.uk/user-guide/scheduler/#using-modules-in-the-batch-system-the-epcc-job-env-module
Need to review whether epcc-job-env-module will continue
This may break every user submit script if changed! - DONE (ART)
@aturner-epcc to look at this
https://docs.archer2.ac.uk/user-guide/scheduler/#bolt-job-submission-script-creation-tool
*** Julien check if bolt works - DONE (ART)
https://docs.archer2.ac.uk/user-guide/dev-environment/
@aturner-epcc to look at this #302 - DONE (ART)
Document that email notifications are disabled on Slurm. Several queries related to this matter have already been handled on the ARCHER2 Service Desk.
Complete the following section with further guidance about reservations:
https://docs.archer2.ac.uk/user-guide/scheduler/#reservations
The epcc-job-env module makes sure that there is a default PrgEnv restored (unless users have modified the SBATCH_EXPORT environment variable). All example scripts should ensure that it is used in the correct place.
We should also add a section in the Scheduler chapter covering the module and what it does. Noting that it must be the first module loaded in a script if it is used.
At the moment we only cover connecting to ARCHER2 from Windows using MobaXterm:
https://docs.archer2.ac.uk/user-guide/connecting/#logging-in-from-windows-using-mobaxterm
Now that Windows PowerShell supports the SSH command line more consistently, we should update the docs to cover connecting from Windows using that mechanism too.
As Slurm MPMD is not yet working correctly we should document the current workaround in the Scheduler chapter.
The Containers section does not currently have information on how to create and use containers with MPI on ARCHER2. This information does exist, see:
https://epcced.github.io/2020-12-08-Containers-Online/12-singularity-mpi/index.html
and the example ARCHER2 job submission script at:
The options currently specified in the User Guide for hybrid MPI/OpenMP lead to multiple threads being placed on the same core. We need to investigate and find the correct options in Slurm to get the placement working.
We need a section on getting the most out of OpenMP: both generic (e.g. top ten tips for OpenMP, pointing to further documentation) and specific to ARCHER2 and AMD EPYC Zen2 (will need at least the TDS for this). It should also cover what functionality is and is not available in the various PrgEnv.
Before we can update the Python chapter, we need to decide on the approach to Python on ARCHER2. Initial proposal is:
Need to create the basic documentation for using gdb4hpc. Could base on docs at:
Could then be improved once we have experience on the system.
We need to look at creating some content here. Some initial pages could be:
New data created in subgroup directories has the correct ownership due to the setgid bit, but data copied or moved from elsewhere on /work (e.g. main project directories) keeps its current ownership (and has the setgid bit set, so new data within those directories also has the original ownership). We should document this issue and the use of the chown command to fix ownership, as it does trip users up.
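A minimal sketch of the kind of fix that could be documented, assuming the problem is the group ownership of the copied data. The function name `fix_group` and the group and path in the usage comment are hypothetical. Users can normally only change the group of files they own, and only to a group they belong to.

```shell
#!/bin/bash
# Set the group of everything under a directory to the given (sub)group,
# then report anything that is still in a different group.
fix_group() {
    local group=$1 dir=$2
    # A leading colon tells chown to change only the group, not the owner.
    chown -R ":${group}" "$dir"
    # Prints nothing if every file and directory now has the requested group.
    find "$dir" ! -group "$group"
}

# Example (hypothetical subgroup and path):
#   fix_group t01-subgroup /path/to/subgroup/copied_data
```

Because the copied directories keep their setgid bit, data created inside them after the group has been changed should inherit the corrected group.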
The ARCHER2 Quickstart for package users needs to be added. Overview of the content required can be found at:
The ARCHER2 Quickstart for developers needs to be added. Overview of the content required can be found at:
We should add instructions on using the collections in /etc/cray-pe.d to set up programming environments on compute nodes.
Complete initial version of scheduler chapter for initial beta release
The current documentation has an example MPI+OpenMP script but no documentation describing the background of how to run these jobs, more advanced placement information and a description of the best layout to match onto the ARCHER2 NUMA structure. This should be added in the Scheduler chapter. Point to the Tuning chapter for more advanced information on OpenMP.
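The arithmetic behind such a layout can be sketched as follows. The function name `hybrid_layout` is hypothetical, and the figures in the usage comment (128 cores per node, 16 cores per NUMA region) are assumptions to be checked against the hardware documentation. The sketch only derives a consistent set of options; it does not by itself guarantee correct thread placement.

```shell
#!/bin/bash
# Given the cores per node and the OpenMP threads wanted per MPI task,
# print matching Slurm and OpenMP settings (one core per thread).
hybrid_layout() {
    local cores_per_node=$1 threads_per_task=$2
    if (( cores_per_node % threads_per_task != 0 )); then
        echo "threads per task must divide cores per node" >&2
        return 1
    fi
    echo "#SBATCH --ntasks-per-node=$(( cores_per_node / threads_per_task ))"
    echo "#SBATCH --cpus-per-task=${threads_per_task}"
    echo "export OMP_NUM_THREADS=${threads_per_task}"
    echo "export OMP_PLACES=cores"
}

# Example: one MPI task per 16-core NUMA region on a 128-core node
#   hybrid_layout 128 16
```

Choosing a thread count equal to, or a divisor of, the NUMA region size keeps each task's threads within one region, which is the layout the Scheduler chapter would need to explain.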
New
Deferred
Other
cray-ga or remove
Early access users identified that particular run options and configuration are required to get good performance using NAMD on ARCHER2. The NAMD page needs to be updated to include this information.
The following document:
https://docs.archer2.ac.uk/user-guide/debug/#stat
does not reflect the following issue:
https://docs.archer2.ac.uk/known-issues/#stat-view-not-working
It'd be useful to link the issue from the STAT section.
It would be useful to have a new section in the User and Best Practice Guide that covers the ARCHER2 hardware and architecture in more detail. This could go after Overview but before Connecting. This should cover:
In the NERSC scheduler best practice, they use here documents to potentially reduce load on compute nodes and make jobs more efficient. See:
Where they describe creating a script such as:
#!/bin/bash -l
# Submit this script as: "./prepare-env.sh" instead of "sbatch prepare-env.sh"
# Prepare user env needed for Slurm batch job
# such as module load, setup runtime environment variables, or copy input files, etc.
# Basically, these are the commands you usually run ahead of the srun command
module load cray-netcdf
export OMP_NUM_THREADS=4
# Generate the Slurm batch script below with the here document,
# then when sbatch the script later, the user env set up above will run on the login node
# instead of on a head compute node (if included in the Slurm batch script),
# and inherited into the batch job.
cat << EOF > prepare-env.sl
#!/bin/bash
#SBATCH -t 30:00
#SBATCH -N 8
#SBATCH -q debug
#SBATCH -C haswell
srun -n 16 -c 32 --cpu_bind=cores ./myapp.exe
# Other commands needed after srun, such as copy your output files,
# should still be included in the Slurm script.
cp <my_output_file> <target_location>/.
EOF
# Now submit the batch job
sbatch prepare-env.sl
@kevinstratford commented
Not sure I like that here document business; if the preparatory work is really
significant, it could be a separate job with the main job as dependency. This
prevents conflating scripts (does prepare-env.sh here document overwrite
the submitted prepare-env.sh??)
What do people think, should we include this advice or not?
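For comparison, the alternative raised in the comment above (a separate preparatory job with the main job as a dependency) might be sketched like this. The function name `submit_chain` and the script names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Submit a preparation job, then a main job that starts only if the
# preparation job finishes successfully.
submit_chain() {
    local prep_script=$1 main_script=$2 prep_id
    # --parsable makes sbatch print just the job ID (optionally followed
    # by ";cluster"), which is easier to capture than the default message.
    prep_id=$(sbatch --parsable "$prep_script") || return 1
    prep_id=${prep_id%%;*}    # drop any ";cluster" suffix
    # afterok: the main job runs only once the preparation job has
    # completed with an exit code of zero.
    sbatch --dependency="afterok:${prep_id}" "$main_script"
}

# Example (hypothetical script names):
#   submit_chain prepare-env.slurm main-job.slurm
```

This keeps the two scripts separate, so neither overwrites the other, at the cost of the preparatory work queueing as its own job.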
As part of an eCSE, we are developing a new build options file. We would like to link to a preliminary version of the new file.
(Please feel free to assign this issue to me.)