The archer2-docs from archer2-hpc

Add information on multiple `srun` commands in batch scripts

Show how multiple srun commands can be used in job scripts - including to place multiple calculations on a single node.
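
A minimal sketch of the pattern, assuming one node shared by four independent calculations. The `launch` wrapper, the `OUTDIR` variable, the `echo` stand-in for a real executable and the `--exact`/`--mem` values are illustrative assumptions, not tested ARCHER2 settings. Outside a Slurm job the wrapper just runs the command, so the script can be dry-run.

```shell
#!/bin/bash
#SBATCH --job-name=multi_srun
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Hypothetical wrapper: use srun inside a Slurm job, run directly otherwise
# (the fallback exists only so the sketch can be dry-run off the system).
launch() {
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        # --exact asks Slurm to give each step only the resources it requests,
        # so several steps can share the node (assumes a Slurm version with --exact).
        srun --nodes=1 --ntasks=1 --exact --mem=1500M "$@"
    else
        "$@"
    fi
}

# Output directory: set OUTDIR to somewhere on /work in a real job;
# a temporary directory is used otherwise.
outdir=${OUTDIR:-$(mktemp -d)}

# Start four independent calculations in the background...
for i in 1 2 3 4; do
    # "echo" stands in for a real executable and its input.
    launch echo "calculation ${i}" > "${outdir}/calc_${i}.out" &
done

# ...and wait for all of them. Without "wait" the job script would exit
# as soon as the loop finished, ending the job while steps are still running.
wait
echo "Finished; output in ${outdir}"
```

The key points are the trailing `&` on each `srun` line and the final `wait`.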

Update "Using Python" section based on TDS access

Update "Quickstart for developers" section based on TDS access

archer-migration/data-migration

There is currently a duplication of material in

archer-migration/data-migration

and

user-guide/data-migration

This needs to be rationalised.

Performance tuning and best practice for MPI

We need a section on getting the most out of MPI: both generic (e.g. top ten tips for MPI, pointing to further documentation) and specific to ARCHER2 and Slingshot (will need at least the TDS for this). It should also cover what functionality is and is not available in CrayMPI, and any limits that users should know about (maximum tag counts, eager message defaults, etc.).

Add note on how to query CPU usage from Slurm on running jobs

Update "Containers" section based on TDS access

Containers section updated with known information on Singularity on ARCHER2

Most of the template content in the section should be good, but it needs to be reviewed and possibly expanded.

Initial version of data analysis section

This section is going to be difficult until we see what is available via the collaboration platform.

Could include information on using the cray-R environment here.

Add information on getting memory use data from Slurm

It would be useful to show the commands for extracting memory use information from Slurm in the profiling or tuning chapters.

For example, to get current memory use of a running job:

sstat --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345

Or, to get memory use of a completed job:

sacct --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345

Modify MITgcm documentation for clarity on ECCOv4-r4 process

Based on user feedback, I need to:

Mention that after using 'wget' to obtain the forcing data, the files need to be copied from their default directory
For clarity and redundancy, copy the compilation instructions into the ECCOv4-r4 case

(Please feel free to assign this issue to me)

Add information on shared directories and their use

Should cover:

Shared directories on /home and /work
Sharing with subgroup, project, others - different directory hierarchies and unix permissions
Impact on quotas
What happens to data in shared directories when user accounts are removed

Remove ARCHER to ARCHER2 part of docs

ARCHER is no more so some of this material is no longer relevant. Some of the information may still be of use so should be moved to other sections as required.

Add generic profiling information (CrayPat)

Add research-software templates

I intend to add subdirectories with scraped template content under

research-software

The following are relevant with most recent existing source

Cirrus -> CASTEP
ARCHER -> Code_Saturne
ARCHER -> PyChemShell/ChemShell
Cirrus -> CP2K
ARCHER -> ELK
ARCHER -> FEniCS
Cirrus -> GROMACS
Cirrus -> LAMMPS
New!!! -> Met Office Unified Model
New!!! -> MITgcm
Cirrus -> NAMD
New!!! -> Nektar++
New!!! -> NEMO
ARCHER -> NWChem
ARCHER -> ONETEP
Cirrus -> OpenFOAM
Cirrus -> Quantum Espresso
Cirrus -> VASP

Update "Software environment" section based on TDS access

Update "Debugging" section based on TDS access

Add information on resources on ARCHER2

Add information to docs on:

What a CU is and how it corresponds to time use on ARCHER2
How charging works: based on used time rather than requested time
You are charged for all the nodes assigned to the job even if you do not use them all. For example, if you request 4 nodes and only use 2, you are still charged for 4 nodes because they are not available to other users while assigned to your job
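
The charging rules above can be illustrated with a small calculation. This is only a sketch: the function name `charged_cus` is hypothetical, and the rate of 1 CU per node hour is an assumption to be checked against the ARCHER2 charging policy.

```shell
#!/bin/bash
# Assumed rate: 1 CU per node hour (check against the ARCHER2 charging policy).
CU_PER_NODE_HOUR=1

# charged_cus NODES_ALLOCATED ELAPSED_SECONDS
# The charge depends on the nodes allocated and the time actually used,
# not on the nodes the application used or the time requested.
charged_cus() {
    awk -v n="$1" -v s="$2" -v r="$CU_PER_NODE_HOUR" \
        'BEGIN { printf "%.2f\n", n * s / 3600 * r }'
}

# 4 nodes allocated (only 2 used by the application), 12 hours requested,
# job finished after 90 minutes: charged for 4 nodes x 1.5 hours.
charged_cus 4 5400
```

Under these assumptions the job costs 6 CU, whereas the requested 12 hours on 4 nodes would have been 48 CU.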

Populate Essential Skills section

The essential skills section needs to be updated to point to useful material such as Software Carpentry shell-novice.

Complete initial version of Application Developer Environment

Template material has been copied across; it needs to be modified to match the Cray environment.

List of Issues and Items for review prior to main system going live

Review of ARCHER2 Docs and identifying issues with move to main system (7/JUL/21)

Changes completed

https://docs.archer2.ac.uk/faq/index.html#archer-work-data
Add year (2021) to date that ARCHER /work was decommissioned - DONE (CB)

https://docs.archer2.ac.uk/user-guide/connecting/#logging-in
Order of password and ssh key passphrase being reversed - DONE (ART)

https://docs.archer2.ac.uk/user-guide/data#work-file-systems - DONE (ART)
Update size to full /work

https://docs.archer2.ac.uk/user-guide/sw-environment - DONE (ART)
@aturner-epcc to look at this. See: #301

https://docs.archer2.ac.uk/user-guide/scheduler/#quality-of-service-qos - DONE (ART)

https://docs.archer2.ac.uk/user-guide/scheduler/#using-modules-in-the-batch-system-the-epcc-job-env-module
Need to review whether epcc-job-env-module will continue
This may break every user submit script if changed! - DONE (ART)
@aturner-epcc to look at this

https://docs.archer2.ac.uk/user-guide/scheduler/#bolt-job-submission-script-creation-tool
*** Julien check if bolt works - DONE (ART)

https://docs.archer2.ac.uk/user-guide/dev-environment/
@aturner-epcc to look at this #302 - DONE (ART)

Add UAN fingerprint to connection section

Update "Quickstart for users" section based on TDS access

Slurm email notifications

Document that email notifications are disabled on Slurm. Several queries related to this matter have already been handled on the ARCHER2 Service Desk.

Add reservations info

Complete the following section with further guidance about reservations:
https://docs.archer2.ac.uk/user-guide/scheduler/#reservations

Update "Profiling" section based on TDS access

Make sure all job script examples have correct use of `epcc-job-env` module and document it

The epcc-job-env module makes sure that there is a default PrgEnv restored (unless users have modified the SBATCH_EXPORT environment variable). All example scripts should ensure that it is used in the correct place.

We should also add a section in the Scheduler chapter covering the module and what it does, noting that it must be the first module loaded in a script if it is used.

Add information on connecting from Windows using command line rather than point and click GUI

At the moment we only cover connecting to ARCHER2 from Windows using MobaXTerm:

https://docs.archer2.ac.uk/user-guide/connecting/#logging-in-from-windows-using-mobaxterm

Now that Windows PowerShell supports the SSH command line more consistently, we should update the docs to cover connecting from Windows using that mechanism too.

Add information on workaround for MPMD jobs

As Slurm MPMD is not yet working correctly we should document the current workaround in the Scheduler chapter.

Add information on creating and using Singularity containers with MPI

The Containers section does not currently have information on how to create and use containers with MPI on ARCHER2. This information does exist, see:

https://epcced.github.io/2020-12-08-Containers-Online/12-singularity-mpi/index.html

and the example ARCHER2 job submission script at:

https://github.com/EPCCed/2020-12-08-Containers-Online/blob/gh-pages/files/osu_latency.slurm.template

Options for hybrid MPI/OpenMP jobs lead to incorrect thread placement

The options currently specified in the User Guide for hybrid MPI/OpenMP lead to multiple threads being placed on the same core. We need to investigate and find the correct options in Slurm to get the placement working.

Performance tuning and best practice for OpenMP

We need a section on getting the most out of OpenMP: both generic (e.g. top ten tips for OpenMP, pointing to further documentation) and specific to ARCHER2 and AMD EPYC Zen2 (will need at least the TDS for this). It should also cover what functionality is and is not available in the various PrgEnv.

Update "Application development environment" section based on TDS access

Initial version of Python chapter

Before we can update the Python chapter, we need to decide on the approach to Python on ARCHER2. Initial proposal is:

For compute node, high-performance Python: use the cray-python environment. Need to document how you use this and how you install further Python modules on top
For data analysis, serial Python: probably provide an Anaconda distribution. Should this be provided as a module or a container environment?
For self-installed Python: need to recommend a solution. Could be miniconda or could advise to pull containers from the DockerHub

Initial version of Debugging section

Need to create the basic documentation for using gdb4hpc. Could base on docs at:

https://www.alcf.anl.gov/support-center/theta/gdb

Could then be improved once we have experience on the system.

Create initial entries in Data Analysis and Tools section

We need to look at creating some content here. Some initial pages could be:

VisiData
R (Cray R)

Add info on ownership of data in subgroup directories

New data created in subgroup directories has the correct ownership because of the setgid bit. However, data copied or moved from elsewhere on /work (e.g. main project directories) keeps its existing ownership, and because the moved directories also have the setgid bit set, new data created inside them inherits that original ownership too. We should document this issue, and the use of the chown command to fix ownership, as it does trip users up.
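
A rough sketch of the kind of fix that could be documented, run here in a scratch directory so that it is safe to try. The directory names are made up, and the user's own primary group stands in for the subgroup's Unix group; on ARCHER2 the target would be a subgroup directory on /work.

```shell
#!/bin/bash
# Scratch directory standing in for a subgroup directory on /work.
subgroup_dir=$(mktemp -d)
# Numeric ID of your primary group, standing in for the subgroup's group.
target_group=$(id -g)

# Pretend this data was moved in from elsewhere on /work.
mkdir -p "${subgroup_dir}/moved_in/results"
touch "${subgroup_dir}/moved_in/results/output.dat"

# Report anything that does not belong to the expected group.
find "${subgroup_dir}" ! -group "${target_group}" -print

# Give everything the subgroup's group (owner unchanged)...
chown -R ":${target_group}" "${subgroup_dir}"
# ...and make sure directories carry the setgid bit so new files inherit it.
find "${subgroup_dir}" -type d -exec chmod g+s {} +
```

Running the first `find` again afterwards should print nothing.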

Add Quickstart for Package Users

The ARCHER2 Quickstart for package users needs to be added. Overview of the content required can be found at:

https://docs.archer2.ac.uk/quick-start/overview.html

Add Quickstart for developers

The ARCHER2 Quickstart for developers needs to be added. Overview of the content required can be found at:

https://docs.archer2.ac.uk/quick-start/overview.html

Add instructions on compiling on compute nodes

We should add instructions on using the collections in /etc/cray-pe.d to setup programming environments on compute nodes.

Complete Scheduler chapter

Complete initial version of scheduler chapter for initial beta release

Document use of hybrid MPI+OpenMP

The current documentation has an example MPI+OpenMP script but no documentation describing the background of how to run these jobs, more advanced placement information and a description of the best layout to match onto the ARCHER2 NUMA structure. This should be added in the Scheduler chapter. Point to the Tuning chapter for more advanced information on OpenMP.
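
As a starting point for the layout discussion, the arithmetic of matching MPI tasks and OpenMP threads to NUMA regions can be sketched as below. It assumes 128 cores per node arranged as 8 NUMA regions of 16 cores; the function name `hybrid_layout` is hypothetical, and the sketch does not address which `srun` placement options are needed to pin threads correctly.

```shell
#!/bin/bash
# Assumed node layout (to be confirmed on the real system).
CORES_PER_NODE=128
NUMA_REGIONS_PER_NODE=8

# hybrid_layout TASKS_PER_NUMA_REGION
# Prints the tasks per node and threads per task that fill a node
# without any MPI task's threads spanning two NUMA regions.
hybrid_layout() {
    local tasks_per_numa=$1
    local cores_per_numa=$(( CORES_PER_NODE / NUMA_REGIONS_PER_NODE ))
    if (( tasks_per_numa < 1 || cores_per_numa % tasks_per_numa != 0 )); then
        echo "tasks per NUMA region must divide ${cores_per_numa}" >&2
        return 1
    fi
    local tasks_per_node=$(( tasks_per_numa * NUMA_REGIONS_PER_NODE ))
    local threads=$(( cores_per_numa / tasks_per_numa ))
    echo "--ntasks-per-node=${tasks_per_node} --cpus-per-task=${threads} OMP_NUM_THREADS=${threads}"
}

hybrid_layout 1   # one MPI task per NUMA region
hybrid_layout 4   # four MPI tasks per NUMA region
```

The printed values correspond to the `--ntasks-per-node` and `--cpus-per-task` batch options and to the `OMP_NUM_THREADS` setting in the job script.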

Update library modules requiring new versions

New

ARPACK 3.8.0

Deferred

ADIOS 2.6.0

Other

Confirm status of cray-ga or remove

Update NAMD page with information on how to get good performance

Early access users identified that particular run options and configuration are required to get good performance using NAMD on ARCHER2. The NAMD page needs to be updated to include this information.

Update STAT documentation

The following document:
https://docs.archer2.ac.uk/user-guide/debug/#stat
does not reflect the following issue:
https://docs.archer2.ac.uk/known-issues/#stat-view-not-working
It'd be useful to link the issue from the STAT section.

Filling I/O and file systems sections

Possible seed material at: http://www.archer.ac.uk/documentation/best-practice-guide/io.php and http://www.archer.ac.uk/documentation/user-guide/resource_management.php#sec-3.3

Add detailed hardware information in a new User and Best Practice Guide section

It would be useful to have a new section in the User and Best Practice Guide that covers the ARCHER2 hardware and architecture in more detail. This could go after Overview but before Connecting. This should cover:

System overview: node types, storage types, interconnect, external networking
Compute node details: layout, interconnect
Processor details: cores, core complexes, Infinity Fabric, NUMA regions, FP unit and instruction sets, cache
Memory details: type, speed, volume, bandwidth/latency (theoretical and measured)
Interconnect details: topology, features, bandwidth/latency (theoretical and measured)
Point to IO section for more details on storage

Are "here" documents useful for job submission on ARCHER2?

In the NERSC scheduler best practice, they use here documents to potentially reduce load on compute nodes and make jobs more efficient. See:

https://docs.nersc.gov/jobs/best-practices/#improve-efficiency-by-preparing-user-environment-before-running

Where they describe creating a script such as:

#!/bin/bash -l

# Submit this script as: "./prepare-env.sh" instead of "sbatch prepare-env.sh"

# Prepare user env needed for Slurm batch job
# such as module load, setup runtime environment variables, or copy input files, etc.
# Basically, these are the commands you usually run ahead of the srun command 

module load cray-netcdf
export OMP_NUM_THREADS=4

# Generate the Slurm batch script below with the here document, 
# then when sbatch the script later, the user env set up above will run on the login node
# instead of on a head compute node (if included in the Slurm batch script),
# and inherited into the batch job.

cat << EOF > prepare-env.sl 
#!/bin/bash
#SBATCH -t 30:00
#SBATCH -N 8
#SBATCH -q debug
#SBATCH -C haswell

srun -n 16 -c 32 --cpu_bind=cores ./myapp.exe 

# Other commands needed after srun, such as copy your output files,
# should still be included in the Slurm script.
cp <my_output_file> <target_location>/.
EOF

# Now submit the batch job
sbatch prepare-env.sl

@kevinstratford commented

Not sure I like that here document business; if the preparatory work is really
significant, it could be a separate job with the main job as dependency. This
prevents conflating scripts (does prepare-env.sh here document overwrite
the submitted prepare-env.sh??)

What do people think, should we include this advice or not?
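
One detail that may help the discussion: in the NERSC example the here document is written to `prepare-env.sl`, a different file from `prepare-env.sh`, so the generating script is not overwritten. A subtler point is when variables are expanded. The sketch below uses made-up file names in a temporary directory to show the difference between an unquoted and a quoted delimiter.

```shell
#!/bin/bash
workdir=$(mktemp -d)
export OMP_NUM_THREADS=4

# Unquoted delimiter: ${OMP_NUM_THREADS} is expanded now, when the
# batch script is generated (i.e. on the login node).
cat << EOF > "${workdir}/expanded.slurm"
#!/bin/bash
export OMP_NUM_THREADS=${OMP_NUM_THREADS}
EOF

# Quoted delimiter: the text is written literally, and the variable is
# only expanded later, when the batch script itself runs.
cat << 'EOF' > "${workdir}/literal.slurm"
#!/bin/bash
export OMP_NUM_THREADS=${OMP_NUM_THREADS}
EOF

# Show the line that differs between the two generated scripts.
grep OMP_NUM_THREADS "${workdir}/expanded.slurm" "${workdir}/literal.slurm"
```

If we do recommend here documents, this difference would need to be explained, since a misplaced quote changes which environment the job sees.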

Update "Running jobs on ARCHER2" section based on TDS access

Update MITgcm documentation to include new build options file

As part of an eCSE, we are developing a new build options file. We would like to link to a preliminary version of the new file.

(Please feel free to assign this issue to me.)
