eessi / test-suite
A portable test suite for software installations, using ReFrame
License: GNU General Public License v2.0
The common_eessi_init()
function should be updated as well, and it should honor the $EESSI_*
environment variables that are set by the EESSI initialization script.
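As a rough illustration, honoring those variables could look like the sketch below. This is not the actual implementation: the `$EESSI_PREFIX` variable name and the fallback path are assumptions based on the EESSI initialization script.

```python
import os

def common_eessi_init(env=os.environ):
    """Return the shell command to initialize EESSI.

    If an EESSI session is already active (detected here via $EESSI_PREFIX,
    an assumed variable name), re-use it instead of hard-coding a path.
    """
    prefix = env.get('EESSI_PREFIX')
    if prefix:
        return f'source {prefix}/init/bash'
    # fallback path is illustrative only
    return 'source /cvmfs/pilot.eessi-hpc.org/latest/init/bash'

print(common_eessi_init({'EESSI_PREFIX': '/cvmfs/pilot.eessi-hpc.org/versions/2021.12'}))
# -> source /cvmfs/pilot.eessi-hpc.org/versions/2021.12/init/bash
```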
ReFrame submits all the tests to the same partition. So if ReFrame is started from the cpu_milan partition, all the tests that are found for other partitions will also be submitted to cpu_milan. This goes especially horribly wrong when starting from a GPU partition, since all the tests meant for the CPU partitions fail immediately.
We have narrowed down the problem to the following parts of the Hortense system and the vsc_hortense.py config file:
- `#SBATCH --partition=cpu_milan` in the job script (rfm_job.sh)
- `SBATCH_PARTITION=cpu_rome`, set by Hortense cluster modules
- with `sbatch rfm_job.sh`, the SBATCH_PARTITION variable wins
Could it be possible that ReFrame submits the job with `sbatch --partition=cpu_milan rfm.job`?
A possible workaround might be to use prepare_cmds to set the environment variable SBATCH_PARTITION for every partition in config/vsc_hortense.py.
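A sketch of what that workaround could look like in the site configuration (partition details are illustrative; whether this reliably overrides the module-set value would need testing):

```python
# Sketch: force SBATCH_PARTITION to match the ReFrame partition, so the
# value set by the Hortense cluster modules cannot redirect the job.
partition = {
    'name': 'cpu_milan',
    'scheduler': 'slurm',
    'access': ['--partition=cpu_milan'],
    # emitted at the top of every generated job script for this partition
    'prepare_cmds': ['export SBATCH_PARTITION=cpu_milan'],
}

print(partition['prepare_cmds'][0])  # -> export SBATCH_PARTITION=cpu_milan
```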
useful for debugging
should write into the reframe debug log file
not really an issue as we don't use gitpython, and the test does not fail over it, but:
ERROR: Build of /home/runner/.local/easybuild/easyconfigs/r/ReFrame/ReFrame-4.2.0.eb failed (err: 'build failed (first 300 chars): `/cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/usr/bin/python -m pip check` failed:\ngitpython 3.1.24 requires typing-extensions, which is not installed.\n')
Right now, our namespaces are a bit messy: eessi_checks and eessi_utils are installed as two separate packages, and form the top-level namespaces. That means we get imports like this in the tests:
from eessi_utils import hooks
from eessi_utils import utils
Furthermore, if we ever inherit from some base mixin class, we'd get things like
from eessi_checks import eessi_testbase
All pretty ugly.
Discussed a bit with Sam on Slack. It might be nice to have everything related to testing in a eessi.testing
namespace (we want to avoid eessi.test
or eessi.tests
, as we probably want to reserve the .tests
for the actual reframe test classes). E.g. we'd have things like:
- eessi.testing.utils
- eessi.testing.hooks
- eessi.testing.generic.eessi_baseline (a baseline class for mixin inheritance)
- eessi.testing.tests.tensorflow (a parent class from which multiple TensorFlow tests could be derived, much like we derive our current GROMACS test from hpctestlib.sciapps.gromacs.benchmarks)
It would leave the option open of adding other Python stuff (not related to testing) to the eessi namespace if we need to later on (hence the nesting under eessi.testing). As far as I understand, this can be done with namespace packages, which allows the packages to be installed separately, versioned separately, etc. See https://packaging.python.org/en/latest/guides/packaging-namespace-packages/.
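A quick self-contained demonstration of PEP 420 native namespace packages (directory and attribute names here are illustrative): the top-level eessi directory has no `__init__.py`, so separately installed distributions can each contribute their own subpackage:

```python
import os
import sys
import tempfile

# Build an 'eessi' namespace package on the fly: note there is NO
# eessi/__init__.py, only a regular 'testing' subpackage inside it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'eessi', 'testing'))
with open(os.path.join(root, 'eessi', 'testing', '__init__.py'), 'w') as f:
    f.write("NAME = 'eessi.testing'\n")

sys.path.insert(0, root)
import eessi.testing  # 'eessi' resolves as a namespace package

print(eessi.testing.NAME)  # -> eessi.testing
```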
Opinions @satishskamath @boegel @smoors ?
Of course comments within the document help but some background will be useful to provide reasoning for those comments.
We should be able to control which module is used when running a test
the problem is in the mympirun wrapper: the --hybrid option expects as its value the number of tasks per node instead of the total number of tasks.
@register_launcher('mympirun')
class MyMpirunLauncher(JobLauncher):
def command(self, job):
return ['mympirun', '--hybrid', str(job.num_tasks)]
possible solutions: replace `--hybrid job.num_tasks` with one of:
- `--hybrid job.num_tasks_per_node` (i.e. job.num_tasks divided by job.num_nodes) --> I checked that this indeed fixes the job script
- `--universe job.num_tasks`
- `--mpirunoptions="-np job.num_tasks"`
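A minimal sketch of the first fix as a plain function (not the actual JobLauncher subclass), just showing the per-node task count that --hybrid expects:

```python
def mympirun_command(num_tasks, num_nodes):
    """Build the mympirun command line; --hybrid takes tasks *per node*."""
    tasks_per_node = num_tasks // num_nodes
    return ['mympirun', '--hybrid', str(tasks_per_node)]

print(mympirun_command(256, 2))  # -> ['mympirun', '--hybrid', '128']
```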
With ReFrame as required dependency
On some systems, specific scales like 1_cpn_2_nodes don't make sense, because the system simply doesn't allow using partial nodes when requesting multiple nodes.
This is the case on Snellius @ SURF, where submitting a job that asks for 2 GPUs on 2 different nodes doesn't work.
Ideally, there is a way to express that 1_cpn_2_nodes
should never be used on Snellius through the configuration file.
cfr. discussion in #54 (comment)
currently the number of OpenMP threads is always set to 1, even if there is only one task.
ideally, we should set the number of OpenMP threads equal to the number of CPUs per task.
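Sketched as a hook on a stand-in test object (the attribute names mirror ReFrame's num_cpus_per_task and an env-var mapping, but this is not the suite's actual hook):

```python
class FakeTest:
    """Stand-in for a ReFrame test with only the attributes the hook needs."""
    num_cpus_per_task = 10
    env_vars = {}

def set_omp_num_threads(test):
    # give each task one OpenMP thread per allotted CPU, instead of always 1
    test.env_vars['OMP_NUM_THREADS'] = str(test.num_cpus_per_task)

t = FakeTest()
set_omp_num_threads(t)
print(t.env_vars['OMP_NUM_THREADS'])  # -> 10
```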
performance comparison of 2 GPU jobs:
Batch Script for 7834827
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name="rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --output=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.out
#SBATCH --error=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.err
#SBATCH --time=0:30:0
#SBATCH --partition=pascal_gpu --gpus-per-node=1
module load GROMACS/2021.3-foss-2021a-CUDA-11.3.1
export OMP_NUM_THREADS=1
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/hEGFRDimer/benchmark.tpr
srun gmx_mpi mdrun -dlb yes -ntomp 1 -npme -1 -nb gpu -s benchmark.tpr
PERFORMANCE REPORT
------------------------------------------------------------------------------
GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=gpu %scale=('singlenode', 1) %module_name=GROMACS/2021.3-foss-2021a-CUDA-11.3.1
- hydra:pascal
- builtin
* num_tasks: 1
* perf: 1.667 ns/day
Batch Script for 7834837
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name="rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --output=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.out
#SBATCH --error=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.err
#SBATCH --time=0:30:0
#SBATCH --partition=pascal_gpu --gpus-per-node=1
module load GROMACS/2021.3-foss-2021a-CUDA-11.3.1
export OMP_NUM_THREADS=10
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/hEGFRDimer/benchmark.tpr
srun gmx_mpi mdrun -nb gpu -s benchmark.tpr -dlb yes -ntomp 10 -npme -1
PERFORMANCE REPORT
------------------------------------------------------------------------------
GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=gpu %scale=('singlenode', 1) %module_name=GROMACS/2021.3-foss-2021a-CUDA-11.3.1
- hydra:pascal
- builtin
* num_tasks: 1
* perf: 8.227 ns/day
Concretely: with https://github.com/EESSI/test-suite/pulls ReFrame instantiates all combinations of partitions and e.g. nb_impl. This also generates invalid combinations, which we are currently skipping (e.g. nb_impl=gpu on a partition with CPU-only nodes).
We should probably specify in the test that it requires the partition to have certain features, as described here: https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html?highlight=feature#reframe.core.pipeline.RegressionTest.valid_systems
A code snippet of what that could look like:
@run_after('init')
def apply_module_info(self):
self.s, self.e, self.m = self.module_info
valid_systems = self.s
if self.nb_impl == 'gpu':
valid_systems = '+gpu'
self.valid_systems = [valid_systems]
self.modules = [self.m]
self.valid_prog_environs = [self.e]
and then in the ReFrame config file we should have something like
...
'partitions': [
{
'name': 'my_partition',
...
'features': [
'gpu',
]
...
}
]
Command executed:
[satishk@int4 projects]$ PYTHONPATH=$PYTHONPATH:$EBROOTREFRAME:$eessihome reframe -vvvv -C eessi_reframe/settings_example.py -c test-suite/eessi/reframe/eessi_checks/applications/ -t CI -t 1_node -l
Part of the output:
Looking for tests in '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications'
Validating '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications/__init__.py': not a test file
Validating '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications/gromacs_check.py': OK
WARNING: skipping test 'GROMACS_EESSI': test has one or more undefined parameters
> Loaded 0 test(s)
Loaded 0 test(s)
Config settings file:
""" This file is a settings file for eessi test suite. """
from os import environ
username = environ.get('USER')
# This is an example configuration file
site_configuration = {
'systems': [
{
'name': 'snellius_eessi',
'descr': 'example_cluster',
'modules_system': 'lmod',
'hostnames': ['tcn*', 'gcn*'],
# Note that the stagedir should be a shared directory available on
# all nodes running ReFrame tests
'stagedir': f'/scratch-shared/{username}/reframe_output/staging',
'partitions': [
{
'name': 'cpu',
'scheduler': 'slurm',
'launcher': 'mpirun',
'access': ['-p thin'],
'environs': ['default'],
'max_jobs': 4,
'processor': {
'num_cpus': 128,
'num_sockets': 2,
'num_cpus_per_socket': 64,
'arch': 'znver2',
},
'features': ['cpu'],
'descr': 'CPU partition'
},
{
'name': 'gpu',
'scheduler': 'slurm',
'launcher': 'mpirun',
'access': ['-p gpu'],
'environs': ['default'],
'max_jobs': 4,
'processor': {
'num_cpus': 72,
'num_sockets': 2,
'num_cpus_per_socket': 36,
'arch': 'icelake',
},
'resources': [
{
'name': '_rfm_gpu',
'options': ['--gpus-per-node={num_gpus_per_node}'],
}
],
'devices': [
{
'type': 'gpu',
'num_devices': 4,
}
],
'features': ['cpu', 'gpu'],
'descr': 'GPU partition'
},
]
},
],
'environments': [
{
'name': 'default',
'cc': 'cc',
'cxx': '',
'ftn': '',
},
],
'logging': [
{
'level': 'debug',
'handlers': [
{
'type': 'stream',
'name': 'stdout',
'level': 'info',
'format': '%(message)s'
},
{
'type': 'file',
'name': 'reframe.log',
'level': 'debug',
'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s', # noqa: E501
'append': False
}
],
'handlers_perflog': [
{
'type': 'filelog',
'prefix': '%(check_system)s/%(check_partition)s',
'level': 'info',
'format': (
'%(check_job_completion_time)s|reframe %(version)s|'
'%(check_info)s|jobid=%(check_jobid)s|'
'%(check_perf_var)s=%(check_perf_value)s|'
'ref=%(check_perf_ref)s '
'(l=%(check_perf_lower_thres)s, '
'u=%(check_perf_upper_thres)s)|'
'%(check_perf_unit)s'
),
'append': True
}
]
}
],
}
A test using the hEGFRDimerSmallerPL
GROMACS input fails if the GROMACS version is too old:
Program: gmx mdrun, version 2019.3
Source file: src/gromacs/fileio/tpxio.cpp (line 2695)
MPI rank: 0 (out of 96)
Fatal error:
reading tpx file (benchmark.tpr) version 119 with version 116 program
I had this bit written up for the docs on the EESSI test suite, but neither of these options seems to work as intended with our GROMACS/TensorFlow tests:
#### Filtering by device (CPU, GPU)
By default, ReFrame will generate variants of tests for each applicable
device type, based on the specified [`features`](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.features) for system partitions (in the ReFrame configuration file) and [`valid_systems`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.valid_systems) value of the available tests.
To only run checks on CPU, you can use the [`--cpu-only` option](https://reframe-hpc.readthedocs.io/en/stable/manpage.html#cmdoption-cpu-only).
To only run tests on GPU, you can use the [`--gpu-only` option](https://reframe-hpc.readthedocs.io/en/stable/manpage.html#cmdoption-gpu-only).
For example, to only run tests on GPU:
```
reframe --gpu-only
```
You can use `--list` to check the impact of these options on generated checks.
- With `--gpu-only --list`, no checks are generated
- With `--cpu-only --list`, I see GROMACS CUDA modules being used to generate tests (they should only be used for GPU checks?)

The resulting installation is basically empty; even `import eessi` doesn't work.
When I use python setup.py sdist
to see what's collected in a source tarball, it's clear that eessi/*
is skipped entirely:
$ python3 setup.py sdist
running sdist
running egg_info
writing eessi/testsuite/eessi_testsuite.egg-info/PKG-INFO
writing dependency_links to eessi/testsuite/eessi_testsuite.egg-info/dependency_links.txt
writing requirements to eessi/testsuite/eessi_testsuite.egg-info/requires.txt
writing top-level names to eessi/testsuite/eessi_testsuite.egg-info/top_level.txt
reading manifest file 'eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt'
writing manifest file 'eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url
warning: check: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied
creating eessi-testsuite-0.0.2
creating eessi-testsuite-0.0.2/eessi
creating eessi-testsuite-0.0.2/eessi/testsuite
creating eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying files to eessi-testsuite-0.0.2...
copying README.md -> eessi-testsuite-0.0.2
copying setup.cfg -> eessi-testsuite-0.0.2
copying setup.py -> eessi-testsuite-0.0.2
copying eessi/testsuite/eessi_testsuite.egg-info/PKG-INFO -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/dependency_links.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/requires.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/top_level.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
Writing eessi-testsuite-0.0.2/setup.cfg
creating dist
Creating tar archive
removing 'eessi-testsuite-0.0.2' (and everything under it)
This is probably due to the older setuptools
version I'm using here (default version on RHEL8):
$ python3 -m pip show setuptools
Name: setuptools
Version: 39.2.0
Summary: Easily download, build, install, upgrade, and uninstall Python packages
Home-page: https://github.com/pypa/setuptools
Author: Python Packaging Authority
Author-email: [email protected]
License: UNKNOWN
Location: /usr/lib/python3.6/site-packages
Requires:
Required-by: jsonschema, setools, vsc-install, Sphinx, protobuf
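For reference, a sketch of a setup.cfg that makes modern setuptools pick up the eessi/* namespace packages via the find_namespace: directive (this directive is not supported by setuptools 39.2.0, so a newer setuptools would still be required):

```ini
[options]
packages = find_namespace:

[options.packages.find]
include = eessi.*
```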
Both srun and mpirun hang when SLURM_MPI_TYPE=pmix.
Should be bumped in both setup.cfg
and pyproject.toml
The contents of the reframe.log and reframe.out files get overwritten by ReFrame with every run.
Currently we keep the logs and outputs of different runs by changing the names of the files manually.
Currently, find_modules
searches for substrings. This can cause issues if software names are used as a suffix.
Example: if I have the modules
TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
Horovod/0.22.1-foss-2021a-CUDA-11.3.1-TensorFlow-2.6.0
I would want to run the TensorFlow test only on the first. However, find_modules
will return both. Note that the test would run when loading the Horovod module, there's just no added benefit in running this same test twice, since you're essentially testing the same module.
We should probably make the function more specific and make it only match if it starts with this substring.
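A sketch of that more specific matching (the real find_modules lives in the suite's utilities; this just illustrates anchoring the match at the start of the module name):

```python
import re

modules = [
    'TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1',
    'Horovod/0.22.1-foss-2021a-CUDA-11.3.1-TensorFlow-2.6.0',
]

def find_modules(name, modules):
    """Match only modules whose name component starts with 'name'."""
    regex = re.compile(rf'^{re.escape(name)}(/|$)')
    return [m for m in modules if regex.match(m)]

print(find_modules('TensorFlow', modules))
# -> ['TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1']
```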
[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ rm -rf /tmp/reframe_421
[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ python3 -m venv /tmp/reframe_421
[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ source /tmp/reframe_421/bin/activate
(reframe_421) [casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ python3 -m pip install reframe-hpc==4.2.1
Collecting reframe-hpc==4.2.1
Using cached https://files.pythonhosted.org/packages/aa/6c/a7a5cc465ef223bf686717b6034a817be7028f30b28b4d2b840f1c29bda3/ReFrame_HPC-4.2.1-py3-none-any.whl
Collecting archspec (from reframe-hpc==4.2.1)
Using cached https://files.pythonhosted.org/packages/63/ae/333e7d216dda9134558ddc30792d96bfc58968ff5cc69b4ad9e02dfac654/archspec-0.2.1-py3-none-any.whl
Collecting argcomplete (from reframe-hpc==4.2.1)
Using cached https://files.pythonhosted.org/packages/4f/ef/8b604222ba5e5190e25851aa3a5b754f2002361dc62a258a8e9f13e866f4/argcomplete-3.1.1-py3-none-any.whl
Collecting jsonschema (from reframe-hpc==4.2.1)
Using cached https://files.pythonhosted.org/packages/c5/8f/51e89ce52a085483359217bc72cdbf6e75ee595d5b1d4b5ade40c7e018b8/jsonschema-3.2.0-py2.py3-none-any.whl
Collecting PyYAML (from reframe-hpc==4.2.1)
Using cached https://files.pythonhosted.org/packages/36/2b/61d51a2c4f25ef062ae3f74576b01638bebad5e045f747ff12643df63844/PyYAML-6.0.tar.gz
Collecting lxml (from reframe-hpc==4.2.1)
Using cached https://files.pythonhosted.org/packages/06/5a/e11cad7b79f2cf3dd2ff8f81fa8ca667e7591d3d8451768589996b65dec1/lxml-4.9.2.tar.gz
Complete output from command python setup.py egg_info:
Building lxml version 4.9.2.
Building without Cython.
Error: Please make sure the libxml2 and libxslt development packages are installed.
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-bospx5ra/lxml/
You are using pip version 9.0.3, however version 23.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
@boegel can you have a look at this? It seems to be missing the development headers for libxml2 and libxslt.
ReFrame supports performance checking, but this requires hard-coding system name & expected performance in the tests. This breaks test portability. Some ideas would be:
This task should explore different options, then decide on how to proceed (e.g. maybe we go for option (1) in the short term, but work on other long term options?). Since we planned on co-design with ReFrame, those options are also attractive.
Use module name + _EESSI.
Use
from common_config import common_logging_config
'logging': common_logging_config,
in configuration files, for reframe.log and reframe.out, etc.

Some applications that are compiled with CUDA, such as GROMACS, run on pure CPU partitions without complaints, but some applications, such as OSU (#54), require libcuda.so.1, which their binaries are linked to. An argument for this was that we can link against the stubs libcuda.so.1, if the normal one is not found, using LD_LIBRARY_PATH.
Should the cuda feature be defined at the partition level? Environment variables could be passed through with mpirun -x.
The GROMACS test will serve as a blueprint for other tests, so it should be documented well.
We want to create certain low-level tests as well, to validate basic functionality of communication libraries etc. One of those is OSU. This will also help us in figuring out how the test 'blueprint' for GROMACS works for these types of tests.
As a result of a conversation with @bartoldeman I tried to run our GROMACS test on the software stack from The Alliance, but it failed because our current tests (specifically find_modules) assume a flat module scheme.
We should think if/how we can improve this. Probably have a look at how The Alliance specifies their ReFrame tests right now.
currently, if you specify valid_systems on the command line, no automatic filtering is done, so you have to do the filtering yourself.
the reason is that unfortunately, combining systemname:partitionname with features is not fully supported in the current version of ReFrame (4.0.5).
for example, this works (multiple list items behave like the OR operation):
valid_partitions = ['*:ampere', '+cpu']
but this does not (multiple string items behave like AND operation):
valid_partitions = ['*:ampere +cpu']
this is what ReFrame shows in this case:
WARNING: skipping test 'GROMACS_EESSI': type error: test-suite/eessi/reframe/eessi-checks/applications/gromacs_check.py:62: failed to set field 'valid_systems': '['*:ampere +gpu']' is not of type 'List[Str[r'^(((\*|(\w[-.\w]*))(:(\*|(\w[-.\w]*)))?)|(([+-](\w[-.\w]*))|(%(\w[-.\w]*)=\S+))(\s+(([+-](\w[-.\w]*))|(%(\w[-.\w]*)=\S+)))*)$']]'
self.valid_systems = [valid_systems]
maybe it's possible to check the features of the partitions that are specified on the command line, and filter the tests based on that
more info here: https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.valid_systems
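That command-line filtering could look roughly like this (partition data and helper are illustrative, not ReFrame API):

```python
# Partitions as they would appear in the site configuration
partitions = [
    {'name': 'ampere', 'features': ['cpu', 'gpu']},
    {'name': 'rome', 'features': ['cpu']},
]

def partitions_with(feature, partitions):
    """Return the names of partitions advertising a given feature."""
    return [p['name'] for p in partitions if feature in p['features']]

print(partitions_with('gpu', partitions))  # -> ['ampere']
```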
For example based on https://github.com/EESSI/eessi-demo/tree/main/OpenFOAM
(reframe_421) [eucasparvl@vglogin0003 ~]$ reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes" -r
[ReFrame Setup]
version: 4.2.1
command: '/tmp/reframe_421/bin/reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t 1_node|2_nodes -r'
launched by: [email protected]
working directory: '/ceph/hpc/home/eucasparvl'
settings files: '<builtin>', 'test-suite/config/izum_vega.py'
check search path: (R) '/ceph/hpc/home/eucasparvl/test-suite/eessi/testsuite/tests/apps'
stage directory: '/ceph/hpc/home/eucasparvl/reframe_runs/staging'
output directory: '/ceph/hpc/home/eucasparvl/reframe_runs/output'
log files: '/ceph/hpc/home/eucasparvl/reframe_20230622_164619.log'
[==========] Running 4 check(s)
[==========] Started on Thu Jun 22 16:47:10 2023
[----------] start processing checks
[ RUN ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /c3790f2a @vega:cpu+default
[ RUN ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /5535abba @vega:cpu+default
[ RUN ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /a108fe65 @vega:cpu+default
[ RUN ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /695f354c @vega:cpu+default
[ OK ] (1/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /c3790f2a @vega:cpu+default
P: perf: 345.984 ns/day (r:0, l:None, u:None)
[ OK ] (2/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /695f354c @vega:cpu+default
P: perf: 6.044 ns/day (r:0, l:None, u:None)
[ OK ] (3/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /a108fe65 @vega:cpu+default
P: perf: 5.61 ns/day (r:0, l:None, u:None)
[ FAIL ] (4/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /5535abba @vega:cpu+default
==> test failed during 'sanity': test staged in '/ceph/hpc/home/eucasparvl/reframe_runs/staging/vega/cpu/default/GROMACS_EESSI_5535abba'
[----------] all spawned checks have finished
[ FAILED ] Ran 4/4 test case(s) from 4 check(s) (1 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Jun 22 17:17:42 2023
================================================================================================================================================================
SUMMARY OF FAILURES
----------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 (run: 1/1)
* Description: GROMACS HECBioSim/Crambin benchmark (NB: cpu)
* System partition: vega:cpu
* Environment: default
* Stage directory: /ceph/hpc/home/eucasparvl/reframe_runs/staging/vega/cpu/default/GROMACS_EESSI_5535abba
* Node list: cn0403,cn0405
* Job type: batch job (id=65838140)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: sanity
* Rerun with '-n /5535abba -p default --system vega:cpu -r'
* Reason: sanity error: pattern 'Finished mdrun' not found in 'md.log'
--- rfm_job.out (first 10 lines) ---
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 673k 100 673k 0 0 1081k 0 --:--:-- --:--:-- --:--:-- 1081k
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
--- rfm_job.err ---
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/ceph/hpc/home/eucasparvl/reframe_20230622_164619.log'
For example based on https://github.com/EESSI/eessi-demo/tree/main/TensorFlow
This will make sure that people never run into missing processor info errors.
useful for testing with non-default options:
- overriding the executable variable already works via `--setvar executable=<x>`
- can be done from the command line with `reframe --setvar variables=<envar>:<value>`
Command line:
gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1
Reading file benchmark.tpr, VERSION 5.1.4 (single precision)
Note: file tpx version 103, software tpx version 119
Changing nstlist from 10 to 80, rlist from 1.2 to 1.321
-------------------------------------------------------
Program: gmx mdrun, version 2020.1-EasyBuild-4.5.0
Source file: src/gromacs/domdec/domdec.cpp (line 2277)
MPI rank: 0 (out of 2048)
Fatal error:
There is no domain decomposition for 1536 ranks that is compatible with the
given box and a minimum cell size of 0.6725 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
This Crambin test is the one tagged with the CI
ReFrame tag.
One solution would be to make the tagging a bit more complicated, and use the Crambin test only for the singlenode case (and possibly two nodes). A more intricate solution could be to also implement a maximum task count for this test, i.e. let it auto-select a task count, and afterwards check if that is larger than 1536, then cap it. It does mean testing on more nodes is then useless, so if those tests are still generated it would lead to a waste of resources.
For now, I'd prefer just tagging a large test case for larger node counts.
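The capping idea, sketched on a stand-in test object (the attribute name num_tasks matches ReFrame; the cap value would have to come from the benchmark definition):

```python
class FakeTest:
    """Stand-in for a ReFrame test; only num_tasks matters here."""
    num_tasks = 2048

def cap_num_tasks(test, max_tasks):
    # clamp the auto-selected task count to a benchmark-specific maximum
    test.num_tasks = min(test.num_tasks, max_tasks)

t = FakeTest()
cap_num_tasks(t, max_tasks=1024)  # the value 1024 is illustrative only
print(t.num_tasks)  # -> 1024
```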
Currently, we don't set any binding options explicitly for the GROMACS test. This may or may not result in 'sensible' binding. We should check how we can make sure that it binds in a reasonable way. Also, we should check that it launches correct amounts of tasks on hyperthreading nodes (1 per physical core, not 1 per thread).
For binding, we'd want processes to be bound to physical cores for pure MPI (i.e. CPU runs of GROMACS). For GPU runs (essentially hybrid OpenMP+MPI) we should bind each task to at least 1/x-th of the node's CPU cores (if there are x GPUs in the node). Even better would be to also add thread binding to this.
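One possible mapping from run mode to Slurm binding flags (the flags themselves are standard srun options; the policy of when to use which is an assumption to be validated):

```python
def binding_opts(use_gpu, gpus_per_node=4):
    """Return srun options for the two GROMACS run modes."""
    if not use_gpu:
        # pure MPI: 1 task per physical core, no hyperthreads
        return ['--cpu-bind=cores', '--hint=nomultithread']
    # hybrid MPI+OpenMP: 1 task per GPU, bound to its share of the cores
    return [f'--ntasks-per-node={gpus_per_node}', '--cpu-bind=sockets']

print(binding_opts(False))  # -> ['--cpu-bind=cores', '--hint=nomultithread']
```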
For example based on https://github.com/EESSI/eessi-demo/tree/main/Bioconductor
with the feature cuda we can more explicitly filter on devices that support CUDA (required for tests that use a CUDA module on a GPU).
later on, we can more easily add rocm and oneapi when AMD and Intel GPU modules become available in EESSI.
copying discussion of #28
For the 1_core, 2_core and 4_core CPU tests, the walltime is too short.
Point number 2 makes me think. First of all, the error is quite non-descriptive:
FAILURE INFO for GROMACS_EESSI_093
* Expanded name: GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=cpu %scale=4_cores %module_name=GROMACS/2021.6-foss-2022a
* Description: GROMACS HECBioSim/hEGFRDimer benchmark (NB: cpu)
* System partition: snellius:thin
* Environment: default
* Stage directory: /scratch-shared/casparl/reframe_output/staging/snellius/thin/default/GROMACS_EESSI_1dfdd606
* Node list:
* Job type: batch job (id=2773366)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: sanity
* Rerun with '-n /1dfdd606 -p default --system snellius:thin -r'
* Reason: sanity error: pattern 'Finished mdrun' not found in 'md.log'
Maybe we should make a standard sanity check that checks if the job output does not contain something like
slurmstepd: error: *** JOB 2773368 ON tcn509 CANCELLED AT 2023-05-19T20:40:10 DUE TO TIME LIMIT ***
I'm not sure how this generalizes to other systems (I'm assuming all SLURM based systems print this by default), but even so: it doesn't hurt to check. At least there is a better chance of getting a clear error message.
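A sketch of such a check as a plain function over the job output (the Slurm wording is taken from the message above and may vary across sites):

```python
import re

CANCEL_RE = re.compile(r'CANCELLED AT .* DUE TO TIME LIMIT')

def job_hit_time_limit(job_output):
    """Return True if Slurm reports the job was killed on its time limit."""
    return CANCEL_RE.search(job_output) is not None

msg = ('slurmstepd: error: *** JOB 2773368 ON tcn509 CANCELLED AT '
       '2023-05-19T20:40:10 DUE TO TIME LIMIT ***')
print(job_hit_time_limit(msg))  # -> True
```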
Secondly, how do we make sure we don't run out of walltime? Sure, we could just specify a very long time, but that can be problematic as well (not satisfying max walltimes on a queue, and in our case, jobs <1h actually get backfilled behind a floating reservation that we have in order to reserve some dedicated nodes for short jobs). Should we scale the max walltime based on the amount of resources?
about the walltime issue:
about checking for the exceeded time limit message:
i think that's a good idea. we could expand that to also check for an out-of-memory message, and maybe other slurm messages that i don't know of. even better would be to check that the job error file is empty, but Gromacs prints some stuff to stderr, would be nice if there is a way to force Gromacs to not do that.
mostly 'cpu' and 'gpu' for now, maybe also 'CUDA'
also think about more fine-grained/hierarchical naming of more specific devices
also move constants to a separate constants.py file
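A minimal sketch of what such a constants.py could hold (the names are suggestions, not the suite's actual constants):

```python
# constants.py (sketch): single source of truth for device/feature names,
# so tests and configs cannot drift apart on spelling
DEVICE_TYPES = {
    'CPU': 'cpu',
    'GPU': 'gpu',
}

FEATURES = {
    'CPU': 'cpu',
    'GPU': 'gpu',
    'CUDA': 'cuda',  # could later be joined by 'rocm', 'oneapi'
}

print(FEATURES['CUDA'])  # -> cuda
```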