test-suite's Issues

job submissions to wrong partition on Hortense

ReFrame submits all the tests to the same partition. So if ReFrame is started from the cpu_milan partition, all the tests that are found for other partitions will also be submitted to cpu_milan. This goes especially wrong when starting from a GPU partition, since all the tests meant for the CPU partitions fail immediately.

We have narrowed down the problem to the following parts of the hortense system and the vsc_hortense.py config file:

  • The config file adds the following line, #SBATCH --partition=cpu_milan, to the job script (rfm_job.sh)
  • We have the following environment variable, SBATCH_PARTITION=cpu_rome, set by Hortense cluster module
  • Reframe submits the jobs with sbatch rfm_job.sh
  • So the SBATCH_PARTITION variable wins

Would it be possible for ReFrame to submit the job with sbatch --partition=cpu_milan rfm_job.sh?

A possible workaround might be to use prepare_cmds to set the environment variable SBATCH_PARTITION for every partition in config/vsc_hortense.py.
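A minimal sketch of what that workaround could look like in config/vsc_hortense.py, following the suggestion above (only the relevant keys are shown; whether this fully overrides the externally set variable would need to be verified):

```
# sketch only: override the inherited SBATCH_PARTITION per partition
'partitions': [
    {
        'name': 'cpu_milan',
        'scheduler': 'slurm',
        'prepare_cmds': ['export SBATCH_PARTITION=cpu_milan'],
        # ...
    },
    {
        'name': 'cpu_rome',
        'scheduler': 'slurm',
        'prepare_cmds': ['export SBATCH_PARTITION=cpu_rome'],
        # ...
    },
]
```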

error in github action test_with_eessi_pilot

not really an issue, as we don't use gitpython and the test does not fail because of it, but:

ERROR: Build of /home/runner/.local/easybuild/easyconfigs/r/ReFrame/ReFrame-4.2.0.eb failed (err: 'build failed (first 300 chars): `/cvmfs/pilot.eessi-hpc.org/versions/2021.12/compat/linux/x86_64/usr/bin/python -m pip check` failed:\ngitpython 3.1.24 requires typing-extensions, which is not installed.\n')

Rethink namespaces for the test suite

Right now, our namespaces are a bit messy: eessi_checks and eessi_utils are installed as two separate packages, and form the top-level namespaces. That means we get imports like this in the tests:

from eessi_utils import hooks
from eessi_utils import utils

Furthermore, if we ever inherit from some base mixin class, we'd get things like

from eessi_checks import eessi_testbase

All pretty ugly.

Discussed a bit with Sam on Slack. It might be nice to have everything related to testing in an eessi.testing namespace (we want to avoid eessi.test or eessi.tests, as we probably want to reserve .tests for the actual ReFrame test classes). E.g. we'd have things like:

  • eessi.testing.utils
  • eessi.testing.hooks
  • eessi.testing.generic.eessi_baseline (a baseline class for mixin inheritance)
  • eessi.testing.tests.tensorflow (a parent class from which multiple TensorFlow tests could be derived, much like we derive our current GROMACS test from hpctestlib.sciapps.gromacs.benchmarks)

It would leave open the option of adding other Python stuff (not related to testing) to the eessi namespace later on if we need to (hence the nesting under eessi.testing). As far as I understand, this can be done with namespace packages, which allow the packages to be installed separately, versioned separately, etc. See https://packaging.python.org/en/latest/guides/packaging-namespace-packages/.
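For illustration, a sketch of the resulting imports, assuming a PEP 420 native namespace package (i.e. no __init__.py in the top-level eessi/ directory; the module names are the proposed ones from the list above):

```
# hypothetical imports under the proposed eessi.testing namespace
from eessi.testing import hooks, utils
from eessi.testing.generic import eessi_baseline
```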

Opinions @satishskamath @boegel @smoors ?

wrong number of MPI ranks when running multi-node jobs with vsc_hortense config

the problem is in the mympirun wrapper: the --hybrid option expects as its value the number of tasks per node instead of the total number of tasks.

from reframe.core.backends import register_launcher
from reframe.core.launchers import JobLauncher


@register_launcher('mympirun')
class MyMpirunLauncher(JobLauncher):
    def command(self, job):
        # BUG: --hybrid expects tasks *per node*, but gets the total task count
        return ['mympirun', '--hybrid', str(job.num_tasks)]

in other words, job.num_tasks needs to be replaced by job.num_tasks/job.num_nodes.

possible solutions: replace --hybrid job.num_tasks with one of the following (a sketch of the first option follows the list):

  • --hybrid job.num_tasks_per_node --> I checked that this indeed fixes the job script
  • --universe job.num_tasks
  • --mpirunoptions="-np job.num_tasks"
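A minimal sketch of the first option, which is the one verified above (import paths per the ReFrame 4.x API):

```
from reframe.core.backends import register_launcher
from reframe.core.launchers import JobLauncher


@register_launcher('mympirun')
class MyMpirunLauncher(JobLauncher):
    def command(self, job):
        # --hybrid takes the number of tasks *per node*; note that
        # job.num_tasks_per_node must be set for the test
        return ['mympirun', '--hybrid', str(job.num_tasks_per_node)]
```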

support for filtering out incompatible scales in configuration file

On some systems, specific scales like 1_cpn_2_nodes (one core per node, on two nodes) don't make sense, because the system simply doesn't allow using partial nodes when requesting multiple nodes.

This is the case on Snellius @ SURF, where submitting a job that asks for 2 GPUs on 2 different nodes doesn't work.

Ideally, there is a way to express that 1_cpn_2_nodes should never be used on Snellius through the configuration file.

cfr. discussion in #54 (comment)

sub-optimal number of openmp threads

currently the number of OpenMP threads is always set to 1, even if there is only one task.
ideally, we should set the number of OpenMP threads equal to the number of CPUs per task.
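A minimal sketch of what that could look like as a hook on the test class (attribute names per the ReFrame 4.x API; the hook name is made up):

```
from reframe.core.builtins import run_after


# method on the test class
@run_after('setup')
def set_omp_num_threads(self):
    # use all CPUs allocated to the task instead of the hardcoded 1
    self.env_vars['OMP_NUM_THREADS'] = self.num_cpus_per_task
```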

performance comparison of 2 GPU jobs:

Batch Script for 7834827
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name="rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --output=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.out
#SBATCH --error=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.err
#SBATCH --time=0:30:0
#SBATCH --partition=pascal_gpu --gpus-per-node=1
module load GROMACS/2021.3-foss-2021a-CUDA-11.3.1
export OMP_NUM_THREADS=1
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/hEGFRDimer/benchmark.tpr
srun gmx_mpi mdrun -dlb yes -ntomp 1 -npme -1 -nb gpu -s benchmark.tpr

PERFORMANCE REPORT
------------------------------------------------------------------------------
GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=gpu %scale=('singlenode', 1) %module_name=GROMACS/2021.3-foss-2021a-CUDA-11.3.1
- hydra:pascal
   - builtin
      * num_tasks: 1
      * perf: 1.667 ns/day
Batch Script for 7834837
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH --job-name="rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job"
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --output=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.out
#SBATCH --error=rfm_GROMACS_EESSI___HECBioSim_hEGFRDimer____3328920_0__0_001__gpu___singlenode___1__GROMACS_2021_3_foss_2021a_CUDA_11_3_1_job.err
#SBATCH --time=0:30:0
#SBATCH --partition=pascal_gpu --gpus-per-node=1
module load GROMACS/2021.3-foss-2021a-CUDA-11.3.1
export OMP_NUM_THREADS=10
curl -LJO https://github.com/victorusu/GROMACS_Benchmark_Suite/raw/1.0.0/HECBioSim/hEGFRDimer/benchmark.tpr
srun gmx_mpi mdrun -nb gpu -s benchmark.tpr -dlb yes -ntomp 10 -npme -1

PERFORMANCE REPORT
------------------------------------------------------------------------------
GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=gpu %scale=('singlenode', 1) %module_name=GROMACS/2021.3-foss-2021a-CUDA-11.3.1
- hydra:pascal
   - builtin
      * num_tasks: 1
      * perf: 8.227 ns/day

Improve GROMACS test by using new `system features` support in ReFrame

See https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html?highlight=feature#reframe.core.pipeline.RegressionTest.valid_systems

Concretely: with https://github.com/EESSI/test-suite/pulls ReFrame instantiates all combinations of partitions and e.g. nb_impl. This also generates invalid combinations, which we are currently skipping (e.g. nb_impl=gpu on a partition with CPU-only nodes).

We should probably specify in the test that it requires the partition to have certain features, as described here https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html?highlight=feature#reframe.core.pipeline.RegressionTest.valid_systems

A code snippet of what that could look like:

    @run_after('init')
    def apply_module_info(self):
        self.s, self.e, self.m = self.module_info
        valid_systems = self.s
        if self.nb_impl == 'gpu':
            # select only partitions that advertise the 'gpu' feature
            valid_systems = '+gpu'
        self.valid_systems = [valid_systems]
        self.modules = [self.m]
        self.valid_prog_environs = [self.e]

and then in the ReFrame config file we should have something like:

...
'partitions': [
    {
        'name': 'my_partition',
        ...
        'features': [
            'gpu',
        ],
        ...
    }
]

Following the steps in the README leads to `undefined parameters`

Command executed:

[satishk@int4 projects]$ PYTHONPATH=$PYTHONPATH:$EBROOTREFRAME:$eessihome reframe -vvvv -C eessi_reframe/settings_example.py -c test-suite/eessi/reframe/eessi_checks/applications/ -t CI -t 1_node  -l

Part of the output:

Looking for tests in '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications'
Validating '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications/__init__.py': not a test file
Validating '/gpfs/home5/satishk/projects/test-suite/eessi/reframe/eessi_checks/applications/gromacs_check.py': OK
WARNING: skipping test 'GROMACS_EESSI': test has one or more undefined parameters
  > Loaded 0 test(s)
Loaded 0 test(s)

Config settings file:

""" This file is a settings file for eessi test suite. """
from os import environ
username = environ.get('USER')

# This is an example configuration file
site_configuration = {
    'systems': [
        {
            'name': 'snellius_eessi',
            'descr': 'example_cluster',
            'modules_system': 'lmod',
            'hostnames': ['tcn*', 'gcn*'],
            # Note that the stagedir should be a shared directory available on
            # all nodes running ReFrame tests
            'stagedir': f'/scratch-shared/{username}/reframe_output/staging',
            'partitions': [
                {
                    'name': 'cpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access':  ['-p thin'],
                    'environs': ['default'],
                    'max_jobs': 4,
                    'processor': {
                        'num_cpus': 128,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 64,
                        'arch': 'znver2',
                    },
                    'features': ['cpu'],
                    'descr': 'CPU partition'
                },
                {
                    'name': 'gpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access':  ['-p gpu'],
                    'environs': ['default'],
                    'max_jobs': 4,
                    'processor': {
                        'num_cpus': 72,
                        'num_sockets': 2,
                        'num_cpus_per_socket': 36,
                        'arch': 'icelake',
                    },
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        }
                    ],
                    'devices': [
                        {
                            'type': 'gpu',
                            'num_devices': 4,
                        }
                    ],
                    'features': ['cpu', 'gpu'],
                    'descr': 'GPU partition'
                },
            ]
        },
    ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': '',
        },
    ],
    'logging': [
        {
            'level': 'debug',
            'handlers': [
                {
                    'type': 'stream',
                    'name': 'stdout',
                    'level': 'info',
                    'format': '%(message)s'
                },
                {
                    'type': 'file',
                    'name': 'reframe.log',
                    'level': 'debug',
                    'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s',   # noqa: E501
                    'append': False
                }
            ],
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': (
                        '%(check_job_completion_time)s|reframe %(version)s|'
                        '%(check_info)s|jobid=%(check_jobid)s|'
                        '%(check_perf_var)s=%(check_perf_value)s|'
                        'ref=%(check_perf_ref)s '
                        '(l=%(check_perf_lower_thres)s, '
                        'u=%(check_perf_upper_thres)s)|'
                        '%(check_perf_unit)s'
                    ),
                    'append': True
                }
            ]
        }
    ],
}

add GROMACS minimal version check

A test using the hEGFRDimerSmallerPL GROMACS input fails if the GROMACS version is too old:

Program:     gmx mdrun, version 2019.3
Source file: src/gromacs/fileio/tpxio.cpp (line 2695)
MPI rank:    0 (out of 96)

Fatal error:
reading tpx file (benchmark.tpr) version 119 with version 116 program
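A hedged sketch of what such a check could look like, assuming the test has a module_name parameter like the tests above (the version extraction and the cut-off value are simplistic placeholders):

```
from reframe.core.builtins import run_after


# method on the test class
@run_after('init')
def skip_old_gromacs_versions(self):
    # hypothetical cut-off; the exact minimal version would need to be
    # determined per benchmark input
    min_version = '2020'
    # e.g. 'GROMACS/2021.3-foss-2021a-CUDA-11.3.1' -> '2021.3'
    version = self.module_name.split('/')[1].split('-')[0]
    self.skip_if(version < min_version,
                 f'GROMACS {version} is too old to read this benchmark input')
```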

using `--cpu-only` + `--gpu-only` options supported by ReFrame

I had this bit written up for the docs on the EESSI test suite, but neither of these options seems to work as intended with our GROMACS/TensorFlow tests:

#### Filtering by device (CPU, GPU)

By default, ReFrame will generate variants of tests for each applicable
device type, based on the specified [`features`](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#config.systems.partitions.features) for system partitions (in the ReFrame configuration file) and [`valid_systems`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.valid_systems) value of the available tests.

To only run checks on CPU, you can use the [`--cpu-only` option](https://reframe-hpc.readthedocs.io/en/stable/manpage.html#cmdoption-cpu-only).

To only run tests on GPU, you can use the [`--gpu-only` option](https://reframe-hpc.readthedocs.io/en/stable/manpage.html#cmdoption-gpu-only).

For example, to only run tests on GPU:

```
reframe --gpu-only
```

You can use `--list` to check the impact of these options on generated checks.
  • When using `--gpu-only --list`, no checks are generated.
  • When using `--cpu-only --list`, I see GROMACS CUDA modules being used to generate tests (they should only be used for GPU checks?).

installation with `pip` doesn't work with older `setuptools` versions

The resulting installation is basically empty; even `import eessi` doesn't work.

When I use python setup.py sdist to see what's collected in a source tarball, it's clear that eessi/* is skipped entirely:

$ python3 setup.py sdist
running sdist
running egg_info
writing eessi/testsuite/eessi_testsuite.egg-info/PKG-INFO
writing dependency_links to eessi/testsuite/eessi_testsuite.egg-info/dependency_links.txt
writing requirements to eessi/testsuite/eessi_testsuite.egg-info/requires.txt
writing top-level names to eessi/testsuite/eessi_testsuite.egg-info/top_level.txt
reading manifest file 'eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt'
writing manifest file 'eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url

warning: check: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied

creating eessi-testsuite-0.0.2
creating eessi-testsuite-0.0.2/eessi
creating eessi-testsuite-0.0.2/eessi/testsuite
creating eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying files to eessi-testsuite-0.0.2...
copying README.md -> eessi-testsuite-0.0.2
copying setup.cfg -> eessi-testsuite-0.0.2
copying setup.py -> eessi-testsuite-0.0.2
copying eessi/testsuite/eessi_testsuite.egg-info/PKG-INFO -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/SOURCES.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/dependency_links.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/requires.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
copying eessi/testsuite/eessi_testsuite.egg-info/top_level.txt -> eessi-testsuite-0.0.2/eessi/testsuite/eessi_testsuite.egg-info
Writing eessi-testsuite-0.0.2/setup.cfg
creating dist
Creating tar archive
removing 'eessi-testsuite-0.0.2' (and everything under it)

This is probably due to the older setuptools version I'm using here (default version on RHEL8):

$ python3 -m pip show setuptools
Name: setuptools
Version: 39.2.0
Summary: Easily download, build, install, upgrade, and uninstall Python packages
Home-page: https://github.com/pypa/setuptools
Author: Python Packaging Authority
Author-email: [email protected]
License: UNKNOWN
Location: /usr/lib/python3.6/site-packages
Requires:
Required-by: jsonschema, setools, vsc-install, Sphinx, protobuf
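If the cause is indeed the namespace-package layout not being picked up, a possible fix (an assumption, not verified against setuptools 39.2.0) is to collect the packages explicitly with find_namespace_packages, which requires setuptools >= 40.1.0:

```
# setup.py sketch: explicitly collect the PEP 420 'eessi' namespace
from setuptools import find_namespace_packages, setup

setup(
    name='eessi-testsuite',
    version='0.0.2',
    packages=find_namespace_packages(include=['eessi.*']),
)
```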

Make `eessi_utils.utils.find_modules` more specific

Currently, find_modules searches for substrings. This can cause issues when one software's name occurs as a suffix of another module's name.

Example: if I have the modules

TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
Horovod/0.22.1-foss-2021a-CUDA-11.3.1-TensorFlow-2.6.0

I would want to run the TensorFlow test only on the first. However, find_modules will return both. Note that the test would still run when loading the Horovod module; there's just no added benefit in running the same test twice, since you're essentially testing the same module.

We should probably make the function more specific, so that it only matches modules whose name starts with the given string.
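A minimal sketch of a stricter match, assuming ReFrame's modules-system API (available_modules does a substring search, which we then narrow down):

```
import reframe.core.runtime as rt


def find_modules(substr):
    """Yield only modules whose name starts with '<substr>/'."""
    modules_system = rt.runtime().modules_system
    for mod in modules_system.available_modules(substr):
        # 'TensorFlow' matches 'TensorFlow/2.6.0-...', but not
        # 'Horovod/0.22.1-...-TensorFlow-2.6.0'
        if mod.startswith(f'{substr}/'):
            yield mod
```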

Missing os deps when installing ReFrame on graviton nodes in our AWS environment

[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ rm -rf /tmp/reframe_421
[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ python3 -m venv /tmp/reframe_421
[casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ source /tmp/reframe_421/bin/activate
(reframe_421) [casparvl@fair-mastodon-c6g-2xlarge-0001 ~]$ python3 -m pip install reframe-hpc==4.2.1
Collecting reframe-hpc==4.2.1
  Using cached https://files.pythonhosted.org/packages/aa/6c/a7a5cc465ef223bf686717b6034a817be7028f30b28b4d2b840f1c29bda3/ReFrame_HPC-4.2.1-py3-none-any.whl
Collecting archspec (from reframe-hpc==4.2.1)
  Using cached https://files.pythonhosted.org/packages/63/ae/333e7d216dda9134558ddc30792d96bfc58968ff5cc69b4ad9e02dfac654/archspec-0.2.1-py3-none-any.whl
Collecting argcomplete (from reframe-hpc==4.2.1)
  Using cached https://files.pythonhosted.org/packages/4f/ef/8b604222ba5e5190e25851aa3a5b754f2002361dc62a258a8e9f13e866f4/argcomplete-3.1.1-py3-none-any.whl
Collecting jsonschema (from reframe-hpc==4.2.1)
  Using cached https://files.pythonhosted.org/packages/c5/8f/51e89ce52a085483359217bc72cdbf6e75ee595d5b1d4b5ade40c7e018b8/jsonschema-3.2.0-py2.py3-none-any.whl
Collecting PyYAML (from reframe-hpc==4.2.1)
  Using cached https://files.pythonhosted.org/packages/36/2b/61d51a2c4f25ef062ae3f74576b01638bebad5e045f747ff12643df63844/PyYAML-6.0.tar.gz
Collecting lxml (from reframe-hpc==4.2.1)
  Using cached https://files.pythonhosted.org/packages/06/5a/e11cad7b79f2cf3dd2ff8f81fa8ca667e7591d3d8451768589996b65dec1/lxml-4.9.2.tar.gz
    Complete output from command python setup.py egg_info:
    Building lxml version 4.9.2.
    Building without Cython.
    Error: Please make sure the libxml2 and libxslt development packages are installed.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-bospx5ra/lxml/
You are using pip version 9.0.3, however version 23.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

@boegel can you have a look at this? It seems to be missing the development headers for libxml2 and libxslt.

Explore solutions for 'portable' performance monitoring with ReFrame

ReFrame supports performance checking, but this requires hard-coding the system name & expected performance in the tests. This breaks test portability. Some ideas would be:

  • Use the standard ReFrame way of comparing against expected performance, but place actual performance references in some file external to the test. These would be system specific, but would at least allow performance checks on known systems, e.g. the ones used in EESSI's own CI.
  • Compare results to known results, with some very large margin. At least this could capture complete mistakes.
  • Implement support in ReFrame for outlier detection, instead of hard performance references. This could check e.g. if the current result is within +/- 2 SD of the past 100 results.
  • Implement support in ReFrame for modelling expected performance (e.g. predict expected performance based on core counts, clock speed, processor model...), and then compare to that (again with large margin).
  • Do performance checks externally, based on the ReFrame report.

This task should explore the different options, then decide on how to proceed (e.g. maybe we go for option (1) in the short term, but work on other options long term?). Since we planned on co-designing with ReFrame, the options that require new ReFrame support are also attractive.
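A minimal sketch of option (1), assuming a JSON file with per-system references stored next to the test (the file name, layout, and margins are all made up):

```
import json
import os

from reframe.core.builtins import run_after


# method on the test class
@run_after('setup')
def set_reference_from_file(self):
    ref_file = os.path.join(os.path.dirname(__file__), 'perf_refs.json')
    with open(ref_file) as f:
        refs = json.load(f)  # e.g. {"snellius:thin": 5.6}
    key = self.current_partition.fullname
    if key in refs:
        # allow 25% deviation either way around the stored reference
        self.reference = {key: {'perf': (refs[key], -0.25, 0.25, 'ns/day')}}
```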

Running CUDA modules on pure `cpu` partitions

Context

Some applications that are compiled with CUDA, such as GROMACS, run on pure CPU partitions without complaints, but others, such as OSU (#54), require libcuda.so.1, which their binaries are linked against. A suggested approach was to make the stub libcuda.so.1 findable via LD_LIBRARY_PATH when the real one is not available.

Problems

  • Can a cuda feature be defined at the partition level?
  • Is it possible to selectively pass an environment variable for a few partitions only?

Potential solutions

  • We already have a CUDA feature defined, which addresses the first problem.
  • The second problem can be sorted out by passing the environment variable as part of the launcher command, such as mpirun -x .

OSU test

We want to create certain low-level tests as well, to validate basic functionality of communication libraries etc. One of those is OSU. This will also help us in figuring out how the test 'blueprint' for GROMACS works for these types of tests.

Support for hierarchical module schemes

As a result of a conversation with @bartoldeman, I tried to run our GROMACS test on the software stack from The Alliance, but it failed because our current tests (specifically find_modules) assume a flat module naming scheme.

We should think about whether/how we can improve this. Probably have a look at how The Alliance specifies their ReFrame tests right now.

implement better support for cmd line specified valid_systems

currently, if you specify valid_systems on the command line, no automatic filtering is done, so you have to do the filtering yourself.

the reason is that unfortunately, combining systemname:partitionname with features is not fully supported in the current version of ReFrame (4.0.5).

for example, this works (multiple list items behave like an OR operation):

valid_partitions = ['*:ampere', '+cpu']

but this does not (multiple space-separated items within one string behave like an AND operation):

valid_partitions = ['*:ampere +cpu']

this is what ReFrame shows in this case:

WARNING: skipping test 'GROMACS_EESSI': type error: test-suite/eessi/reframe/eessi-checks/applications/gromacs_check.py:62: failed to set field 'valid_systems': '['*:ampere +gpu']' is not of type 'List[Str[r'^(((\*|(\w[-.\w]*))(:(\*|(\w[-.\w]*)))?)|(([+-](\w[-.\w]*))|(%(\w[-.\w]*)=\S+))(\s+(([+-](\w[-.\w]*))|(%(\w[-.\w]*)=\S+)))*)$']]'
                self.valid_systems = [valid_systems]

maybe it's possible to check the features of the partitions that are specified on the command line, and filter the tests based on those

more info here: https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.valid_systems

`GROMACS/2020.1-foss-2020a-Python-3.8.2` on 2 nodes fails on Vega

(reframe_421) [eucasparvl@vglogin0003 ~]$ reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t "1_node|2_nodes" -r
[ReFrame Setup]
  version:           4.2.1
  command:           '/tmp/reframe_421/bin/reframe -C test-suite/config/izum_vega.py -c test-suite/eessi/testsuite/tests/apps/ -R -t CI -t 1_node|2_nodes -r'
  launched by:       [email protected]
  working directory: '/ceph/hpc/home/eucasparvl'
  settings files:    '<builtin>', 'test-suite/config/izum_vega.py'
  check search path: (R) '/ceph/hpc/home/eucasparvl/test-suite/eessi/testsuite/tests/apps'
  stage directory:   '/ceph/hpc/home/eucasparvl/reframe_runs/staging'
  output directory:  '/ceph/hpc/home/eucasparvl/reframe_runs/output'
  log files:         '/ceph/hpc/home/eucasparvl/reframe_20230622_164619.log'

[==========] Running 4 check(s)
[==========] Started on Thu Jun 22 16:47:10 2023

[----------] start processing checks
[ RUN      ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /c3790f2a @vega:cpu+default
[ RUN      ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /5535abba @vega:cpu+default
[ RUN      ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /a108fe65 @vega:cpu+default
[ RUN      ] GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /695f354c @vega:cpu+default
[       OK ] (1/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /c3790f2a @vega:cpu+default
P: perf: 345.984 ns/day (r:0, l:None, u:None)
[       OK ] (2/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /695f354c @vega:cpu+default
P: perf: 6.044 ns/day (r:0, l:None, u:None)
[       OK ] (3/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=1_node %module_name=GROMACS/2020.4-foss-2020a-Python-3.8.2 /a108fe65 @vega:cpu+default
P: perf: 5.61 ns/day (r:0, l:None, u:None)
[     FAIL ] (4/4) GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 /5535abba @vega:cpu+default
==> test failed during 'sanity': test staged in '/ceph/hpc/home/eucasparvl/reframe_runs/staging/vega/cpu/default/GROMACS_EESSI_5535abba'
[----------] all spawned checks have finished

[  FAILED  ] Ran 4/4 test case(s) from 4 check(s) (1 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Jun 22 17:17:42 2023
================================================================================================================================================================
SUMMARY OF FAILURES
----------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for GROMACS_EESSI %benchmark_info=HECBioSim/Crambin %nb_impl=cpu %scale=2_nodes %module_name=GROMACS/2020.1-foss-2020a-Python-3.8.2 (run: 1/1)
  * Description: GROMACS HECBioSim/Crambin benchmark (NB: cpu)
  * System partition: vega:cpu
  * Environment: default
  * Stage directory: /ceph/hpc/home/eucasparvl/reframe_runs/staging/vega/cpu/default/GROMACS_EESSI_5535abba
  * Node list: cn0403,cn0405
  * Job type: batch job (id=65838140)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /5535abba -p default --system vega:cpu -r'
  * Reason: sanity error: pattern 'Finished mdrun' not found in 'md.log'
--- rfm_job.out (first 10 lines) ---
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

100  673k  100  673k    0     0  1081k      0 --:--:-- --:--:-- --:--:-- 1081k
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
--- rfm_job.err ---
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/ceph/hpc/home/eucasparvl/reframe_20230622_164619.log'

GROMACS Crambin test fails at larger task counts

Command line:
  gmx_mpi mdrun -nb cpu -s benchmark.tpr -dlb yes -npme -1 -ntomp 1

Reading file benchmark.tpr, VERSION 5.1.4 (single precision)
Note: file tpx version 103, software tpx version 119
Changing nstlist from 10 to 80, rlist from 1.2 to 1.321


-------------------------------------------------------
Program:     gmx mdrun, version 2020.1-EasyBuild-4.5.0
Source file: src/gromacs/domdec/domdec.cpp (line 2277)
MPI rank:    0 (out of 2048)

Fatal error:
There is no domain decomposition for 1536 ranks that is compatible with the
given box and a minimum cell size of 0.6725 nm
Change the number of ranks or mdrun option -rdd or -dds
Look in the log file for details on the domain decomposition

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors

This Crambin test is the one tagged with the CI ReFrame tag.

One solution would be to make the tagging a bit more complicated, and use the Crambin test only for the singlenode case (and possibly two nodes). A more intricate solution could be to also implement a maximum task count for this test, i.e. let it auto-select a task count, and afterwards check if that is larger than 1536 and cap it (see the sketch below). It does mean testing on more nodes is then useless, so if those tests are still generated it would lead to a waste of resources.
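A hedged sketch of that capping idea (the safe maximum for this input is an assumption, since the run above already failed when GROMACS tried to decompose over 1536 ranks):

```
from reframe.core.builtins import run_after

MAX_CRAMBIN_TASKS = 1024  # hypothetical safe upper bound


# method on the test class
@run_after('setup')
def cap_num_tasks(self):
    if self.num_tasks > MAX_CRAMBIN_TASKS:
        self.num_tasks = MAX_CRAMBIN_TASKS
```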

For now, I'd prefer just tagging a large test case for larger node counts.

Check binding of threads and processes for GROMACS

Currently, we don't set any binding options explicitly for the GROMACS test. This may or may not result in 'sensible' binding. We should check how we can make sure that it binds in a reasonable way. We should also check that it launches the correct number of tasks on hyperthreading nodes (1 per physical core, not 1 per hardware thread).

For binding, we'd want processes to be bound to physical cores for pure MPI (i.e. CPU runs of GROMACS). For GPU runs (essentially hybrid OpenMP+MPI) we should bind at least the tasks to 1/x-th of the node's CPU cores (if there are x GPUs in the node). Even better would be to also add thread binding to this.
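As a starting point, a Slurm-specific sketch that requests core binding from the srun launcher (assuming srun is the launcher; hybrid runs would additionally need a correct --cpus-per-task and OpenMP thread-binding setup):

```
from reframe.core.builtins import run_after


# method on the test class
@run_after('setup')
def set_cpu_binding(self):
    # ask srun to bind each task to physical cores (Slurm-specific flag)
    self.job.launcher.options += ['--cpu-bind=cores']
```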

add support for specifying GPU vendor in configuration

with a cuda feature, we can more explicitly filter on devices that support CUDA (required for tests that use a CUDA module on a GPU).

later on, we can more easily add rocm and oneapi when AMD and Intel GPU modules become available in EESSI.

handle time limit better

  • use more appropriate time limits
  • show better error messages

copying discussion of #28

from @casparvl

For the 1_core, 2_core and 4_core CPU tests, the walltime is too short.

Point number 2 makes me think. First of all, the error is quite non-descriptive:

FAILURE INFO for GROMACS_EESSI_093
  * Expanded name: GROMACS_EESSI %benchmark_info=HECBioSim/hEGFRDimer %nb_impl=cpu %scale=4_cores %module_name=GROMACS/2021.6-foss-2022a
  * Description: GROMACS HECBioSim/hEGFRDimer benchmark (NB: cpu)
  * System partition: snellius:thin
  * Environment: default
  * Stage directory: /scratch-shared/casparl/reframe_output/staging/snellius/thin/default/GROMACS_EESSI_1dfdd606
  * Node list:
  * Job type: batch job (id=2773366)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /1dfdd606 -p default --system snellius:thin -r'
  * Reason: sanity error: pattern 'Finished mdrun' not found in 'md.log'

Maybe we should make a standard sanity check that verifies that the job output does not contain something like

slurmstepd: error: *** JOB 2773368 ON tcn509 CANCELLED AT 2023-05-19T20:40:10 DUE TO TIME LIMIT ***

I'm not sure how this generalizes to other systems (I'm assuming all Slurm-based systems print this by default), but even so: it doesn't hurt to check. At least there is a better chance of getting a clear error message.
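A minimal sketch of such a check, using a ReFrame sanity assertion on the job's stderr (the pattern is taken from the Slurm message above):

```
import reframe.utility.sanity as sn
from reframe.core.builtins import sanity_function


# method on the test class
@sanity_function
def assert_not_killed_by_time_limit(self):
    return sn.assert_not_found(r'DUE TO TIME LIMIT', self.stderr,
                               msg='job was cancelled due to the time limit')
```

In the real tests this would have to be combined with the existing 'Finished mdrun' pattern, e.g. via sn.all().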

Secondly, how do we make sure we don't run out of walltime? Sure, we could just specify a very long time, but that can be problematic as well (not satisfying max walltimes on a queue, and in our case, jobs <1h actually get backfilled behind a floating reservation that we have in order to reserve some dedicated nodes for short jobs). Should we scale the max walltime based on the amount of resources?

from @smoors

about the walltime issue:

  • we cannot know on which cpu archs this test will run, so it's difficult to guess an appropriate time limit for all users, even if we scale it. even more difficult to guess is how fast this will run on future computers. the 4_cores CPU test actually succeeded for me on a skylake node.
  • alternatively, as we agreed that the tests should be short, we could filter out those scales that we expect to take longer than 30 minutes to run. or at least print a warning.
  • ideally we should also check the walltime of the other benchmarks instead of only the first one (HECBioSim/hEGFRDimer).

about checking for the exceeded time limit message:
i think that's a good idea. we could expand that to also check for an out-of-memory message, and maybe other slurm messages that i don't know of. even better would be to check that the job error file is empty, but Gromacs prints some stuff to stderr, would be nice if there is a way to force Gromacs to not do that.

replace hardcoded strings with constants

mostly 'cpu' and 'gpu' for now, maybe also 'CUDA'

also think about more fine-grained/hierarchical naming of more specific devices

also move constants to a separate constants.py file
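A minimal sketch of what that constants.py could contain (names are placeholders):

```
# constants.py (sketch)
CPU = 'cpu'
GPU = 'gpu'
CUDA = 'CUDA'

DEVICE_TYPES = [CPU, GPU]
```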
