openucx / openhpca
OpenHPCA working group repository
The default execution mode would only run the tests required to generate the metrics.
The long execution mode would run all the tests.
I encountered an issue with the SMB msgrate benchmark when testing with the long version of the benchmark run. It seems that the execution hangs early on and the job is eventually killed by Slurm when the time limit is reached.
Looking at the script generated by the Go infrastructure, one likely problem is that the benchmark-specific parameters are not currently passed in. However, something else may also be going on: adding the correct flags did not fix the issue in my case. Execution seems to hang at, or around, MPI_Init_thread(), for reasons yet to be discovered.
In short, the benchmark fails to execute because the necessary binary arguments are not passed in via the generated Slurm scripts.
@BrodyWilliams discovered that make init relies on openhpca_setup, but that binary is only generated when executing make install. I think make init should ensure openhpca_setup is correctly generated when necessary.
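Until that is fixed, a minimal workaround sketch: build the tool by hand before running make init. The binary path comes from the make error shown further below; the assumption that the tool builds with a plain go build from tools/cmd/openhpca_setup is mine, not documented behavior.
# Workaround sketch (assumption: openhpca_setup builds with a plain `go build`)
(cd tools/cmd/openhpca_setup && go build -o openhpca_setup .)
make init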
I see the following error sometimes after all jobs are done. Unclear at this point if a job failed or if the compilation of results is just facing a problem:
ERROR: unable to display results: undefined data
The WebUI code is doing too much of the processing to display the results, which increases the likelihood of gaps between the WebUI and the text result file.
There is a need to run benchmark suites like OSU in an automated manner, so having a run option to choose a specific benchmark suite would be useful.
Taylor suggested providing a profiling capability that would give more details about overlap when running real applications and, if possible, about the potential for more overlap. This could initially be investigated within the collective profiler (https://github.com/gvallee/collective_profiler). The two projects share some of the underlying building blocks, so, if relevant, it should be possible to integrate them.
I'm trying to run some experiments with the time-driven execution model inside the overlap_XYZ collectives, and for all of them, regardless of MPI library, I get the following error (see attached):
This error occurs no matter how high I set OPENHPCA_OVERLAP_MAX_NUM_ELTS, though for some reason, I'm able to get per-message-size latencies if I turn on the calibration feature (not documented, see overlap.h for OPENHPCA_OVERLAP_CALIBRATION). Because of this, I'm not able to see the final output generated for the time-driven model (work injected, latency, %-overlap, etc.).
In this particular case, I'm running overlap_igather with the following command, after exporting OPENHPCA_OVERLAP_MAX_NUM_ELTS: mpirun -np <ntasks> --hostfile hostfile /path/to/overlap_igather
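For concreteness, the full launch sequence looks like the following; the value assigned to OPENHPCA_OVERLAP_MAX_NUM_ELTS and the task count are arbitrary placeholders I picked for illustration, not recommended settings:
# Placeholder values, not recommendations
export OPENHPCA_OVERLAP_MAX_NUM_ELTS=1048576
mpirun -np 8 --hostfile hostfile /path/to/overlap_igather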
What can I do to fix this?
Thank you!
Make sure the text file with the results includes all the metrics. I have seen a potential issue during a test on a new platform.
login1[2](~/src/openhpca) git clone --recurse-submodules git@github.com:openucx/openhpca.git
Cloning into 'openhpca'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 69 (delta 15), reused 68 (delta 15), pack-reused 0
Receiving objects: 100% (69/69), 51.21 KiB | 2.05 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Submodule 'SMB' (https://github.com/sandialabs/SMB) registered for path 'SMB'
Submodule 'osu_noncontig_mem' (https://github.com/yqin/osu-micro-benchmarks) registered for path 'osu_noncontig_mem'
Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/SMB'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 35 (delta 8), reused 35 (delta 8), pack-reused 0
Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/osu_noncontig_mem'...
remote: Enumerating objects: 183, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 183 (delta 101), reused 175 (delta 93), pack-reused 0
Receiving objects: 100% (183/183), 598.25 KiB | 8.43 MiB/s, done.
Resolving deltas: 100% (101/101), done.
Submodule path 'SMB': checked out 'fd975bf980a33b9d7241d62871c647d8dd798af8'
Submodule path 'osu_noncontig_mem': checked out 'f3333c93a62bfd2acdc5284db871705f515adea1'
login1[2](~/src/openhpca) cd openhpca/
login1[2](~/src/openhpca/openhpca) make init
make: ./tools/cmd/openhpca_setup/openhpca_setup: Command not found
make: *** [Makefile:19: init] Error 127
login1[2](~/src/openhpca/openhpca) make
openhpca_setup.go:20:2: cannot find package "github.com/gvallee/openhpca/tools/internal/pkg/config" in any of:
/lustre/home/arcurtis/goroot/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOROOT)
/lustre/home/arcurtis/go/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOPATH)
make[1]: *** [Makefile:14: openhpca_setup] Error 1
make: *** [Makefile:13: tools] Error 2
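The error above points at GOPATH-mode package resolution. A possible workaround sketch, assuming the build is not module-based and the sources simply need to be visible under the import path the compiler reports (both assumptions on my part):
# Assumption: GOPATH-mode build; expose the clone at the expected import path
mkdir -p $GOPATH/src/github.com/gvallee
ln -s ~/src/openhpca/openhpca $GOPATH/src/github.com/gvallee/openhpca
make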
https://github.com/yqin/osu-micro-benchmarks is not available anymore and breaks everything.
After running the suite, it would be helpful to display a report about the runs. For instance, it would be nice to have an explicit report when OPENHPCA_OVERLAP_MAX_NUM_ELTS needs to be set to a value larger than the default used by the overlap benchmarks.