openucx / openhpca
OpenHPCA working group repository
The default execution mode would only run the tests required to generate the metrics.
The long execution mode would run all the tests.
I encountered an issue with the SMB msgrate benchmark when testing with the long version of the benchmark run. It seems that the execution hangs early on and the job is eventually killed by Slurm when the time limit is reached.
Looking at the script generated by the Go infrastructure, one likely problem is that the benchmark-specific parameters are not currently passed in. However, something else may also be going on: adding the correct flags did not fix the issue in my case. Execution seems to hang at, or around, MPI_Init_thread(), for reasons yet to be discovered.
In short, the benchmark fails to execute because the necessary binary arguments are not passed in via the generated Slurm scripts.
@BrodyWilliams discovered that make init relies on openhpca_setup, but that binary is only generated when executing make install. I think make init should ensure openhpca_setup is correctly generated when necessary.
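Until that is fixed, a minimal workaround sketch: build the tool by hand before running make init. The binary path comes from the make error shown further below; the assumption that the tool builds with a plain go build from tools/cmd/openhpca_setup is mine, not documented behavior.
# Workaround sketch (assumption: openhpca_setup builds with a plain `go build`)
(cd tools/cmd/openhpca_setup && go build -o openhpca_setup .)
make init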
I see the following error sometimes after all jobs are done. Unclear at this point if a job failed or if the compilation of results is just facing a problem:
ERROR: unable to display results: undefined data
The WebUI code is doing too much of the processing to display the results, which increases the likelihood of gaps between the WebUI and the text result file.
There is a need to run benchmark suites like OSU in an automated manner, so having a run option to choose a specific benchmark suite would be useful.
Taylor suggested providing a profiling capability that would give more details about overlap when running real applications and, if possible, about the potential for more overlap. This could initially be investigated within the collective profiler (https://github.com/gvallee/collective_profiler). The two projects share some of the underlying building blocks, so, if relevant, it should be possible to integrate them.
I'm trying to run some experiments with the time-driven execution model inside the overlap_XYZ collectives, and for all of them, regardless of MPI library, I get the following error (see attached):
This error occurs no matter how high I set OPENHPCA_OVERLAP_MAX_NUM_ELTS, though for some reason, I'm able to get per-message-size latencies if I turn on the calibration feature (not documented, see overlap.h for OPENHPCA_OVERLAP_CALIBRATION). Because of this, I'm not able to see the final output generated for the time-driven model (work injected, latency, %-overlap, etc.).
In this particular case, I'm running overlap_igather with the following command, after exporting OPENHPCA_OVERLAP_MAX_NUM_ELTS: mpirun -np <ntasks> --hostfile hostfile /path/to/overlap_igather
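For concreteness, the full launch sequence looks like the following; the value assigned to OPENHPCA_OVERLAP_MAX_NUM_ELTS and the task count are arbitrary placeholders I picked for illustration, not recommended settings:
# Placeholder values, not recommendations
export OPENHPCA_OVERLAP_MAX_NUM_ELTS=1048576
mpirun -np 8 --hostfile hostfile /path/to/overlap_igather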
What can I do to fix this?
Thank you!
Make sure the text file with the results includes all the metrics. I have seen a potential issue during a test on a new platform.
login1[2](~/src/openhpca) git clone --recurse-submodules git@github.com:openucx/openhpca.git
Cloning into 'openhpca'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 69 (delta 15), reused 68 (delta 15), pack-reused 0
Receiving objects: 100% (69/69), 51.21 KiB | 2.05 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Submodule 'SMB' (https://github.com/sandialabs/SMB) registered for path 'SMB'
Submodule 'osu_noncontig_mem' (https://github.com/yqin/osu-micro-benchmarks) registered for path 'osu_noncontig_mem'
Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/SMB'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 35 (delta 8), reused 35 (delta 8), pack-reused 0
Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/osu_noncontig_mem'...
remote: Enumerating objects: 183, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 183 (delta 101), reused 175 (delta 93), pack-reused 0
Receiving objects: 100% (183/183), 598.25 KiB | 8.43 MiB/s, done.
Resolving deltas: 100% (101/101), done.
Submodule path 'SMB': checked out 'fd975bf980a33b9d7241d62871c647d8dd798af8'
Submodule path 'osu_noncontig_mem': checked out 'f3333c93a62bfd2acdc5284db871705f515adea1'
login1[2](~/src/openhpca) cd openhpca/
login1[2](~/src/openhpca/openhpca) make init
make: ./tools/cmd/openhpca_setup/openhpca_setup: Command not found
make: *** [Makefile:19: init] Error 127
login1[2](~/src/openhpca/openhpca) make
openhpca_setup.go:20:2: cannot find package "github.com/gvallee/openhpca/tools/internal/pkg/config" in any of:
/lustre/home/arcurtis/goroot/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOROOT)
/lustre/home/arcurtis/go/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOPATH)
make[1]: *** [Makefile:14: openhpca_setup] Error 1
make: *** [Makefile:13: tools] Error 2
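The error above points at GOPATH-mode package resolution. A possible workaround sketch, assuming the build is not module-based and the sources simply need to be visible under the import path the compiler reports (both assumptions on my part):
# Assumption: GOPATH-mode build; expose the clone at the expected import path
mkdir -p $GOPATH/src/github.com/gvallee
ln -s ~/src/openhpca/openhpca $GOPATH/src/github.com/gvallee/openhpca
make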
https://github.com/yqin/osu-micro-benchmarks is not available anymore and breaks everything.
After running the suite, it would be helpful to display a report about the runs. For instance, it would be nice to have an explicit report when OPENHPCA_OVERLAP_MAX_NUM_ELTS needs to be set to a value larger than the default used by the overlap benchmarks.