openhpca's People

Contributors

brodywilliams, gvallee, raffenet

openhpca's Issues

SMB msgrate benchmark hangs

I encountered an issue with the SMB msgrate benchmark when running the long version of the benchmark. Execution appears to hang early on, and the job is eventually killed by Slurm when the time limit is reached.

Looking at the script generated by the Go infrastructure, one likely problem is that the benchmark-specific parameters are not currently passed in. However, something else may also be going on: adding the correct flags did not fix the issue in my case. Execution seems to hang at or around MPI_Init_thread(), for reasons not yet identified.

Error at the end of execution

I sometimes see the following error after all jobs are done. It is unclear at this point whether a job failed or whether the compilation of results is just running into a problem:

ERROR: unable to display results: undefined data
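One way to make this ambiguity easier to resolve would be to wrap a distinct sentinel error for each stage, so the final message can be traced to its cause. The following is only a minimal sketch: `ErrJobFailed`, `ErrUndefinedData`, `compile`, and `Diagnose` are hypothetical names chosen for illustration, not part of the existing code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, one per failure stage.
var (
	ErrJobFailed     = errors.New("job failed")
	ErrUndefinedData = errors.New("undefined data")
)

// compile is a hypothetical stand-in for the result compilation step.
// It wraps a sentinel error so callers can tell the two cases apart.
func compile(data map[string]float64, jobOK bool) error {
	if !jobOK {
		return fmt.Errorf("unable to display results: %w", ErrJobFailed)
	}
	if len(data) == 0 {
		return fmt.Errorf("unable to display results: %w", ErrUndefinedData)
	}
	return nil
}

// Diagnose maps an error back to the stage that produced it.
func Diagnose(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, ErrJobFailed):
		return "a job failed"
	case errors.Is(err, ErrUndefinedData):
		return "results could not be compiled"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(Diagnose(compile(nil, false)))
	fmt.Println(Diagnose(compile(nil, true)))
	fmt.Println(Diagnose(compile(map[string]float64{"latency": 1.25}, true)))
}
```

With this shape, the top-level message can stay the same while the wrapped cause says which stage went wrong.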

WebUI should rely on the result package

The WebUI code is doing too much of the processing needed to display the results, which increases the likelihood of gaps appearing between the WebUI and the text result file.
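A rough sketch of the idea, assuming a shared result type that both outputs consume: all formatting of metrics happens in one place (`Rows`), and the text file and the WebUI only differ in how they wrap those rows. The `Metric` and `Results` types and the `RenderText` and `RenderHTML` functions are hypothetical, not the project's actual API.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Metric is a hypothetical record for one benchmark measurement.
type Metric struct {
	Name  string
	Value float64
	Unit  string
}

// Results is a hypothetical container owned by the result package.
type Results struct {
	Metrics []Metric
}

// Rows is the single place where metrics are turned into displayable rows.
func (r Results) Rows() []string {
	rows := make([]string, 0, len(r.Metrics))
	for _, m := range r.Metrics {
		rows = append(rows, fmt.Sprintf("%s: %.2f %s", m.Name, m.Value, m.Unit))
	}
	return rows
}

// RenderText builds the content of the text result file.
func RenderText(r Results) string {
	return strings.Join(r.Rows(), "\n") + "\n"
}

// RenderHTML builds a WebUI fragment from the same rows as RenderText.
func RenderHTML(r Results) string {
	var b strings.Builder
	b.WriteString("<ul>\n")
	for _, row := range r.Rows() {
		b.WriteString("  <li>" + html.EscapeString(row) + "</li>\n")
	}
	b.WriteString("</ul>\n")
	return b.String()
}

func main() {
	r := Results{Metrics: []Metric{
		{Name: "bandwidth", Value: 12345.6, Unit: "MB/s"},
		{Name: "latency", Value: 1.25, Unit: "us"},
	}}
	fmt.Print(RenderText(r))
	fmt.Print(RenderHTML(r))
}
```

Because both renderers go through `Rows`, a metric added to the results shows up in the text file and the WebUI at the same time.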

Overlap profiling

Taylor suggested providing a profiling capability that would give more details about overlap when running real applications and, if possible, about the potential for additional overlap. This could initially be investigated within the collective profiler (https://github.com/gvallee/collective_profiler). The two projects share some of the underlying building blocks, so, if relevant, it should be possible to integrate them.

Unable to complete overlap NBC's with Time-Driven Execution -- "Cannot Further increase n_elts"

I'm trying to run some experiments with the time-driven execution model inside the overlap_XYZ collectives, and for all of them, regardless of MPI library, I get the following error (see attached):

image

This error occurs no matter how high I set OPENHPCA_OVERLAP_MAX_NUM_ELTS, though for some reason, I'm able to get per-message-size latencies if I turn on the calibration feature (not documented, see overlap.h for OPENHPCA_OVERLAP_CALIBRATION). Because of this, I'm not able to see the final output generated for the Time-Driven model (work injected, latency, %-overlap, etc.)

In this particular case, I'm running overlap_igather with the following command: mpirun -np <ntasks> --hostfile hostfile /path/to/overlap_igather after exporting OPENHPCA_OVERLAP_MAX_NUM_ELTS

What can I do to fix this?
Thank you!
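For reproducing this from a script, the command in this report can be assembled with the variable set explicitly on the launcher's environment. This is a minimal sketch: `OverlapCommand` is a hypothetical helper, and the task count, hostfile, and binary path are placeholders taken from the report. Whether the variable reaches ranks on remote nodes depends on the MPI launcher, which this sketch does not address.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// OverlapCommand builds (but does not run) the mpirun invocation from the
// report, with OPENHPCA_OVERLAP_MAX_NUM_ELTS appended to the environment.
// The helper itself is hypothetical; only the command line and the variable
// name come from the report.
func OverlapCommand(ntasks int, hostfile, binary string, maxElts int) *exec.Cmd {
	cmd := exec.Command("mpirun", "-np", strconv.Itoa(ntasks), "--hostfile", hostfile, binary)
	// Appending last means this value wins over any inherited duplicate.
	cmd.Env = append(os.Environ(), "OPENHPCA_OVERLAP_MAX_NUM_ELTS="+strconv.Itoa(maxElts))
	return cmd
}

func main() {
	cmd := OverlapCommand(4, "hostfile", "/path/to/overlap_igather", 1000)
	// Print the arguments instead of running, so the sketch works without MPI.
	fmt.Println(cmd.Args)
}
```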

Update the text file result

Make sure the text file with the results includes all the metrics. I have seen a potential issue during a test on a new platform.

Setup issues (redirect from older repo)

login1[2](~/src/openhpca) git clone --recurse-submodules [email protected]:openucx/openhpca.git
Cloning into 'openhpca'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 69 (delta 15), reused 68 (delta 15), pack-reused 0
Receiving objects: 100% (69/69), 51.21 KiB | 2.05 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Submodule 'SMB' (https://github.com/sandialabs/SMB) registered for path 'SMB'
Submodule 'osu_noncontig_mem' (https://github.com/yqin/osu-micro-benchmarks) registered for path 'osu_noncontig_mem'
 Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/SMB'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 35 (delta 8), reused 35 (delta 8), pack-reused 0
Cloning into '/lustre/home/arcurtis/src/openhpca/openhpca/osu_noncontig_mem'...
remote: Enumerating objects: 183, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 183 (delta 101), reused 175 (delta 93), pack-reused 0
Receiving objects: 100% (183/183), 598.25 KiB | 8.43 MiB/s, done.
Resolving deltas: 100% (101/101), done.
Submodule path 'SMB': checked out 'fd975bf980a33b9d7241d62871c647d8dd798af8'
Submodule path 'osu_noncontig_mem': checked out 'f3333c93a62bfd2acdc5284db871705f515adea1'

login1[2](~/src/openhpca) cd openhpca/

login1[2](~/src/openhpca/openhpca) make init

make: ./tools/cmd/openhpca_setup/openhpca_setup: Command not found
make: *** [Makefile:19: init] Error 127

login1[2](~/src/openhpca/openhpca) make
openhpca_setup.go:20:2: cannot find package "github.com/gvallee/openhpca/tools/internal/pkg/config" in any of:
	/lustre/home/arcurtis/goroot/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOROOT)
	/lustre/home/arcurtis/go/src/github.com/gvallee/openhpca/tools/internal/pkg/config (from $GOPATH)
make[1]: *** [Makefile:14: openhpca_setup] Error 1
make: *** [Makefile:13: tools] Error 2
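The error output lists the directories that were searched for the package. In GOPATH mode, an import path is looked up under `$GOROOT/src` and then under each `$GOPATH` entry's `src` directory, so a clone placed elsewhere is not found. The sketch below only computes those locations for comparison with the clone directory; `SearchDirs` is a hypothetical helper, and the paths in `main` are copied from the log above.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// SearchDirs returns the directories where a GOPATH-mode build looks for
// importPath: first under GOROOT, then under each GOPATH entry.
func SearchDirs(importPath, goroot, gopath string) []string {
	dirs := []string{filepath.Join(goroot, "src", importPath)}
	for _, p := range filepath.SplitList(gopath) {
		dirs = append(dirs, filepath.Join(p, "src", importPath))
	}
	return dirs
}

func main() {
	pkg := "github.com/gvallee/openhpca/tools/internal/pkg/config"
	for _, d := range SearchDirs(pkg, "/lustre/home/arcurtis/goroot", "/lustre/home/arcurtis/go") {
		fmt.Println(d)
	}
}
```

Neither directory matches the clone location shown in the log (`/lustre/home/arcurtis/src/openhpca/openhpca`), which is consistent with the "cannot find package" message.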

Display run report

After running the suite, it would be helpful to display a report about the runs. For instance, it would be nice to have an explicit report when OPENHPCA_OVERLAP_MAX_NUM_ELTS needs to be set to increase the default value used by the overlap benchmarks.
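As a starting point, such a report could scan each benchmark's captured output for the n_elts message and list the affected benchmarks. This is a minimal sketch under stated assumptions: the `Report` function and its input map are hypothetical, and the marker text is taken from the title of the related issue, so the exact wording printed by the benchmarks may differ.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// nEltsMarker is assumed from the related issue title ("Cannot Further
// increase n_elts"); it is matched case-insensitively.
const nEltsMarker = "increase n_elts"

// Report returns one hint per benchmark whose output contains the marker.
// outputs maps a benchmark name to its captured output (hypothetical input).
func Report(outputs map[string]string) []string {
	names := make([]string, 0, len(outputs))
	for name := range outputs {
		names = append(names, name)
	}
	sort.Strings(names) // stable ordering for the report

	var hints []string
	for _, name := range names {
		sc := bufio.NewScanner(strings.NewReader(outputs[name]))
		for sc.Scan() {
			if strings.Contains(strings.ToLower(sc.Text()), nEltsMarker) {
				hints = append(hints, fmt.Sprintf(
					"%s: n_elts limit reached; OPENHPCA_OVERLAP_MAX_NUM_ELTS may need to be raised", name))
				break
			}
		}
	}
	return hints
}

func main() {
	outputs := map[string]string{
		"overlap_igather": "starting\nERROR: Cannot further increase n_elts\n",
		"overlap_ibcast":  "starting\ndone\n",
	}
	for _, h := range Report(outputs) {
		fmt.Println(h)
	}
}
```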
