pika-org / pika

pika builds on C++ std::execution with fiber, CUDA, HIP, and MPI support.

Home Page: https://pikacpp.org

License: Boost Software License 1.0

Languages: C++ 87.99%, CMake 8.06%, Python 1.35%, Cuda 1.30%, Shell 1.09%, Assembly 0.16%, Awk 0.03%, Batchfile 0.01%, Nix 0.01%
Topics: concurrency, cplusplus, cpp, cuda, gpu, hip, mpi, p2300, parallelism, rocm, stdexec

pika's Issues

Revisit CPO structure

The sender/receiver CPOs currently use a helper base class to define fallback implementations with tag_fallback_invoke. The need for tag_fallback_invoke should be revisited, and the CPO types should potentially be moved into nested namespaces so that the tag_fallback_invoke overloads do not end up in the overload set of unrelated CPOs. This could improve compile times.
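
A minimal sketch of the direction this could take, assuming a plain tag_invoke-style CPO (the names are illustrative, not pika's actual code):

```cpp
// Hypothetical sketch: one CPO per nested namespace. Overloads of
// tag_invoke for start_t are then only associated with this CPO's tag
// type and never enter the overload set of unrelated CPOs.
namespace example {
    namespace start_ns {
        struct start_t
        {
            template <typename OperationState>
            void operator()(OperationState& os) const noexcept
            {
                // Found via ADL on start_t and OperationState only.
                tag_invoke(*this, os);
            }
        };
    }

    // Only the tag object is exported at namespace scope.
    inline constexpr start_ns::start_t start{};
}
```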

Use bors for all CI

This would reduce duplicate and unnecessary builds. It requires:

  • CI that is stable for all enabled builds
  • Jenkins sets the commit status on all builds

Add tracy support

https://github.com/wolfpld/tracy

This is not terribly difficult on a basic level, but integrating tracy into projects and running applications with it is a bit clunky, especially for multi-node runs, since a tracy-instrumented application needs to send its data to a server.
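
For reference, the basic instrumentation itself is small; a minimal sketch, assuming the application is compiled with TRACY_ENABLE defined and linked against the Tracy client:

```cpp
#include <tracy/Tracy.hpp>

void process_chunk()
{
    // Records a named zone for this scope and streams it to a running
    // Tracy server, which is what makes multi-node runs clunky.
    ZoneScopedN("process_chunk");
    // ... work ...
}
```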

#252 adds basic support. Next steps would be one or both of the following:

  1. Start using stackless threads as much as possible since that is easier to do with senders. This would mean the Tracy integration needs no further changes.
  2. Add support for fibers or saving/restoring the annotations on suspend/resume.

Consider adding more "official" API headers for CUDA and MPI functionality

The use of async_cuda and async_mpi functionality has so far been through pika/modules/async_{cuda,mpi}.hpp. Since we don't consider pika/modules headers to be public API headers, we should probably add something more official to access that functionality (see the sketch after the list below).

Possible options:

  • pika/execution/{cuda,hip,gpu,mpi,communication}.hpp
  • pika/{cuda,hip,gpu,mpi,communication}.hpp
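
For illustration, assuming the second option were chosen, usage would change roughly like this (hypothetical; no decision has been made):

```cpp
#include <pika/modules/async_cuda.hpp>    // current; pika/modules is not public API
#include <pika/cuda.hpp>                  // proposed official header
```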

Investigate `small_vector` performance issue

pika::detail::small_vector seems to be significantly slower than boost::container::small_vector. It's unclear whether it's "just" a bug in the implementation or something more inherent in the implementation's use of standard library features.

We should:

  • find a performance test that reproduces the regression (most likely something involving future::then, since small_vector is used for storing continuations)
  • profile and debug to find out whether pika::detail::small_vector is fixable

If we can't find a suitable regression test within pika, the following DLA-Future run shows a clear performance drop (on the Piz Daint mc partition): srun -n4 -c36 miniapp/miniapp_triangular_solver --m 20480 --n 20480 --mb 128 --nb 128 --grid-rows 2 --grid-cols 2 --nruns 5 --pika:use-process-mask. Performance is roughly 1150 GFlop/s with Boost's small_vector and roughly 800 GFlop/s with pika's.
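
A rough micro-benchmark sketch for isolating the difference; the pika header path and template parameters are assumptions and may need adjusting:

```cpp
#include <boost/container/small_vector.hpp>
#include <pika/datastructures/detail/small_vector.hpp>    // assumed path

#include <chrono>
#include <cstddef>
#include <cstdio>

template <typename Vector>
double time_push_back(std::size_t iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        Vector v;
        // Stay within the inline capacity, as continuation storage would.
        for (int j = 0; j < 3; ++j) { v.push_back(j); }
    }
    std::chrono::duration<double> d = std::chrono::steady_clock::now() - start;
    return d.count();
}

int main()
{
    constexpr std::size_t n = 10'000'000;
    std::printf("pika:  %f s\n", time_push_back<pika::detail::small_vector<int, 4>>(n));
    std::printf("boost: %f s\n", time_push_back<boost::container::small_vector<int, 4>>(n));
}
```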

Enable hipsolver functionality

We currently only have a macro translation layer mapping basic CUDA and cuBLAS functionality to HIP equivalents. cuSOLVER functionality is currently CUDA-only.
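
Extending the layer could look roughly like this (a hypothetical sketch; the exact hipSOLVER names and header path should be checked against the hipSOLVER documentation):

```cpp
#if defined(PIKA_HAVE_HIP)
# include <hipsolver/hipsolver.h>
# define cusolverDnHandle_t hipsolverHandle_t
# define cusolverDnCreate   hipsolverCreate
# define cusolverDnDestroy  hipsolverDestroy
#else
# include <cusolverDn.h>
#endif
```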

Test if CUDA callbacks would again be a viable replacement for polling

The event polling has been successful and turned out to perform significantly better than CUDA callbacks. However, that was tested when CUDA callbacks still required runtime registration on the CUDA thread. We should check:

  • whether plain CUDA callbacks would again be a competitive alternative to event polling in the scheduler, or
  • if the former does not work well enough, whether a separate polling thread would work well enough.

Either of these would be beneficial architecturally, since it would decouple the CUDA senders from the schedulers (see the sketch below). Related: #17.
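
A sketch of what the callback-based option could look like with the newer cudaLaunchHostFunc API; the completion hook and its wiring into a sender's operation state are assumptions:

```cpp
#include <cuda_runtime.h>

void completion_hook(void* op_state)
{
    // Runs on a CUDA-managed thread once all previously enqueued work on
    // the stream has finished. It must not call CUDA APIs or block, so it
    // should only mark the operation state complete / schedule a task.
    (void) op_state;
}

void enqueue_completion(cudaStream_t stream, void* op_state)
{
    cudaLaunchHostFunc(stream, &completion_hook, op_state);
}
```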

Use `fmt`

The internal string formatting implementation could perhaps be replaced by fmt.
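
For illustration, a call site would then look like this (a sketch only; whether fmt becomes a required dependency is the open question):

```cpp
#include <fmt/format.h>

#include <string>

std::string describe_pool(std::string const& name, int num_threads)
{
    // fmt::format would replace the hand-rolled formatting utilities.
    return fmt::format("pool {}: {} threads", name, num_threads);
}
```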

Enable sanitizers in CI

This is potentially very important for debugging. We currently only test with lsan (leak sanitizer). We should try to add tsan (thread sanitizer), msan (memory sanitizer), asan (address sanitizer), and ubsan (undefined behaviour sanitizer). Enabling them with heavy suppression files would be a good start, just to allow consumers of pika to enable sanitizers.

  • lsan
  • tsan
  • asan
  • ubsan

  • msan

msan additionally requires recompiling all dependencies with -fsanitize=memory, including the standard library: https://github.com/google/sanitizers/wiki/MemorySanitizer#using-instrumented-libraries. This is something to consider doing with the CSCS CI pipelines and spack.

Remove future functionality

This includes hpx::async/apply/dataflow/when_all. It can be done once DLA-Future has been completely ported to use senders. The cleanup itself is not complicated and only requires removing functionality; the open question is what functionality is still missing on the sender/receiver side.
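
A sketch of the sender-based equivalent of typical future code, assuming pika's std::execution-style API (exact namespaces and names may differ by version):

```cpp
#include <pika/execution.hpp>

#include <utility>

namespace ex = pika::execution::experimental;
namespace tt = pika::this_thread::experimental;

int compute(ex::thread_pool_scheduler sched)
{
    auto work = ex::schedule(sched)                   // roughly async
        | ex::then([] { return 21; })                 // roughly future::then
        | ex::then([](int i) { return 2 * i; });
    // sync_wait returns an optional<tuple<...>>; roughly future::get.
    auto [result] = tt::sync_wait(std::move(work)).value();
    return result;
}
```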

Add variant of `when_all` that accepts containers of senders and sends a container of values

dataflow recursively waits for any containers of futures (of containers). One can, for example, write dataflow(unwrapping([](vector<T>){...}), vector<future<T>>{...}). There is no equivalent sender adaptor: when_all is variadic and requires all of its arguments to be senders themselves.

This feature is used in DLA-Future. A when_all_vector(vector<sender_of<T>>) would be sufficient as a starting point.
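
A hypothetical usage sketch of the proposed adaptor; when_all_vector and unique_any_sender are used here as assumed names, not existing API:

```cpp
#include <pika/execution.hpp>

#include <utility>
#include <vector>

namespace ex = pika::execution::experimental;

void example(std::vector<ex::unique_any_sender<double>> senders)
{
    // Sends a single std::vector<double> once all inputs have completed,
    // mirroring dataflow(unwrapping(f), vector<future<T>>{...}).
    auto combined = ex::when_all_vector(std::move(senders))
        | ex::then([](std::vector<double> values) { /* use values */ });
    (void) combined;
}
```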

Use CSCS GitLab CI to replace as many Jenkins builds as possible

In practice this means replacing the non-Cray builds currently running on Jenkins. Work was started in STEllAR-GROUP/hpx-local#5.

Make use of the GitLab matrix functionality at least for release/debug builds, if not also for different compiler etc. configurations.

Also replace the HIP testing done with Jenkins with container-runner-hohgant-mi200 (as already started for DLA-Future in eth-cscs/DLA-Future#982, though that is blocked on an MPI issue; we can initially run all non-MPI tests for pika).

Add documentation for pika

We could initially refer to hpx's documentation and provide only the things that differ from it. However, pika might diverge too much from hpx in the future for that to remain workable.

To try to have something actionable here, I think we would need the following, in order of importance:

  • list of public API headers (0.5.0) (#225)
  • document thread binding behaviour from #739 (#751)
  • examples of typical use cases
  • list of public API functions/classes/variables (doxygen, and manually curated?)
  • examples of using specific APIs
  • detailed documentation per function/class/variable

Revive APEX support

With the distributed functionality removed, the actual APEX support was removed as well. APEX can be turned into a direct dependency with no special support required on the APEX side, and this is quite straightforward to implement. I have some old working code for this from HPX; it needs to be revived.

Enable MPI tests in CI

We currently don't test MPI functionality anywhere. It needs to be added to at least one CI configuration.

Make `--pika:use-process-mask` the default?

This is the common case and should possibly be the default. The open question is what the default should be when there is no process mask (i.e. all PUs are in the mask). We currently default to one worker thread per core rather than per PU. We can probably keep this behaviour even if using the process mask becomes the default. The only confusing aspect is that a process mask containing all PUs will not necessarily create as many worker threads as there are bits in the mask (though that is also the case at the moment).

Separate algorithms into a separate repository

The main open question is whether the algorithms project should rely on the pika runtime, the other way around, or neither. The default execution policies assume that a global thread pool exists, which would normally be set up by the runtime.

Move everything except public functionality to detail namespace

The public API of pika is small: sender/receiver functionality, runtime initialization, what else?

Hidden functionality can then gradually be brought into the public namespace through pika::experimental:: or directly into pika::.
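
The pattern would roughly be the following (names are illustrative):

```cpp
namespace pika {
    namespace detail {
        // Implementation lives here first; not part of the public API.
        void print_thread_bindings();
    }

    namespace experimental {
        // Later, selectively re-exposed once considered stable.
        using detail::print_thread_bindings;
    }
}
```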

The only reasonable way to do this is module by module:

  • affinity (#152)
  • algorithms (#377, #411, #475)
  • allocator_support (#153)
  • assertion (#155)
  • async_base (#158)
  • async_combinators (#160)
  • async_cuda (#196)
  • async (nothing to be done)
  • async_mpi (#374)
  • command_line_handling (#216)
  • concepts (nothing to be done)
  • concurrency (#246)
  • config (#248)
  • coroutines (#257)
  • datastructures (#276)
  • debugging (#324)
  • errors (#365)
  • execution (#508)
  • execution_base (#508)
  • executors (#508)
  • filesystem (#379)
  • format (#487)
  • functional (#380)
  • futures (#525)
  • hardware (#607)
  • hashing (#631)
  • include (#632)
  • ini (#633)
  • init_runtime (#634)
  • iterator_support
  • itt_notify
  • lock_registration
  • logging
  • memory
  • mpi_base
  • pack_traversal
  • prefix (#1177)
  • preprocessor
  • program_options
  • properties (#673)
  • resource_partitioner
  • runtime_configuration
  • runtime (#826, #1091)
  • schedulers (#625)
  • string_util (#595)
  • synchronization (#483)
  • tag_invoke (#596) (#599)
  • testing (#594)
  • thread_pool_util (#509)
  • thread_pools (#462)
  • thread_support (#461)
  • threading (nothing to be done)
  • threading_base (#445)
  • threadmanager (#428)
  • timing (#209)
  • topology (#179)
  • type_support (#386, #400)
  • util (#420)
  • version (#166)

This is also a good opportunity to do general cleanup.
See also: Avoid nesting detail namespaces into the experimental namespace (#448).

Fix or disable remaining failing tests using timed suspension

The following (and possibly a few more) tests fail for various reasons after disabling timed suspensions. They need to be dealt with before the first release.

  • tests.unit.modules.synchronization.shared_mutex.shared_mutex1
  • tests.unit.modules.threading.condition_variable2
  • tests.unit.modules.threading.stop_token_cb1
  • tests.unit.modules.threading.thread
  • tests.performance.local.tls_overhead

Separate CUDA/HIP/GPU support into another repository

The exact requirements of this are not 100% clear. At a minimum this needs:

  • a generic way for the schedulers to register polling callbacks, or the polling to be done on a separate thread (a possible interface is sketched after this list)
  • the runtime/a thread pool to know whether it should wait for all events to finish
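
A hypothetical shape for the first point (nothing like this exists verbatim in pika):

```cpp
#include <functional>
#include <string>

namespace pika::experimental {
    enum class polling_status { idle, busy };

    // The GPU (or MPI) support library registers a callback that the
    // scheduler invokes between tasks. The return value tells a thread
    // pool whether events are still in flight, i.e. whether it must keep
    // polling before shutting down.
    void register_polling_callback(
        std::string const& pool_name, std::function<polling_status()> poll);
}
```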

Update performance test references

The current references seem to be somewhat too strict. Alternatively, can we slightly relax the criteria without missing performance regressions?
