Comments (9)
This is now possible with the changes from dealii/dealii#146. All we need to do is set the last argument of MPI_InitFinalize to numbers::invalid_unsigned_int.
from aspect.
Is this still a relevant question? Since it is simple to activate, I made a few runs on my laptop with a very small test model, and the results below show that replacing MPI processes with threads can indeed speed things up (on my laptop, that is). The two important questions I have about the topic are:
- Which parts of ASPECT will benefit from threads, and, more importantly, which will not? Would those parts matter if we, for example, ran only one MPI process per node on a cluster? It seems postprocessing does not benefit from threading.
- Can we push the strong-scaling limit a bit higher by using fewer MPI processes per node, since we create less MPI traffic and the individual work size per process is larger?
Clearly this needs more investigation on larger clusters, but here are the results from my laptop in Optimized mode for tests/simple_compressible.prm at a fixed CPU frequency (no turbo):
1 process / 8 threads:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.19s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.00918s | 4.8% |
| Assemble temperature system | 2 | 0.0277s | 15% |
| Build Stokes preconditioner | 1 | 0.0147s | 7.7% |
| Build temperature preconditioner| 2 | 0.00322s | 1.7% |
| Solve Stokes system | 2 | 0.0195s | 10% |
| Solve temperature system | 2 | 0.00166s | 0.87% |
| Initialization | 1 | 0.04s | 21% |
| Postprocessing | 2 | 0.0276s | 15% |
| Setup dof systems | 1 | 0.00688s | 3.6% |
| Setup initial conditions | 1 | 0.0138s | 7.3% |
+---------------------------------+-----------+------------+------------+
2 processes / 4 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.179s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.00925s | 5.2% |
| Assemble temperature system | 2 | 0.0207s | 12% |
| Build Stokes preconditioner | 1 | 0.0184s | 10% |
| Build temperature preconditioner| 2 | 0.00289s | 1.6% |
| Solve Stokes system | 2 | 0.025s | 14% |
| Solve temperature system | 2 | 0.00301s | 1.7% |
| Initialization | 1 | 0.0403s | 22% |
| Postprocessing | 2 | 0.0177s | 9.9% |
| Setup dof systems | 1 | 0.00944s | 5.3% |
| Setup initial conditions | 1 | 0.0104s | 5.8% |
+---------------------------------+-----------+------------+------------+
4 processes / 2 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.18s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.0121s | 6.8% |
| Assemble temperature system | 2 | 0.0186s | 10% |
| Build Stokes preconditioner | 1 | 0.0204s | 11% |
| Build temperature preconditioner| 2 | 0.00192s | 1.1% |
| Solve Stokes system | 2 | 0.0294s | 16% |
| Solve temperature system | 2 | 0.00356s | 2% |
| Initialization | 1 | 0.0399s | 22% |
| Postprocessing | 2 | 0.0128s | 7.1% |
| Setup dof systems | 1 | 0.0104s | 5.8% |
| Setup initial conditions | 1 | 0.00935s | 5.2% |
+---------------------------------+-----------+------------+------------+
8 processes / 1 thread each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.293s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.0123s | 4.2% |
| Assemble temperature system | 2 | 0.0201s | 6.9% |
| Build Stokes preconditioner | 1 | 0.0301s | 10% |
| Build temperature preconditioner| 2 | 0.00204s | 0.7% |
| Solve Stokes system | 2 | 0.072s | 25% |
| Solve temperature system | 2 | 0.0109s | 3.7% |
| Initialization | 1 | 0.0636s | 22% |
| Postprocessing | 2 | 0.0115s | 3.9% |
| Setup dof systems | 1 | 0.0165s | 5.6% |
| Setup initial conditions | 1 | 0.0169s | 5.8% |
+---------------------------------+-----------+------------+------------+
Using threads seems superior to filling all hardware threads with MPI processes, but whether that holds on larger machines (and how large the difference to 4 processes with 1 thread each is) needs further testing.
Most of the time is typically spent in our linear solvers, which use AMG. AMG does not use multiple threads, so I don't think this is worth looking into right now. While assembly might be a bit faster using threads, I doubt this will make a difference overall.
The whole thing is worth looking at again once we use matrix-free solvers.
I agree with @tjhei -- as long as we spend the majority of the time in linear solvers, the only reasonable choice is to use more MPI processes with one thread each. That is because neither Trilinos nor PETSc uses threads in its linear solvers and preconditioners.
Out of curiosity, though: I have little confidence in measurements of individual parts of a program that runs for only 0.2 seconds overall. What happens if you refine a number of times so that the run time is, say, somewhere in the 1-5 minute range?
Hmm, I agree that the speedup is not huge, but looking at the numbers below I would say that 4/2 (4 MPI processes with 2 threads each) is about 5% faster in all assemblies than 4/1. The latter is the setup I have used so far on clusters: as many MPI processes as hardware cores, ignoring the hyperthreads. Of course 8 MPI processes would be even faster, but I usually run jobs close to the strong-scaling limit to get faster model turnaround, and then using twice as many processes for a less-than-15% speedup (or even a slowdown if I hit the limit) is just not worth it. All in all, ignoring threads and trying to make the solver more scalable with MPI seems the better approach to me.
For reference, here are the results for the simple compressible test with a refinement of 8 (2.5 * 10^6 DoFs) on a 4-core CPU with hyperthreading, at a fixed CPU frequency:
1 process / 8 threads:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 108s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 5.1s | 4.7% |
| Assemble temperature system | 1 | 11.5s | 11% |
| Build Stokes preconditioner | 1 | 7.61s | 7% |
| Build temperature preconditioner| 1 | 1.54s | 1.4% |
| Solve Stokes system | 1 | 38s | 35% |
| Solve temperature system | 1 | 0.12s | 0.11% |
| Initialization | 1 | 0.0405s | 0% |
| Postprocessing | 1 | 16.1s | 15% |
| Setup dof systems | 1 | 2.64s | 2.4% |
| Setup initial conditions | 1 | 11s | 10% |
+---------------------------------+-----------+------------+------------+
2 processes / 4 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 80.1s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.09s | 5.1% |
| Assemble temperature system | 1 | 7.62s | 9.5% |
| Build Stokes preconditioner | 1 | 5.23s | 6.5% |
| Build temperature preconditioner| 1 | 1.22s | 1.5% |
| Solve Stokes system | 1 | 35.2s | 44% |
| Solve temperature system | 1 | 0.247s | 0.31% |
| Initialization | 1 | 0.0666s | 0% |
| Postprocessing | 1 | 8.84s | 11% |
| Setup dof systems | 1 | 2.38s | 3% |
| Setup initial conditions | 1 | 6.03s | 7.5% |
+---------------------------------+-----------+------------+------------+
4 processes / 2 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 53.7s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.21s | 7.8% |
| Assemble temperature system | 1 | 5.02s | 9.3% |
| Build Stokes preconditioner | 1 | 4.38s | 8.2% |
| Build temperature preconditioner| 1 | 0.638s | 1.2% |
| Solve Stokes system | 1 | 22.7s | 42% |
| Solve temperature system | 1 | 0.167s | 0.31% |
| Initialization | 1 | 0.0405s | 0% |
| Postprocessing | 1 | 6.48s | 12% |
| Setup dof systems | 1 | 1.38s | 2.6% |
| Setup initial conditions | 1 | 3.12s | 5.8% |
+---------------------------------+-----------+------------+------------+
8 processes:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 47.5s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.1s | 8.6% |
| Assemble temperature system | 1 | 4.71s | 9.9% |
| Build Stokes preconditioner | 1 | 4.18s | 8.8% |
| Build temperature preconditioner| 1 | 0.577s | 1.2% |
| Solve Stokes system | 1 | 21.5s | 45% |
| Solve temperature system | 1 | 0.15s | 0.32% |
| Initialization | 1 | 0.0639s | 0.13% |
| Postprocessing | 1 | 3.85s | 8.1% |
| Setup dof systems | 1 | 1.31s | 2.8% |
| Setup initial conditions | 1 | 2.68s | 5.6% |
+---------------------------------+-----------+------------+------------+
And for comparison, the 1- and 4-process cases without threads:
1 process no threading:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 133s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 15.5s | 12% |
| Assemble temperature system | 1 | 16.3s | 12% |
| Build Stokes preconditioner | 1 | 15s | 11% |
| Build temperature preconditioner| 1 | 1.56s | 1.2% |
| Solve Stokes system | 1 | 38.2s | 29% |
| Solve temperature system | 1 | 0.116s | 0% |
| Initialization | 1 | 0.0375s | 0% |
| Postprocessing | 1 | 19.2s | 14% |
| Setup dof systems | 1 | 2.99s | 2.2% |
| Setup initial conditions | 1 | 9.99s | 7.5% |
+---------------------------------+-----------+------------+------------+
4 processes, no threading:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 56s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.87s | 8.7% |
| Assemble temperature system | 1 | 5.63s | 10% |
| Build Stokes preconditioner | 1 | 4.59s | 8.2% |
| Build temperature preconditioner| 1 | 0.66s | 1.2% |
| Solve Stokes system | 1 | 22.5s | 40% |
| Solve temperature system | 1 | 0.174s | 0.31% |
| Initialization | 1 | 0.0406s | 0% |
| Postprocessing | 1 | 7.52s | 13% |
| Setup dof systems | 1 | 1.44s | 2.6% |
| Setup initial conditions | 1 | 3.04s | 5.4% |
+---------------------------------+-----------+------------+------------+
So for the 4/2 and 2/4 cases, did you need to change anything at all, or is this just what ASPECT does? It does improve things a bit over the 4/1 case, so we could just make that the default. If you have 4 physical cores, each with hyperthreading, there is no reason not to use the hyperthreads if you only have 4 MPI processes.
I'm sure there will also be more things we parallelize over time in deal.II, so allowing more than one thread may be useful.
Below is the only source change I made; it correctly identified my total number of logical threads and, depending on my number of MPI processes, selected the number of threads accordingly. Should we make it an input parameter or a compile-time switch first, in case there are systems where this causes problems?
diff --git a/source/main.cc b/source/main.cc
index a0fdd79..34cdeb4 100644
--- a/source/main.cc
+++ b/source/main.cc
@@ -462,7 +462,7 @@ int main (int argc, char *argv[])
// before, so that the destructor of this instance can react if we are
// currently unwinding the stack if an unhandled exception is being
// thrown to avoid MPI deadlocks.
- Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, /*n_threads =*/ 1);
+ Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, numbers::invalid_unsigned_int);
deallog.depth_console(0);
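With that change, nothing else should need to differ on the command line; the hybrid runs above would just come from the usual mpirun invocation, with deal.II's TBB scheduler picking up the remaining hardware threads. A sketch of what I mean (the binary and input-file names are placeholders; DEAL_II_NUM_THREADS is the environment variable deal.II's MultithreadInfo honors for capping the thread count):

```shell
# With 4 MPI ranks on an 8-thread machine, the patched code would let
# TBB use the remaining hardware threads, roughly the 4/2 split above:
mpirun -np 4 ./aspect simple_compressible.prm

# The thread count per rank can still be capped explicitly via the
# environment, independent of any source change:
DEAL_II_NUM_THREADS=2 mpirun -np 4 ./aspect simple_compressible.prm
```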
Do we have good reasons not to merge your patch?
(I'll note that for casual runs to quickly check something, and probably for the test suite, we typically do not run with mpirun. It wouldn't seem wrong to make things a bit faster in those cases.)
I do not expect problems, but I can see instances where I want to test the runtime for a given number of MPI processes, and I would be annoyed if ASPECT automatically filled up all threads. (Also, what if I want to run 8 models with 1 process each in parallel? That would not work anymore.) Maybe an input parameter would be best?
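If we go the parameter route, the logic in main() could stay very small. A minimal sketch, assuming a hypothetical "-j N" command-line option (the flag name and helper are my invention, not ASPECT's actual interface); the result would then be passed as the last argument of MPI_InitFinalize, with the "use all available threads" case expressed as numbers::invalid_unsigned_int as in the diff above:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: return the number of worker threads each MPI
// process should spawn. If the user passed "-j N", honor that; otherwise
// fall back to one thread per process, i.e. today's conservative default,
// so that plain "mpirun -np 8 ./aspect model.prm" behaves as before.
unsigned int n_threads_from_args(int argc, char **argv)
{
  for (int i = 1; i + 1 < argc; ++i)
    if (std::strcmp(argv[i], "-j") == 0)
      return static_cast<unsigned int>(std::atoi(argv[i + 1]));
  return 1; // no explicit request: keep one thread per process
}
```

This keeps filling all hardware threads an opt-in choice, which also avoids the "8 independent 1-process models on one machine" problem mentioned above.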