Comments (9)

bangerth commented on June 19, 2024

This is now possible with the changes from dealii/dealii#146. All we need to do is set the last argument of MPI_InitFinalize to numbers::invalid_unsigned_int.
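A minimal sketch of what this looks like in a program's main() (it mirrors the patch quoted further down in this thread; with numbers::invalid_unsigned_int as the last argument, deal.II decides on its own how many threads each MPI process may use):

#include <deal.II/base/mpi.h>

int main (int argc, char *argv[])
{
  // Let deal.II pick the number of threads per MPI process automatically
  // instead of pinning every process to a single thread.
  dealii::Utilities::MPI::MPI_InitFinalize
    mpi_initialization (argc, argv, dealii::numbers::invalid_unsigned_int);

  // ... set up and run the rest of the program here ...
  return 0;
}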

gassmoeller commented on June 19, 2024

Is this still a relevant question? Since it is simple to activate, I made a few runs on my laptop with a very small test model, and the results below show that replacing MPI processes with threads can indeed speed things up (on my laptop, at least). The two important questions I would have about the topic are:

  1. Which parts of ASPECT will benefit from threads, and more importantly, which will not? Might those matter if we, for example, run only 1 MPI process per node on a cluster? It seems postprocessing does not benefit from threading?
  2. Can we push the strong scaling limit a bit higher by using fewer MPI processes per node, because we generate less MPI traffic and the individual work size per process is larger?

Clearly this needs more investigation on larger clusters, but here are the results from my laptop in Optimized mode for tests/simple_compressible.prm at a fixed frequency (no turbo):

1 process / 8 threads:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      0.19s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         2 |   0.00918s |       4.8% |
| Assemble temperature system     |         2 |    0.0277s |        15% |
| Build Stokes preconditioner     |         1 |    0.0147s |       7.7% |
| Build temperature preconditioner|         2 |   0.00322s |       1.7% |
| Solve Stokes system             |         2 |    0.0195s |        10% |
| Solve temperature system        |         2 |   0.00166s |      0.87% |
| Initialization                  |         1 |      0.04s |        21% |
| Postprocessing                  |         2 |    0.0276s |        15% |
| Setup dof systems               |         1 |   0.00688s |       3.6% |
| Setup initial conditions        |         1 |    0.0138s |       7.3% |
+---------------------------------+-----------+------------+------------+

2 processes / 4 threads each:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |     0.179s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         2 |   0.00925s |       5.2% |
| Assemble temperature system     |         2 |    0.0207s |        12% |
| Build Stokes preconditioner     |         1 |    0.0184s |        10% |
| Build temperature preconditioner|         2 |   0.00289s |       1.6% |
| Solve Stokes system             |         2 |     0.025s |        14% |
| Solve temperature system        |         2 |   0.00301s |       1.7% |
| Initialization                  |         1 |    0.0403s |        22% |
| Postprocessing                  |         2 |    0.0177s |       9.9% |
| Setup dof systems               |         1 |   0.00944s |       5.3% |
| Setup initial conditions        |         1 |    0.0104s |       5.8% |
+---------------------------------+-----------+------------+------------+

4 processes / 2 threads each:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      0.18s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         2 |    0.0121s |       6.8% |
| Assemble temperature system     |         2 |    0.0186s |        10% |
| Build Stokes preconditioner     |         1 |    0.0204s |        11% |
| Build temperature preconditioner|         2 |   0.00192s |       1.1% |
| Solve Stokes system             |         2 |    0.0294s |        16% |
| Solve temperature system        |         2 |   0.00356s |         2% |
| Initialization                  |         1 |    0.0399s |        22% |
| Postprocessing                  |         2 |    0.0128s |       7.1% |
| Setup dof systems               |         1 |    0.0104s |       5.8% |
| Setup initial conditions        |         1 |   0.00935s |       5.2% |
+---------------------------------+-----------+------------+------------+

8 processes / 1 thread each:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |     0.293s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         2 |    0.0123s |       4.2% |
| Assemble temperature system     |         2 |    0.0201s |       6.9% |
| Build Stokes preconditioner     |         1 |    0.0301s |        10% |
| Build temperature preconditioner|         2 |   0.00204s |       0.7% |
| Solve Stokes system             |         2 |     0.072s |        25% |
| Solve temperature system        |         2 |    0.0109s |       3.7% |
| Initialization                  |         1 |    0.0636s |        22% |
| Postprocessing                  |         2 |    0.0115s |       3.9% |
| Setup dof systems               |         1 |    0.0165s |       5.6% |
| Setup initial conditions        |         1 |    0.0169s |       5.8% |
+---------------------------------+-----------+------------+------------+

Using threads seems superior to filling all hardware threads with MPI processes, but whether that holds on larger machines (and how large the difference to 4 processes / 1 thread each is) needs further testing.

tjhei commented on June 19, 2024

Most of the time is typically spent in our linear solvers, which use AMG. AMG does not use multiple threads, so I don't think this is worth looking into right now. While assembly might be a bit faster using threads, I doubt this will make much of a difference.
The whole thing is worth looking at again when we use matrix-free solvers.

bangerth commented on June 19, 2024

I agree with @tjhei -- as long as we spend the majority of time in linear solvers, the only reasonable choice is to use more MPI processes with one thread each. That's because neither Trilinos nor PETSc use threads in their linear solvers and preconditioners.

Out of curiosity, though: I have little confidence in measurements of individual parts of a program that only runs 0.2 seconds overall. What happens if you refine a number of times so that the run-time is, say, somewhere in the 1-5 minute range?

gassmoeller commented on June 19, 2024

Hmm, I agree that the speedup is not huge, but looking at the numbers below I would say that choosing 4/2 (4 MPI processes with 2 threads each) is about 5% faster in all assemblies than 4/1 (4 processes without threading). Using as many MPI processes as there are hardware cores and ignoring the hyperthreads is the setup I have used on clusters so far. Of course 8 MPI processes would be even better, but I usually run jobs close to the strong scaling limit to get faster model turnaround times, and then using twice as many processes to get a speedup of less than 15% (or even a runtime increase if I hit the limit) is just not worth it. All in all, ignoring threads and trying to make the solver more scalable with MPI seems the better approach to me.

For reference, here are the results for the simple compressible test with a refinement of 8 (2.5 * 10^6 DoFs), run on a 4-core CPU with hyperthreading at a fixed CPU frequency:

1 process / 8 threads:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |       108s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |       5.1s |       4.7% |
| Assemble temperature system     |         1 |      11.5s |        11% |
| Build Stokes preconditioner     |         1 |      7.61s |         7% |
| Build temperature preconditioner|         1 |      1.54s |       1.4% |
| Solve Stokes system             |         1 |        38s |        35% |
| Solve temperature system        |         1 |      0.12s |      0.11% |
| Initialization                  |         1 |    0.0405s |         0% |
| Postprocessing                  |         1 |      16.1s |        15% |
| Setup dof systems               |         1 |      2.64s |       2.4% |
| Setup initial conditions        |         1 |        11s |        10% |
+---------------------------------+-----------+------------+------------+

2 processes / 4 threads each:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      80.1s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |      4.09s |       5.1% |
| Assemble temperature system     |         1 |      7.62s |       9.5% |
| Build Stokes preconditioner     |         1 |      5.23s |       6.5% |
| Build temperature preconditioner|         1 |      1.22s |       1.5% |
| Solve Stokes system             |         1 |      35.2s |        44% |
| Solve temperature system        |         1 |     0.247s |      0.31% |
| Initialization                  |         1 |    0.0666s |         0% |
| Postprocessing                  |         1 |      8.84s |        11% |
| Setup dof systems               |         1 |      2.38s |         3% |
| Setup initial conditions        |         1 |      6.03s |       7.5% |
+---------------------------------+-----------+------------+------------+

4 processes / 2 threads each:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      53.7s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |      4.21s |       7.8% |
| Assemble temperature system     |         1 |      5.02s |       9.3% |
| Build Stokes preconditioner     |         1 |      4.38s |       8.2% |
| Build temperature preconditioner|         1 |     0.638s |       1.2% |
| Solve Stokes system             |         1 |      22.7s |        42% |
| Solve temperature system        |         1 |     0.167s |      0.31% |
| Initialization                  |         1 |    0.0405s |         0% |
| Postprocessing                  |         1 |      6.48s |        12% |
| Setup dof systems               |         1 |      1.38s |       2.6% |
| Setup initial conditions        |         1 |      3.12s |       5.8% |
+---------------------------------+-----------+------------+------------+

8 processes:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |      47.5s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |       4.1s |       8.6% |
| Assemble temperature system     |         1 |      4.71s |       9.9% |
| Build Stokes preconditioner     |         1 |      4.18s |       8.8% |
| Build temperature preconditioner|         1 |     0.577s |       1.2% |
| Solve Stokes system             |         1 |      21.5s |        45% |
| Solve temperature system        |         1 |      0.15s |      0.32% |
| Initialization                  |         1 |    0.0639s |      0.13% |
| Postprocessing                  |         1 |      3.85s |       8.1% |
| Setup dof systems               |         1 |      1.31s |       2.8% |
| Setup initial conditions        |         1 |      2.68s |       5.6% |
+---------------------------------+-----------+------------+------------+

And for comparison, the 1- and 4-process cases without threading:

1 process no threading:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |       133s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |      15.5s |        12% |
| Assemble temperature system     |         1 |      16.3s |        12% |
| Build Stokes preconditioner     |         1 |        15s |        11% |
| Build temperature preconditioner|         1 |      1.56s |       1.2% |
| Solve Stokes system             |         1 |      38.2s |        29% |
| Solve temperature system        |         1 |     0.116s |         0% |
| Initialization                  |         1 |    0.0375s |         0% |
| Postprocessing                  |         1 |      19.2s |        14% |
| Setup dof systems               |         1 |      2.99s |       2.2% |
| Setup initial conditions        |         1 |      9.99s |       7.5% |
+---------------------------------+-----------+------------+------------+

4 processes, no threading:

+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start    |        56s |            |
|                                             |            |            |
| Section                         | no. calls |  wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system          |         1 |      4.87s |       8.7% |
| Assemble temperature system     |         1 |      5.63s |        10% |
| Build Stokes preconditioner     |         1 |      4.59s |       8.2% |
| Build temperature preconditioner|         1 |      0.66s |       1.2% |
| Solve Stokes system             |         1 |      22.5s |        40% |
| Solve temperature system        |         1 |     0.174s |      0.31% |
| Initialization                  |         1 |    0.0406s |         0% |
| Postprocessing                  |         1 |      7.52s |        13% |
| Setup dof systems               |         1 |      1.44s |       2.6% |
| Setup initial conditions        |         1 |      3.04s |       5.4% |
+---------------------------------+-----------+------------+------------+

bangerth commented on June 19, 2024

So for the 4/2 and 2/4 cases, did you need to change anything at all, or is this just what ASPECT does? I mean it does improve things a bit over the 4/1 case, so we could just make that the default. If you have 4 physical cores each with hyperthreading, there is no reason not to use the hyperthreading if you only have 4 MPI processes.

I'm sure there will also be more things we parallelize over time in deal.II, so allowing more than one thread may be useful.

gassmoeller commented on June 19, 2024

Below is the only source change I made. It correctly identified the total number of logical threads on my machine and, depending on the number of MPI processes, selected the number of threads per process correctly. Should we make this an input parameter or a compile-time switch first, in case there are systems where this causes problems?

diff --git a/source/main.cc b/source/main.cc
index a0fdd79..34cdeb4 100644
--- a/source/main.cc
+++ b/source/main.cc
@@ -462,7 +462,7 @@ int main (int argc, char *argv[])
       // before, so that the destructor of this instance can react if we are
       // currently unwinding the stack if an unhandled exception is being
       // thrown to avoid MPI deadlocks.
-      Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, /*n_threads =*/ 1);
+      Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, numbers::invalid_unsigned_int);
 
       deallog.depth_console(0);

bangerth commented on June 19, 2024

Do we have good reasons not to merge your patch?

(I'll note that for casual runs to just check something real quick, and probably for the testsuite, we don't typically run with mpirun. It wouldn't seem wrong to make things a bit faster in those cases.)

gassmoeller commented on June 19, 2024

I do not expect problems, but I can see instances where I want to test the runtime for a given number of MPI processes, and I would be annoyed if ASPECT automatically filled up all hardware threads (also, what if I want to run 8 models with 1 process each in parallel? That would not work anymore). Maybe an input parameter would be best?
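For what it's worth, here is a rough sketch of what such a switch could look like (the environment variable name ASPECT_N_THREADS and the way it is read are purely hypothetical, not existing ASPECT options; the deal.II call that would enforce the cap is MultithreadInfo::set_thread_limit):

#include <deal.II/base/mpi.h>
#include <deal.II/base/multithread_info.h>

#include <cstdlib>

int main (int argc, char *argv[])
{
  // Start out letting deal.II choose the thread count automatically.
  dealii::Utilities::MPI::MPI_InitFinalize
    mpi_initialization (argc, argv, dealii::numbers::invalid_unsigned_int);

  // Hypothetical override: if the user requests a specific number of
  // threads per process, cap the task scheduler accordingly.
  if (const char *n_threads = std::getenv ("ASPECT_N_THREADS"))
    dealii::MultithreadInfo::set_thread_limit (std::atoi (n_threads));

  // ... continue with the usual setup ...
  return 0;
}

If I remember correctly, deal.II itself already honors a DEAL_II_NUM_THREADS environment variable for the same purpose, so an ASPECT-level input parameter would mainly be a convenience on top of that.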
