Comments (9)
This is now possible with the changes from dealii/dealii#146. All we need to do is set the last argument of MPI_InitFinalize to numbers::invalid_unsigned_int.
from aspect.
Is this still a relevant question? Since it is simple to activate, I made a few runs on my laptop with a very small test model, and the results below show that replacing MPI processes with threads can indeed speed things up (on my laptop, that is). The two important questions I have about the topic are:
- Which parts of ASPECT will benefit from threads, and, more importantly, which will not? Would those parts matter if we, for example, ran only one MPI process per node on a cluster? It seems postprocessing does not benefit from threading.
- Can we push the strong-scaling limit a bit higher by using fewer MPI processes per node, since we create less MPI traffic and the individual work size per process is larger?
Clearly this needs more investigation on larger clusters, but here are the results from my laptop in Optimized mode for tests/simple_compressible.prm at a fixed CPU frequency (no turbo):
1 process / 8 threads:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.19s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.00918s | 4.8% |
| Assemble temperature system | 2 | 0.0277s | 15% |
| Build Stokes preconditioner | 1 | 0.0147s | 7.7% |
| Build temperature preconditioner| 2 | 0.00322s | 1.7% |
| Solve Stokes system | 2 | 0.0195s | 10% |
| Solve temperature system | 2 | 0.00166s | 0.87% |
| Initialization | 1 | 0.04s | 21% |
| Postprocessing | 2 | 0.0276s | 15% |
| Setup dof systems | 1 | 0.00688s | 3.6% |
| Setup initial conditions | 1 | 0.0138s | 7.3% |
+---------------------------------+-----------+------------+------------+
2 processes / 4 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.179s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.00925s | 5.2% |
| Assemble temperature system | 2 | 0.0207s | 12% |
| Build Stokes preconditioner | 1 | 0.0184s | 10% |
| Build temperature preconditioner| 2 | 0.00289s | 1.6% |
| Solve Stokes system | 2 | 0.025s | 14% |
| Solve temperature system | 2 | 0.00301s | 1.7% |
| Initialization | 1 | 0.0403s | 22% |
| Postprocessing | 2 | 0.0177s | 9.9% |
| Setup dof systems | 1 | 0.00944s | 5.3% |
| Setup initial conditions | 1 | 0.0104s | 5.8% |
+---------------------------------+-----------+------------+------------+
4 processes / 2 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.18s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.0121s | 6.8% |
| Assemble temperature system | 2 | 0.0186s | 10% |
| Build Stokes preconditioner | 1 | 0.0204s | 11% |
| Build temperature preconditioner| 2 | 0.00192s | 1.1% |
| Solve Stokes system | 2 | 0.0294s | 16% |
| Solve temperature system | 2 | 0.00356s | 2% |
| Initialization | 1 | 0.0399s | 22% |
| Postprocessing | 2 | 0.0128s | 7.1% |
| Setup dof systems | 1 | 0.0104s | 5.8% |
| Setup initial conditions | 1 | 0.00935s | 5.2% |
+---------------------------------+-----------+------------+------------+
8 processes / 1 thread each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.293s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 2 | 0.0123s | 4.2% |
| Assemble temperature system | 2 | 0.0201s | 6.9% |
| Build Stokes preconditioner | 1 | 0.0301s | 10% |
| Build temperature preconditioner| 2 | 0.00204s | 0.7% |
| Solve Stokes system | 2 | 0.072s | 25% |
| Solve temperature system | 2 | 0.0109s | 3.7% |
| Initialization | 1 | 0.0636s | 22% |
| Postprocessing | 2 | 0.0115s | 3.9% |
| Setup dof systems | 1 | 0.0165s | 5.6% |
| Setup initial conditions | 1 | 0.0169s | 5.8% |
+---------------------------------+-----------+------------+------------+
Using threads seems superior to filling all hardware threads with MPI processes, but whether that holds on larger machines (and how large the difference to 4 processes with 1 thread each is) needs further testing.
Most of the time is typically spent in our linear solvers, which use AMG. AMG does not use multiple threads, so I don't think this is worth looking into right now. While assembly might be a bit faster using threads, I doubt this will make a difference overall.
The whole thing is worth looking at again once we use matrix-free solvers.
I agree with @tjhei -- as long as we spend the majority of the time in linear solvers, the only reasonable choice is to use more MPI processes with one thread each. That is because neither Trilinos nor PETSc uses threads in its linear solvers and preconditioners.
Out of curiosity, though: I have little confidence in measurements of individual parts of a program that runs for only 0.2 seconds overall. What happens if you refine a number of times so that the run time is, say, somewhere in the 1-5 minute range?
Hmm, I agree that the speedup is not huge, but looking at the numbers below I would say that 4/2 (4 MPI processes with 2 threads each) is about 5% faster in all assemblies than 4/1. The latter is the setup I have used so far on clusters: as many MPI processes as hardware cores, ignoring the hyperthreads. Of course 8 MPI processes would be even faster, but I usually run jobs close to the strong-scaling limit to get faster model turnaround, and then using twice as many processes for a less-than-15% speedup (or even a slowdown if I hit the limit) is just not worth it. All in all, ignoring threads and trying to make the solver more scalable with MPI seems the better approach to me.
For reference, here are the results for the simple compressible test with a refinement of 8 (2.5 * 10^6 DoFs) on a 4-core CPU with hyperthreading, at a fixed CPU frequency:
1 process / 8 threads:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 108s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 5.1s | 4.7% |
| Assemble temperature system | 1 | 11.5s | 11% |
| Build Stokes preconditioner | 1 | 7.61s | 7% |
| Build temperature preconditioner| 1 | 1.54s | 1.4% |
| Solve Stokes system | 1 | 38s | 35% |
| Solve temperature system | 1 | 0.12s | 0.11% |
| Initialization | 1 | 0.0405s | 0% |
| Postprocessing | 1 | 16.1s | 15% |
| Setup dof systems | 1 | 2.64s | 2.4% |
| Setup initial conditions | 1 | 11s | 10% |
+---------------------------------+-----------+------------+------------+
2 processes / 4 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 80.1s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.09s | 5.1% |
| Assemble temperature system | 1 | 7.62s | 9.5% |
| Build Stokes preconditioner | 1 | 5.23s | 6.5% |
| Build temperature preconditioner| 1 | 1.22s | 1.5% |
| Solve Stokes system | 1 | 35.2s | 44% |
| Solve temperature system | 1 | 0.247s | 0.31% |
| Initialization | 1 | 0.0666s | 0% |
| Postprocessing | 1 | 8.84s | 11% |
| Setup dof systems | 1 | 2.38s | 3% |
| Setup initial conditions | 1 | 6.03s | 7.5% |
+---------------------------------+-----------+------------+------------+
4 processes / 2 threads each:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 53.7s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.21s | 7.8% |
| Assemble temperature system | 1 | 5.02s | 9.3% |
| Build Stokes preconditioner | 1 | 4.38s | 8.2% |
| Build temperature preconditioner| 1 | 0.638s | 1.2% |
| Solve Stokes system | 1 | 22.7s | 42% |
| Solve temperature system | 1 | 0.167s | 0.31% |
| Initialization | 1 | 0.0405s | 0% |
| Postprocessing | 1 | 6.48s | 12% |
| Setup dof systems | 1 | 1.38s | 2.6% |
| Setup initial conditions | 1 | 3.12s | 5.8% |
+---------------------------------+-----------+------------+------------+
8 processes:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 47.5s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.1s | 8.6% |
| Assemble temperature system | 1 | 4.71s | 9.9% |
| Build Stokes preconditioner | 1 | 4.18s | 8.8% |
| Build temperature preconditioner| 1 | 0.577s | 1.2% |
| Solve Stokes system | 1 | 21.5s | 45% |
| Solve temperature system | 1 | 0.15s | 0.32% |
| Initialization | 1 | 0.0639s | 0.13% |
| Postprocessing | 1 | 3.85s | 8.1% |
| Setup dof systems | 1 | 1.31s | 2.8% |
| Setup initial conditions | 1 | 2.68s | 5.6% |
+---------------------------------+-----------+------------+------------+
And for comparison, the 1- and 4-process cases without threads:
1 process no threading:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 133s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 15.5s | 12% |
| Assemble temperature system | 1 | 16.3s | 12% |
| Build Stokes preconditioner | 1 | 15s | 11% |
| Build temperature preconditioner| 1 | 1.56s | 1.2% |
| Solve Stokes system | 1 | 38.2s | 29% |
| Solve temperature system | 1 | 0.116s | 0% |
| Initialization | 1 | 0.0375s | 0% |
| Postprocessing | 1 | 19.2s | 14% |
| Setup dof systems | 1 | 2.99s | 2.2% |
| Setup initial conditions | 1 | 9.99s | 7.5% |
+---------------------------------+-----------+------------+------------+
4 processes, no threading:
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 56s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| Assemble Stokes system | 1 | 4.87s | 8.7% |
| Assemble temperature system | 1 | 5.63s | 10% |
| Build Stokes preconditioner | 1 | 4.59s | 8.2% |
| Build temperature preconditioner| 1 | 0.66s | 1.2% |
| Solve Stokes system | 1 | 22.5s | 40% |
| Solve temperature system | 1 | 0.174s | 0.31% |
| Initialization | 1 | 0.0406s | 0% |
| Postprocessing | 1 | 7.52s | 13% |
| Setup dof systems | 1 | 1.44s | 2.6% |
| Setup initial conditions | 1 | 3.04s | 5.4% |
+---------------------------------+-----------+------------+------------+
So for the 4/2 and 2/4 cases, did you need to change anything at all, or is this just what ASPECT does? It does improve things a bit over the 4/1 case, so we could just make that the default. If you have 4 physical cores, each with hyperthreading, there is no reason not to use the hyperthreads if you only have 4 MPI processes.
I'm sure there will also be more things we parallelize over time in deal.II, so allowing more than one thread may be useful.
Below is the only source change I made; it correctly identified my total number of logical threads and, depending on my number of MPI processes, selected the number of threads accordingly. Should we make it an input parameter or a compile-time switch first, in case there are systems where this causes problems?
diff --git a/source/main.cc b/source/main.cc
index a0fdd79..34cdeb4 100644
--- a/source/main.cc
+++ b/source/main.cc
@@ -462,7 +462,7 @@ int main (int argc, char *argv[])
// before, so that the destructor of this instance can react if we are
// currently unwinding the stack if an unhandled exception is being
// thrown to avoid MPI deadlocks.
- Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, /*n_threads =*/ 1);
+ Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, numbers::invalid_unsigned_int);
deallog.depth_console(0);
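With that change, nothing else should need to differ on the command line; the hybrid runs above would just come from the usual mpirun invocation, with deal.II's TBB scheduler picking up the remaining hardware threads. A sketch of what I mean (the binary and input-file names are placeholders; DEAL_II_NUM_THREADS is the environment variable deal.II's MultithreadInfo honors for capping the thread count):

```shell
# With 4 MPI ranks on an 8-thread machine, the patched code would let
# TBB use the remaining hardware threads, roughly the 4/2 split above:
mpirun -np 4 ./aspect simple_compressible.prm

# The thread count per rank can still be capped explicitly via the
# environment, independent of any source change:
DEAL_II_NUM_THREADS=2 mpirun -np 4 ./aspect simple_compressible.prm
```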
Do we have good reasons not to merge your patch?
(I'll note that for casual runs to quickly check something, and probably for the test suite, we typically do not run with mpirun. It wouldn't seem wrong to make things a bit faster in those cases.)
I do not expect problems, but I can see instances where I want to test the runtime for a given number of MPI processes, and I would be annoyed if ASPECT automatically filled up all threads. (Also, what if I want to run 8 models with 1 process each in parallel? That would not work anymore.) Maybe an input parameter would be best?
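If we go the parameter route, the logic in main() could stay very small. A minimal sketch, assuming a hypothetical "-j N" command-line option (the flag name and helper are my invention, not ASPECT's actual interface); the result would then be passed as the last argument of MPI_InitFinalize, with the "use all available threads" case expressed as numbers::invalid_unsigned_int as in the diff above:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: return the number of worker threads each MPI
// process should spawn. If the user passed "-j N", honor that; otherwise
// fall back to one thread per process, i.e. today's conservative default,
// so that plain "mpirun -np 8 ./aspect model.prm" behaves as before.
unsigned int n_threads_from_args(int argc, char **argv)
{
  for (int i = 1; i + 1 < argc; ++i)
    if (std::strcmp(argv[i], "-j") == 0)
      return static_cast<unsigned int>(std::atoi(argv[i + 1]));
  return 1; // no explicit request: keep one thread per process
}
```

This keeps filling all hardware threads an opt-in choice, which also avoids the "8 independent 1-process models on one machine" problem mentioned above.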