Task #2069

Simple thread-parallelism inside routines

Added by Erik Lindahl about 3 years ago. Updated about 3 years ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
-
Target version:
-
Difficulty:
uncategorized

Description

GROMACS uses multithreading on several different levels:

  • Inside mdrun we are already using OpenMP and will gradually move to tasks.
  • For tools that are inherently parallel, it works better to let each core handle one frame/unit.
  • For ensemble-level parallelism we will either parallelize over a single node in mdrun, or use as many jobs as there are cores.

However, in addition to this there are a number of operations where we might like to use all cores in a node for better performance - grompp is one example. Here I do not think it makes sense to use a complex scheduler or other advanced approaches; we would simply like a few simple ways of parallelizing over multiple threads.

My current gut feeling is that we can simply use C++11 threads directly, and have some routines where we have optional arguments to specify that we would like to use N threads (or all available cores) instead of just a single one.

History

#1 Updated by Mark Abraham about 3 years ago

Yes, I'm very happy to use std::thread for this kind of thing. We are not interested in the last few percentage points of performance here. Things like JIT compilation of OpenCL kernels would also parallelise usefully in such ways.

It would be useful to see if someone has written infrastructure on top of std::thread that provides an object storing e.g. user command-line arguments that set the expected thread usage, which we could then pass to the code that uses the threads. But typically we expect users to leave that alone, and by default we would parallelise such work over all available cores. Some of our analysis tools already have OpenMP support that works along similar lines.

#2 Updated by Teemu Murtola about 3 years ago

I'm not sure that plain C++11 threads are very suitable for parallelization within a single routine, unless it does a lot of relatively independent work in big units. C++11 mainly provides very low-level functionality that will create a native thread for each use. In principle, some of the functionality could be provided through thread pools, but, according to the C++ Core Guidelines, that is not what most library implementations do. So the overhead of creating and joining a real thread every time probably means that this is not suitable for fine-grained parallelization.

#3 Updated by Mark Abraham about 3 years ago

Teemu Murtola wrote:

I'm not sure that plain C++11 threads are very suitable for parallelization within a single routine, unless it does a lot of relatively independent work in big units. C++11 mainly provides very low-level functionality that will create a native thread for each use. In principle, some of the functionality could be provided through thread pools, but, according to the C++ Core Guidelines, that is not what most library implementations do. So the overhead of creating and joining a real thread every time probably means that this is not suitable for fine-grained parallelization.

Indeed, std::thread is useful for coarse things. My example of OpenCL JIT is exactly that - making one-time startup stages work better, for minimal complexity. (This caters to a possible future where you might have some accelerators attached to a very weak n-core ARM CPU; a 10-second startup time can become 2 seconds, and we don't care about 0.01 s of thread-creation overhead.)

#4 Updated by Szilárd Páll about 3 years ago

I second Teemu's point: to the best of my knowledge C++11 threads are not intended for fine-grained tasks -- though for doing concurrent JIT compilation even the worst OpenMP tasking implementation will suffice. That means std::thread does not seem well-suited for intra-routine multi-threading of compute tasks (like those in grompp) without risking unnecessary bottlenecks. OpenMP is far superior for that, including its memory management, thread pool, etc.

#5 Updated by Roland Schulz about 3 years ago

STS provides simple loop-level parallelism the same way as OpenMP, without any explicit schedule (we call it the default schedule). Thus, if we can express the required parallelism in the analysis tools with simple loop-level parallelism, one could use STS for that. It would have the advantage of exception support compared to OpenMP. And we would have no differences between mdrun and the analysis tools, and thus would avoid the complexities of duplicated mechanisms (CMake, shared functions, ...).
