Task #1829

Future of thread level parallelism

Added by Roland Schulz about 4 years ago. Updated about 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Difficulty:
uncategorized

Description

We need to add a framework for task-level parallelism, because our current loop-level parallelism within a single domain is not sufficient. We might use TBB for it.

overlap.png (65.4 KB) overlap.png Roland Schulz, 02/23/2016 06:16 PM
tasbench_gomp52.tar.gz (1.86 KB) tasbench_gomp52.tar.gz Szilárd Páll, 03/08/2016 09:01 PM
tasbench_iomp16.tar.gz (1.53 KB) tasbench_iomp16.tar.gz Szilárd Páll, 03/08/2016 09:01 PM

Related issues

Related to GROMACS - Task #1828: Exception handling in mdrun (New)
Related to GROMACS - Task #1588: Future of single-node parallel coding (Closed)
Related to GROMACS - Task #1411: Future of thread_mpi (New)

History

#1 Updated by Roland Schulz about 4 years ago

What are the remaining TODOs to decide whether TBB is a good choice?

What are the possible alternatives?

#2 Updated by Roland Schulz about 4 years ago

  • Related to Task #1828: Exception handling in mdrun added

#3 Updated by Berk Hess about 4 years ago

After an HPC workshop in the US, Szilard came away with strong reservations against TBB. It seems that, at least in its standard form, it fully relies on overdecomposition of tasks to run efficiently. We will probably have to write our own scheduler to run a minimal number of tasks and to pin threads/tasks to cores.
We want to contact Rio Yokota, who uses TBB in his ExaFMM code that we're currently integrating into Gromacs, to see what his experiences are.

Currently I think we have not found any alternatives, except for writing something ourselves from scratch, or hoping that OpenMP will become more HPC- and low-latency-focused.

#4 Updated by John Eblen about 4 years ago

Relevant Gerrit:

https://gerrit.gromacs.org/#/c/4914/

Pinning threads to cores, at least, is not difficult. The above change pins TBB threads according to the Gromacs settings.

TBB is also easy to use. The above code implements a flow graph that contains both the bonded and nonbonded kernels.

By design, however, TBB does not allow fine-grained control over threads, preferring that work be broken into tasks. This may be its Achilles' Heel for Gromacs.
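
For illustration, a minimal TBB flow graph with two independent kernel nodes could look like the sketch below. This is only a schematic, not the code from the Gerrit change; do_bonded() and do_nonbonded() are placeholder functions.

// Minimal TBB flow-graph sketch: two independent kernels as graph nodes.
// do_bonded() and do_nonbonded() are placeholders, not GROMACS functions.
#include <tbb/flow_graph.h>

static void do_bonded()    { /* bonded force kernel would run here */ }
static void do_nonbonded() { /* nonbonded force kernel would run here */ }

int main() {
    using namespace tbb::flow;
    graph g;
    broadcast_node<continue_msg> start(g);
    continue_node<continue_msg> bonded(g, [](const continue_msg &) {
        do_bonded();
        return continue_msg();
    });
    continue_node<continue_msg> nonbonded(g, [](const continue_msg &) {
        do_nonbonded();
        return continue_msg();
    });
    make_edge(start, bonded);     // both nodes depend only on the start signal,
    make_edge(start, nonbonded);  // so TBB may run them concurrently
    start.try_put(continue_msg());
    g.wait_for_all();
    return 0;
}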

#5 Updated by Roland Schulz about 4 years ago

What version of OpenMP could we require for the 2016 release? We could only require a version that is supported by all our tier 1 compilers, right? So since MSVC only supports OpenMP 2.0, we couldn't use any features from newer versions?

#6 Updated by John Eblen about 4 years ago

A few questions to clarify what we want to do:

1) Why is loop-level parallelism not sufficient? What problem are we trying to solve?

2) Are we talking about redesigning Gromacs to use task-level parallelism or just adding it in a few places?

3) If the former, do we have a design? If the latter, what places?

4) For 3, why would TBB be better than OpenMP? (This relates to Roland's question in comment 5 as to what features we would be able to use.)

#7 Updated by Szilárd Páll about 4 years ago

John Eblen wrote:

A few questions to clarify what we want to do:

Good questions. These are topics we could (and should) discuss for an entire dev meeting. We've had related discussions internally for quite some time. I will try to distill the status quo in a brief summary, and I suggest we schedule this as a high-prio topic for a next dev meeting.

1) Why is loop-level parallelism not sufficient? What problem are we trying to solve?

In one word: tasking.

In more words:
  • Overlap of highly (data-)parallel tasks with poorly/non-parallel ones.
  • Variable width tasks running concurrently within a wide rank.
  • 1/#NUMA nodes ranks per physical node.
  • Task preemption/interruption to speed up critical path (e.g. switch to non-local force compute as soon as data arrives).

2) Are we talking about redesigning Gromacs to use task-level parallelism or just adding it in a few places?

Yes and yes: the former in the long run, the latter in the short-to-mid term, I hope (but I don't think we know the path to get to fully task-parallel execution).

3) If the former, do we have a design? If the latter, what places?

Sort of. We know what we surely want and a few things that we'll most likely need. I suggest taking up this topic in a videoconf.

4) For 3, why would TBB be better than OpenMP? (This relates to Roland's question in comment 5 as to what features we would be able to use.)

See above.

#8 Updated by Roland Schulz about 4 years ago

  • Related to Task #1588: Future of single-node parallel coding added

#9 Updated by Roland Schulz about 4 years ago

  • Related to Task #1411: Future of thread_mpi added

#10 Updated by Roland Schulz about 4 years ago

I think there is a good reason that we use static scheduling instead of OpenMP's dynamic scheduling for our current loop-level parallelism. The overhead of dynamic scheduling is large for many of our loops, which have little work. And even for the nonbonded we failed to keep the dynamic-scheduling overhead small enough to make it useful for dynamic load balancing. I think any tasking framework with many small tasks and a scheduler distributing small tasks onto cores (for load balancing and communication overlap) will have the same problem. I think we might do better with a statically scheduled task framework.

I think this could work because consecutive MD steps are usually very similar. We would measure how long a (macro) task takes (e.g. all of bonded) and how long the communication we want to overlap takes. Tasks would be started the same way as in dynamically scheduled frameworks (e.g. TBB), just with an ID to identify the communication step for the time measurement. Let's assume we want to overlap any force calculation that does not depend on communication (e.g. bonded) with each of the 6 communication steps of PME. Based on the time measurement of the bonded calculation from the previous step and the time measurement of the communication, we could execute the correct amount of bonded calculation without any dynamic scheduling and without breaking the bonded calculation into many small steps.

This would be particularly easy if we decide to implement the critical path as standard non-tasked serial code (with loop-level parallelism). Then each comm step would have a sequential ID and all tasks could be built prior to the execution without any significant synchronization issues. If we don't want to encode the critical path in the code but want to be able to define it at runtime, we would need somewhat more sophisticated storage and preparation of the tasks.

TL;DR: Instead of using many small tasks and dynamic scheduling, we could create tasks of the proper size based on measurements of the previous step(s).
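
A minimal sketch of this idea, with all names and numbers invented for illustration (this is not GROMACS code): size the chunk of bonded work to hide one communication step, using the rates measured in the previous MD step.

// Illustrative sketch of statically sizing an overlap task from the previous
// step's measurements; all names and numbers are invented for illustration.
#include <algorithm>
#include <cstdio>

int main() {
    // Measured in the previous MD step:
    double commTime          = 40.0e-6;  // seconds spent in one PME comm step
    double bondedTimePerUnit = 0.5e-6;   // seconds per unit of bonded work
    int    bondedUnitsLeft   = 200;      // bonded work not yet scheduled

    // Schedule just enough bonded work to hide this communication step,
    // without dynamic scheduling and without splitting it into tiny tasks.
    int unitsToOverlap = static_cast<int>(commTime / bondedTimePerUnit);
    unitsToOverlap     = std::min(unitsToOverlap, bondedUnitsLeft);

    std::printf("overlap %d bonded units with this comm step\n", unitsToOverlap);
    // ... launch one task of size unitsToOverlap alongside the communication,
    //     then re-measure both times in this step to update the estimates.
    return 0;
}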

#11 Updated by Berk Hess about 4 years ago

We had made a quite concrete plan for the tasking, but Roland probably wasn't at that meeting. The tasking framework is the only choice we haven't made yet. The plan is to use large tasks which are initially scheduled statically, but can be preempted when tasks on the critical path become ready to run. We might want task stealing at the end for load balancing, but that is something we need to look into as we progress. For runs with PME, the critical path is the whole of PME (and constraints). In most cases we should be able to make a good prediction of the time the communication takes based on what we observe over previous steps, so we might be able to schedule everything, including the PME communication and calculation tasks, statically. So maybe we won't need preemption at all?

#12 Updated by Szilárd Páll about 4 years ago

A lot has been discussed since, but I dug up the meeting notes from the first meeting where the threading/tasking library's specs were drafted:
https://docs.google.com/document/d/1EaYSV8bgNx6n96_Af9QX_5llPbxoU66Q3GX2MtEdnQc/edit?usp=sharing

#13 Updated by Roland Schulz about 4 years ago

I think we should be able to schedule everything and won't need work stealing or preemption. If we implement the critical path (PME) as a non-task on the master thread (as it is now), then we wouldn't need 2 queues. We would only need the one queue (the former low-priority one).

#14 Updated by Roland Schulz about 4 years ago

If we might not actually need a 2nd queue, what do we need from the tasking which isn't available in OpenMP?

#15 Updated by Roland Schulz about 4 years ago

MSVC doesn't support OpenMP 3, and thus OpenMP tasking isn't supported by all our tier 1 compilers.

Today we had a chat with CJ from Intel about hStreams. Currently it is only available as part of MPSS and requires a MIC to be present in the system, but both limitations should be fixed soon. It should be able to do all the tasking we want. For loop-level parallelism we would keep using OpenMP, which it interoperates with. We haven't tried it out yet, but it might be another option.

#16 Updated by John Eblen about 4 years ago

Found this project for compiling TBB with CMake:

https://github.com/wjakob/tbb

#17 Updated by Roland Schulz about 4 years ago

John uploaded our proof of concept of a static scheduling framework: https://gerrit.gromacs.org/#/c/5417

#18 Updated by Roland Schulz almost 4 years ago

Any new thoughts on the tasking in general and the choice of the framework in particular? Has anyone had a chance to look at our proof of concept, and does anyone have an opinion on whether writing our own framework or using TBB is the better choice?

#19 Updated by Mark Abraham almost 4 years ago

Roland Schulz wrote:

Any new thoughts on the tasking in general and the choice of the framework in particular? Has anyone had a chance to look at our proof of concept, and does anyone have an opinion on whether writing our own framework or using TBB is the better choice?

Unfortunately I don't think anyone has had time for this :-(

#20 Updated by Roland Schulz almost 4 years ago

Would it make sense to have a call to agree on the process and timeline of how we want to decide the framework?

#21 Updated by John Eblen almost 4 years ago

I created a doodle poll to decide on a time for a call:

http://doodle.com/poll/nz6m2pigdei9u8gh

#22 Updated by John Eblen almost 4 years ago

Here is a list of options that have been considered so far (in no particular order) and, briefly, their pros and cons as I understand them. Please feel free to add to, make corrections, criticize, etc.:

1) TBB
Pros: well-known, good documentation, has hooks to modify scheduling.
Cons: Design is based on the concept of dynamic scheduling with work-stealing.

2) hStreams
Pros: well-designed, good community, interoperates with OpenMP.
Cons: only guaranteed to work with Intel OpenMP and does not support loop-level parallelism without OpenMP.

3) OpenMP Tasking
Pros: Could be built on top of current code and would allow incremental development.
Cons: No control of individual threads, poor pragma-oriented design, and MSVC still only supports OpenMP 2, with no known plans to change.

4) Expanding MPI. (Expand our use of MPI to support tasking and continue using OpenMP for loop-level parallelism.)
Pros: Could be built on top of current code and would allow incremental development.
Cons: Reliance on OS starting and stopping of processes. Requirement to divide OpenMP threads among MPI processes at startup would make scheduling difficult.

5) STS (static scheduling framework) mentioned above
Pros: Full control of individual threads and thread scheduling.
Cons: It is our own code, and so we would take care of maintenance, debugging, etc.

Thoughts:
1) If we could find a way to support non-OpenMP loop-level parallelism with hStreams, hStreams becomes a reasonable option.
2) If we could use modern OpenMP tasking support, 3 is at least worth considering. Being able to develop incrementally is, IMO, a huge benefit.

#23 Updated by Roland Schulz almost 4 years ago

Even if we could require OpenMP 3 (MSVC is only needed for CUDA on Windows), AFAIK OpenMP still has the problem that you cannot change the number of threads used by a task while the task is running.

Other frameworks to look at fall into two groups: the larger general-purpose frameworks and the lightweight schedulers.

I think the larger frameworks do much more than we want. The lightweight ones don't do that much more than STS already does, and they don't have the special scheduling strategy. But they might be useful to learn lessons from. Of course we can also look at TBB (note we cannot copy code because it is GPL) and OpenMP for that.

#24 Updated by Szilárd Páll almost 4 years ago

Roland Schulz wrote:

Even if we could require OpenMP 3 (MSVC is only needed for CUDA on Windows), AFAIK OpenMP still has the problem that you cannot change the number of threads used by a task while the task is running.

As you don't have control over the thread-pool, AFAIK it is not even safe to change the number of threads across invocations of a task without screwing up locality and/or affinity of the threads.

#25 Updated by Berk Hess almost 4 years ago

My question on Gerrit was how to manage the thread pool. The framework could do this, but then we probably need to build a lot of intelligence into it. The other solution is to manually manage the (hardware) thread pool. An example could be that at the start of the force calculation we divide the pool into M PP and N PME threads. We can later divide the M PP threads into a bonded and a non-bonded pool. Then the question is indeed how we would do computation during e.g. the PME communication. We could explicitly schedule a task that fits in there, which would get scheduled when PME releases the thread pool during FFT communication.
Note that we also have to deal with the issue of (not) running MPI thread-parallel. The funneled mode seems to be slow. Ideally we would use a separate thread for MPI, but we can only afford that with very many threads.
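
Purely to illustrate the manual pool management described above, a hypothetical sketch of hierarchically splitting a pool of hardware threads could look as follows; ThreadPool and split() are invented names, not an existing GROMACS or STS API.

// Hypothetical sketch of hierarchically splitting a hardware-thread pool;
// ThreadPool and split() are invented names, not an existing API.
#include <cstddef>
#include <vector>

struct ThreadPool {
    std::vector<int> threads;  // hardware-thread ids owned by this pool

    // Move the first n threads into a new sub-pool.
    ThreadPool split(std::size_t n) {
        ThreadPool sub;
        sub.threads.assign(threads.begin(), threads.begin() + n);
        threads.erase(threads.begin(), threads.begin() + n);
        return sub;
    }
};

int main() {
    ThreadPool rankPool{{0, 1, 2, 3, 4, 5, 6, 7}};  // all threads of this rank
    ThreadPool pme = rankPool.split(2);             // N PME threads
    ThreadPool pp  = rankPool;                      // remaining M PP threads
    ThreadPool bonded    = pp.split(2);             // later subdivision of PP
    ThreadPool nonbonded = pp;
    // A pre-sized task could be handed to the PME sub-pool while it is idle
    // during FFT communication, as discussed above.
    (void)pme; (void)bonded; (void)nonbonded;
    return 0;
}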

#26 Updated by Roland Schulz almost 4 years ago

Yes, that part is missing. And there are multiple possible strategies, which can be more or less automatic.

Is it true that funneled is slow for all MPI implementations, or does that depend on the MPI implementation/hardware?
I assume we still want to basically end up with a task layout as in overlap.png, right?

I think for thread 0 (the MPI thread) it would make sense to manually program the order of tasks. Also the relative order of local/non-local/PME is probably easier when hard-coded. But it is probably better if the number of threads and the number of iterations (when filling in) are auto-tuned, not hard-coded.

Would it always be ideal to do comm-x before starting PME and comm-f after PME is finished? Or would we sometimes want to overlap comm-x with spread and/or comm-f with gather?

I think it would help to make a more complete version of overlap.png which contains all communications and the potentially fastest ways to schedule the tasks.

If we know exactly how we want to schedule, then we can probably see better which parts are easier to accomplish with manual coding, which with auto-tuning, and how much flexibility we need. Then we can find a good way to express this in the code.

#27 Updated by Szilárd Páll almost 4 years ago

I just realized that we have a meeting tomorrow and I owe the discussion thread a set of benchmark numbers - sorry for the delay.

These are the OpenMP vs TBB micro-benchmarking measurements I did for our meeting last fall. However, I verified and reran the benchmarks, just to see that it's all consistent.

The codes used are:

I compared the parallel for overhead on an 8C Core i7-5960X, clocked at 3.6 GHz (fixed for all cores):

TBB 
"PARALLEL FOR overhead" in microsec (defaults options)
Compiled with -O2 [-march=native causes up to a few % slowdown]
----------------------------------------------------------
          8T                           16T
gcc 4.8   3.065300 +/- 0.151823        4.143187 +/- 0.111542        
gcc 5.2   2.958170 +/- 0.173167        4.016831 +/- 0.017453
icc 16    3.018276 +/- 0.107819        4.007823 +/- 0.146283
----------------------------------------------------------

OpenMP
----------------------------------------------------------
          8T                           16T    
gomp 4.8  1.679461 +/- 0.026376        2.729979 +/- 8.323580
gomp 5.2  1.368941 +/- 0.018772        2.071785 +/- 0.094030
iomp 16   0.813586 +/- 0.041378        1.240630 +/- 1.385311
----------------------------------------------------------

Note that I made sure to spin up the processor before starting each run.
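
For context, the two constructs whose overheads are compared above have roughly this shape. This is a schematic only, not the benchmark code itself; delay() stands in for the tiny per-iteration workload such microbenchmarks use.

// Schematic of the constructs whose overhead is being compared; this is not
// the actual benchmark code, and delay() is a stand-in for the tiny
// per-iteration workload such benchmarks use.
#include <tbb/parallel_for.h>

void delay(int) { /* a few hundred nanoseconds of dummy work */ }

void openmp_loop(int n) {
#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        delay(i);
    }
}

void tbb_loop(int n) {
    tbb::parallel_for(0, n, [](int i) { delay(i); });
}

int main() {
    openmp_loop(1024);
    tbb_loop(1024);
    // The overhead is estimated as the difference between the time of many
    // such parallel invocations and the time of the equivalent serial loop.
    return 0;
}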

#28 Updated by Roland Schulz almost 4 years ago

The overlap of PME comm with other calculations could be done with OpenMP. We could create chunks of work of the right size (as we also plan for STS, based on measurements of previous steps). Those could be executed on the underutilized threads during the PME comm, either by using OpenMP tasks or simply by having an omp for in which the non-zero loop iterations execute those chunks while iteration zero executes the PME comm (see the sketch below). This should be sufficient for a small number of threads (small being less than 4-16, depending on hardware and system size), because for small numbers we just need the overlapping. But for larger numbers we would also like to do multiple calculations in parallel (the graph showing PME compute only using some threads), because some computations don't scale well to larger numbers of threads. Doing multiple calculations in parallel is easy in OpenMP with nested parallelism, but the performance might depend on whether the OpenMP runtime uses hot teams for nested parallelism.
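
A minimal sketch of the plain omp-for variant, assuming do_pme_comm() and overlap_chunk() as placeholder functions and chunk sizes taken from the previous-step measurements:

// Sketch of overlapping PME communication with pre-sized compute chunks in a
// plain OpenMP loop; do_pme_comm() and overlap_chunk() are placeholders.
#include <omp.h>

void do_pme_comm()        { /* MPI communication for PME would go here */ }
void overlap_chunk(int c) { /* pre-sized piece of bonded/other work */ (void)c; }

void pme_comm_with_overlap()
{
    int nthreads = omp_get_max_threads();
    // One iteration per thread: iteration 0 drives the communication, the
    // others each execute one chunk of compute, sized from previous-step
    // timings so that it finishes roughly when the communication does.
#pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < nthreads; i++) {
        if (i == 0) {
            do_pme_comm();
        } else {
            overlap_chunk(i - 1);
        }
    }
}

int main()
{
    pme_comm_with_overlap();
    return 0;
}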

#29 Updated by Szilárd Páll almost 4 years ago

Last time I tried nested tasks, the performance wasn't very good. Do you know if this has changed, and if it did, whether the change can be expected across a wider range of OpenMP implementations?

#30 Updated by Szilárd Páll almost 4 years ago

Szilárd Páll wrote:

Last time I tried nested tasks, the performance wasn't very good. Do you know if this has changed, and if it did, whether the change can be expected across a wider range of OpenMP implementations?

I had a quick look at what the OpenMP microbenchmark gives for the tasking-related tests. Attached are the iomp 16 and gomp 5.2 results. The numbers are somewhat alarming: at least with HT, the gomp tasking overheads measured are in the hundreds of microseconds. I haven't had time to double-check, so this needs further analysis.

#31 Updated by Berk Hess about 3 years ago

At the hackathon in Barcelona this week we had a fruitful discussion on task parallelization in GROMACS. It seems that with relatively little effort and code we can achieve all the basic things we want in GROMACS with STS. The discussions focused on two important aspects that STS does not cover yet:

The first aspect is how to preempt low-priority (usually local) work to run high-priority (usually non-local) work when the dependencies of the non-local work are fulfilled. Options:
1) Static: divide the low-priority work into #high-priority-regions+1 pieces, or many more.
Disadvantages: For some tasks it might be difficult to generate accurate splits, leading to imbalance. Dynamic imbalance between low-priority tasks cannot be handled. Dynamic wait times for high-priority tasks cannot be handled optimally, e.g. the first FFT comm takes much more time due to waiting on imbalance, the next ones are faster because no imbalance is left.
2) Dynamic: pause tasks. A relatively simple way of achieving this is by passing a yield function to the tasks that support pausing; such a task should then periodically call the yield function. The yield function returns immediately when no high-priority task can run. When a high-priority task is ready, the yield function calls STS; the low-priority task remains on the stack and continues execution when the high-priority task is finished and the yield function returns (see the sketch below).
Disadvantages: The call stack for tasks can be different, nested STS calls. STS needs to execute the high-priority task in case the low-priority task finishes before the high-priority task could run.
Advantages: Fully dynamic, does not require any "manual" task partitioning. The overhead of the inlined yield function can be so low that it can be called often, e.g. every outer-loop iteration in the non-bonded kernels.
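
A sketch of what the yield hook of option 2 could look like from the task's side; the names used here (yield, highPrioTask, local_work) are illustrative and not the STS API.

// Illustrative sketch of option 2: a low-priority task periodically calls a
// yield hook so a ready high-priority task can run on the same thread.
// The names are invented for illustration, not the STS API.
#include <atomic>
#include <functional>

std::atomic<bool> highPrioReady{false};  // set when dependencies are fulfilled
std::function<void()> highPrioTask;      // e.g. non-local force kernel

// The yield hook passed to pausable low-priority tasks. It returns
// immediately when nothing high-priority is ready; otherwise it runs the
// high-priority task on this thread and then resumes the caller.
inline void yield()
{
    if (highPrioReady.exchange(false)) {
        highPrioTask();  // the low-priority task stays on the stack meanwhile
    }
}

// A low-priority (e.g. local nonbonded) task that supports pausing: the
// check is cheap enough to be done every outer-loop iteration.
void local_work(int nOuter)
{
    for (int i = 0; i < nOuter; i++) {
        /* ... one outer-loop iteration of the kernel ... */
        yield();
    }
}

int main()
{
    highPrioTask  = [] { /* non-local force computation */ };
    highPrioReady = true;  // pretend the non-local data just arrived
    local_work(100);
    return 0;
}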

The other aspect is how to specify the schedule and the dependencies.
We already decided before to focus, for the time being, on pre-defined schedules only. Multiple of these could apply to the same simulation; we would time several during a run and see which is the fastest. We hope that there is a limited number of relevant ones, so this stays manageable. We have not yet discussed how to implement these schedules.
Dependencies are straightforward to implement at the low level. The question is how we should specify them. Should they be specified in the schedule? Or should we define them in the code at (or close to) the lambdas? Can we automatically verify that the combination of schedule and dependencies is sane?

#32 Updated by John Eblen about 3 years ago

Below are loop overhead results for OMP and STS

Runs were on a single node of NERSC Edison (2 sockets, 12 cores per socket, 2 hyperthreads per core)

Thread affinity should be "core,compact" for both (set explicitly using KMP_AFFINITY env. variable for OMP and "sched_setaffinity" function for STS)

Each result is the median of three runs

Threads   OMP        STS        (microseconds)
12        1.316482   1.858640
24        1.686927   2.288970
48        3.057280   4.524137

STS Benchmarking code is here:

https://github.com/eblen/STS-Benchmark

This code modifies the OpenMP benchmarking code available here:

https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite

#33 Updated by Szilárd Páll about 3 years ago

Thanks for the data. This is #omp parallel for overhead? Do you measure barrier overheads?
I assume 12 threads means 6 cores 2 threads/core and compact means consecutive threads are assigned to consecutive cores (2/core)?

#34 Updated by John Eblen about 3 years ago

Yes, this is #omp parallel for overhead.

The current STS barrier overhead is quite high. Here are the results:

Threads   OMP    STS       (microseconds)
12        0.83    18.80
24        1.07    69.51
48        2.01   468.32

The thread affinity is as you described for OMP, but is actually not what I intended. I meant to have 1 thread/core for 12 and 24, which is how the STS runs are configured.

I will rerun and update the results.

#35 Updated by Szilárd Páll about 3 years ago

John Eblen wrote:

Threads   OMP    STS       (microseconds)
12        0.83    23.31
24        1.07   172.32
48        2.01   450.98

Indeed, that's massive. :)

The thread affinity is as you described for OMP, but is actually not what I intended. I meant to have 1 thread/core for 12 and 24, which is how the STS runs are configured.

Note that both cases are actually interesting for us, so for both OpenMP and STS it would be useful to see 1/2, 1, and 2 sockets, each with 1 and 2 threads/rank. Perhaps even fewer than 6 cores in a team would be valuable data. The same goes for the barrier, where it's probably even more relevant to see both 1 and 2 threads/core (but that can wait till STS gets better).

#36 Updated by John Eblen about 3 years ago

The above times for parallel for and barrier should now be correct. I reran STS using a "core, compact" layout (2 threads/core for 12 and 24).

I'll post more results later.

#37 Updated by Berk Hess about 3 years ago

These timing differences are so large that they can likely be improved with little effort. Do you have an idea what could cause such a large difference?

#38 Updated by John Eblen about 3 years ago

First, note that the barrier timings are only for the explicit many-to-many barrier, which hopefully will not be used much in GROMACS. The loop timings, meanwhile, include implicit one-to-many and many-to-one barriers.

The many-to-many barriers may be slow because they were made to be reusable, since they are often used inside loops. This requires extra work to maintain consistency.
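
To illustrate why reusability costs something, a textbook sense-reversing barrier (not the actual STS implementation) looks roughly like this; the per-thread sense flag and the counter reset are the extra work needed to make the barrier safely reusable inside loops.

// Textbook sense-reversing (reusable) many-to-many barrier, for illustration
// only; this is not the STS implementation.
#include <atomic>
#include <thread>
#include <vector>

class SenseBarrier {
public:
    explicit SenseBarrier(int n) : nThreads_(n), remaining_(n), sense_(false) {}

    // Each participating thread keeps its own mySense flag across calls;
    // flipping the shared sense is what makes the barrier safely reusable.
    void wait(bool *mySense)
    {
        bool localSense = !*mySense;
        if (remaining_.fetch_sub(1) == 1) {
            remaining_.store(nThreads_);   // reset before releasing waiters
            sense_.store(localSense);      // release all waiters
        } else {
            while (sense_.load() != localSense) {
                std::this_thread::yield(); // spin (or back off) until released
            }
        }
        *mySense = localSense;
    }

private:
    const int         nThreads_;
    std::atomic<int>  remaining_;
    std::atomic<bool> sense_;
};

int main()
{
    const int n = 4;
    SenseBarrier barrier(n);
    std::vector<std::thread> threads;
    for (int t = 0; t < n; t++) {
        threads.emplace_back([&barrier] {
            bool mySense = false;
            for (int iter = 0; iter < 3; iter++) {
                /* ... work ... */
                barrier.wait(&mySense);    // reused every iteration
            }
        });
    }
    for (auto &th : threads) { th.join(); }
    return 0;
}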

#39 Updated by Berk Hess about 3 years ago

But we do have barriers in many places, right? I guess that explains the large improvement when running the LINCS task on fewer threads next to SETTLE.

Can you simply replace the many-to-many barrier by a one-to-many plus a many-to-one?

#40 Updated by John Eblen about 3 years ago

git grep "omp barr"

src/gromacs/fft/fft5d.cpp:#pragma omp barrier /*barrier required before AllToAll (all input has to be their) - before timing to make timing more acurate*/
src/gromacs/fft/fft5d.cpp:#pragma omp barrier /*both needed for parallel and non-parallel dimension (either have to wait on data from AlltoAll or from last FFT*/
src/gromacs/gmxana/gmx_hbond.cpp:#pragma omp barrier
src/gromacs/gmxana/gmx_hbond.cpp:#pragma omp barrier
src/gromacs/gmxana/gmx_hbond.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/clincs.cpp:#pragma omp barrier
src/gromacs/mdlib/minimize.cpp:#pragma omp barrier
src/gromacs/mdlib/minimize.cpp:#pragma omp barrier
src/gromacs/mdlib/nbnxn_atomdata.cpp:#pragma omp barrier
src/gromacs/mdlib/vsite.cpp:#pragma omp barrier
src/gromacs/mdlib/vsite.cpp:#pragma omp barrier

So barriers are in a few places, but not too common.

I agree that it should not be hard to improve m-to-m barrier performance.
