Bug #1374

add warning that separate PME ranks are never used with GPUs

Added by Szilárd Páll almost 4 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Low
Category:
mdrun
Target version:
4.6.6
Affected version - extra info:
Affected version:
4.6.x
Difficulty:
uncategorized

Description

With GPUs, separate PME ranks are never used. This is partly because resource splitting between PP and PME ranks is too complex in GPU-accelerated runs to manage in an automated fashion. Secondly, as the CPU computes PME and bonded interactions overlapping with the GPU, offloading PME to separate ranks leaves very little computation on the CPU to overlap with the GPU. Additionally, the number of PP (or PP+PME) ranks is closely linked to the number of GPUs, and therefore the user would anyway have to make sure that enough ranks are launched to accommodate PP ranks for all GPUs as well as separate PME ranks.

However, in practice, partly due to limited multi-threaded scaling, using separate PME ranks (which implicitly means fewer threads are used for both PP and PME) does improve performance. Most notably, on the Cray XK7 this is already true at moderate node counts.
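As an illustration, a manual setup along these lines could look like the following sketch (the node count, rank split, and file name are assumed here, not taken from this report): on four nodes with two GPUs each, launch one extra rank per node for PME and map the remaining PP ranks to the GPUs explicitly:

mpirun -np 12 mdrun_mpi -npme 4 -gpu_id 01 -s topol.tpr

Here 8 PP ranks use the GPUs (one -gpu_id entry per PP rank on each node) and 4 ranks do only PME; the exact placement of the PME ranks depends on -ddorder and on how the MPI launcher maps ranks to nodes.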

Some related discussion can be found here: https://gerrit.gromacs.org/#/c/2694 (comments on patch set 15-17).


Related issues

Related to GROMACS - Bug #1148: switching to separate PME nodes with hybrid parallelization (Closed)

Associated revisions

Revision 094302b3 (diff)
Added by Berk Hess over 3 years ago

Updated mdrun -npme documentation

The number of nodes at which PME nodes are used has increased.
Added note on PME nodes not being selected automatically with GPUs.

Fixes #1374.

Change-Id: Ie1de87abd3d1204d99af8b4f8e6809e7806f5c08

Revision 31cc5ae9 (diff)
Added by Berk Hess over 2 years ago

Don't use PME ranks with GPUs and -npme=-1

The code disabling the automated PME rank choice with GPUs was
accidentally moved after init_domain_decomposition. This caused
PME ranks to be set up, but later a fatal_error occurred for
inconsistent PP rank and GPU counts.
Refs #1374.

Change-Id: I5f6bcc90fecac7f63b332b8f1acca7368b5f71bc

History

#1 Updated by Szilárd Páll almost 4 years ago

  • Description updated (diff)

#2 Updated by Berk Hess almost 4 years ago

So there seem to be two regimes where PME nodes can help with GPUs:
1) Many physical nodes/MPI ranks. Here the user starts MPI with a number of ranks and we have the number of GPUs given, so we can't really freely choose PME ranks, since #MPI-ranks == #GPUs.
2) A single node. But here we won't reach our current MPI rank limit, since we won't have more than 4 GPUs in a single node.
So although PME ranks might help, we can't automate for 1) and case 2) is very different from the current switching.
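To illustrate case 2), a manual single-node setup could look like the following sketch (the rank/thread counts and the .tpr name are assumed): with thread-MPI, -npme requests separate PME ranks and -gpu_id then needs one entry per PP rank:

mdrun -ntmpi 6 -ntomp 2 -npme 2 -gpu_id 0011 -s topol.tpr

i.e. 4 PP ranks sharing the 2 GPUs plus 2 separate PME ranks.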

#3 Updated by Szilárd Páll almost 4 years ago

Berk Hess wrote:

So although PME ranks might help, we can't automate for 1) and case 2) is very different from the current switching.

I agree. Note that I am not suggesting to automate the switching, but to warn the user that separate PME ranks may help, and that he/she has to set them up manually. Some criteria on when to trigger the warning would be nice (based on the total number of cores/hardware threads used?) to avoid issuing spurious notes. However, as noted in the gerrit discussion, to be able to do this we'd need to solve #1148 first.

#4 Updated by Erik Lindahl over 3 years ago

  • Tracker changed from Bug to Feature

#5 Updated by Szilárd Páll over 3 years ago

  • Tracker changed from Feature to Bug
  • Affected version set to 4.6.x

Warning about a limitation of the implementation is definitely not a feature.

#6 Updated by Gerrit Code Review Bot over 3 years ago

Gerrit received a related patchset '1' for Issue #1374.
Uploader: Berk Hess ()
Change-Id: Ie1de87abd3d1204d99af8b4f8e6809e7806f5c08
Gerrit URL: https://gerrit.gromacs.org/3590

#7 Updated by Berk Hess over 3 years ago

  • Status changed from New to Fix uploaded
  • Target version changed from 4.6.x to 4.6.6

I don't think this requires a warning. I updated mdrun -h with a note on PME nodes not being selected automatically with GPUs. Since now mdrun does what it says, a warning is not necessary.

#8 Updated by Berk Hess over 3 years ago

  • Status changed from Fix uploaded to Resolved
  • % Done changed from 0 to 100

#9 Updated by Mark Abraham over 3 years ago

  • Status changed from Resolved to Closed

#10 Updated by Szilárd Páll over 3 years ago

I missed the part that the recently merged change "fixes" this.

Most users expect that mdrun or mpirun -np N mdrun_mpi just works optimally. Don't you think the user should be explicitly warned, if nothing else, at the rank count where the switching would automatically happen in CPU-only runs (unless #1148 gets fixed), that he/she should really consider separate PME ranks? On all machines where I have run so far, using separate PME ranks was always faster above 4-8 sockets.

#11 Updated by Szilárd Páll over 2 years ago

Actually, this is not true; the fixing commit added an incorrect statement to the docs:


$ $mdrun -version 2>&1 | grep '   VERSION'
Gromacs version:    VERSION 4.6.6-dev-20140522-d77dddb

$mdrun -ntmpi 32 -ntomp 1 -gpu_id $(strrep 0 16)$(strrep 1 16) -s ../topol.tpr
[...]
Will use 24 particle-particle and 8 PME only nodes
This is a guess, check the performance at the end of the log file
Using 32 MPI threads
Using 1 OpenMP thread per tMPI thread
Compiled acceleration: AVX_256 (Gromacs could use AVX_128_FMA on this machine, which is better)

2 GPUs detected:
  #0: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible
  #1: NVIDIA GeForce GTX TITAN, compute cap.: 3.5, ECC:  no, stat: compatible

2 GPUs user-selected for this run.
Mapping of GPUs to the 24 PP ranks in this node: #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #0, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1, #1

-------------------------------------------------------
Program mdrun, VERSION 4.6.8-dev-20150212-c060264
Source code file: /nethome/pszilard/projects/gromacs/gromacs-4.6/src/gmxlib/gmx_detect_hardware.c, line: 380

Fatal error:
Incorrect launch configuration: mismatching number of PP thread-MPI threads and GPUs.
mdrun was started with 24 PP thread-MPI threads, but you provided 32 GPUs.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
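For reference, a launch that keeps the PP rank count and the -gpu_id string consistent has to request the split explicitly; a sketch (assuming the same strrep helper, which repeats a character, and the same 24 PP + 8 PME split that the automated guess picked):

$mdrun -ntmpi 32 -ntomp 1 -npme 8 -gpu_id $(strrep 0 12)$(strrep 1 12) -s ../topol.tpr

so that the 24-entry -gpu_id string matches the 24 PP thread-MPI ranks.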

#12 Updated by Gerrit Code Review Bot over 2 years ago

Gerrit received a related patchset '1' for Issue #1374.
Uploader: Berk Hess ()
Change-Id: I5f6bcc90fecac7f63b332b8f1acca7368b5f71bc
Gerrit URL: https://gerrit.gromacs.org/4619
