Bug #2798

Default MPI rank number fails when there are 16 cores and 3 GPUs

Added by Joe Jordan 5 months ago. Updated 5 months ago.

Status: New
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Starting mdrun as follows
srun gmx mdrun -s part.tpr
on a node with 16 cores and 3 GPUs gives a domain decomposition error:

GROMACS: gmx mdrun, version 2018.4
Executable: /opt/tcbsys/gromacs/2018.4/AVX2_256/bin/gmx
Data prefix: /opt/tcbsys/gromacs/2018.4/AVX2_256
Working dir: /nethome/jjordan/projects/em-md/glyR/3jaf-apo/production2
Command line:
gmx mdrun -deffnm md.part0001 -s md.part0001.tpr -maxh 23.8

Reading file md.part0001.tpr, VERSION 2018.1 (single precision)

-------------------------------------------------------
Program: gmx mdrun, version 2018.4
Source file: src/gromacs/domdec/domdec.cpp (line 6594)
MPI rank: 0 (out of 15)

Fatal error:
There is no domain decomposition for 15 ranks that is compatible with the
given box and a minimum cell size of 3.2475 nm
Change the number of ranks or mdrun option -rcon or -dds or your LINCS
settings
Look in the log file for details on the domain decomposition

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
srun: error: gpu13: task 0: Exited with exit code 1

This is fixed if I run
srun mdrun -s part.tpr -gpu_id 12

I think at least the error message should change to reflect that the problem is with the hardware configuration.
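For reference, another way around it should be to request the rank count explicitly, e.g. 3 thread-MPI ranks with 5 OpenMP threads each (illustrative only, I have not tested this on this node):
srun gmx mdrun -s part.tpr -ntmpi 3 -ntomp 5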

md.part0001.log.1 (15.6 KB) Joe Jordan, 12/10/2018 11:58 AM

History

#1 Updated by Mark Abraham 5 months ago

15 got chosen because it has a factor of three (so the ranks can be divided evenly over the 3 GPUs), but we should probably extend that code to permit three ranks with 5, 5, and 6 OpenMP threads in such cases.

#2 Updated by Berk Hess 5 months ago

We could even use 3 ranks with 5 threads.

#3 Updated by Berk Hess 5 months ago

  • Category set to mdrun
  • Assignee set to Berk Hess
  • Target version set to 2020

The issue here is that we will not use 3 ranks with 10 threads, because 10 threads is too many for efficiency.
Fixing this requires checking DD restrictions at the rank/thread decision point. In turn, this requires allowing DD restriction checking, which is currently technically not possible, but which would be really useful.

#4 Updated by Joe Jordan 5 months ago

I immediately guessed that the issue was the work not being evenly divisible and was able to simply use fewer resources to make the simulation run. I suspect that not all users will know to do this, and I think that, at the very least, the error message should be changed to reflect that this problem can be overcome by specifying the number of ranks or GPUs (perhaps in addition to changing the DD options; I did not actually test whether that solves the problem, though I suspect it would not).

#5 Updated by Gerrit Code Review Bot 5 months ago

Gerrit received a related patchset '1' for Issue #2798.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~I7b729a821530c8a0158b8c3fcba0f48bab276e6c
Gerrit URL: https://gerrit.gromacs.org/8808

#6 Updated by Szilárd Páll 5 months ago

Berk Hess wrote:

We could even use 3 ranks with 5 threads.

I don't think that will be a good configuration. Generally, if the system is small it's not worth decomposing with GPUs (and the multi-GPU code facilities). Otherwise, >=2 ranks per GPU are better, so here 8 ranks with 2 cores each and a 00001122 or similar mapping would be preferable. Also note that the GPUs in this particular machine are of vastly different speeds, so an uneven rank mapping would in fact be best -- I know because this is a dev machine set up with intentional heterogeneity.
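
As a rough illustration of that kind of setup (hypothetical, not benchmarked here, and assuming the 2018 -gputasks syntax for the per-rank GPU mapping):
gmx mdrun -ntmpi 8 -ntomp 2 -gputasks 00001122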

#7 Updated by Szilárd Páll 5 months ago

@Berk: for smaller systems the setup you tested is indeed somewhat better, i.e. 4x8 is faster than 8x4, and in fact for many larger inputs 4 ranks is about the same as 8 ranks. However, as we've seen, the shift is not so clear-cut, and in 12-core cases 2x12 will be a lot worse than both 4x6 and 6x4.

Side note: with the current defaults, what makes an even larger impact is not using SMT; there is up to a 10% penalty for that (even with a 200k-atom system on 16 cores + 2 GPUs with DD there is a 5-6% penalty).

See:
Threadripper 1950X + 2x GV100
/nethome/pszilard-projects/gromacs/testing/grappa/grappa-045/bonded-test/test_threadripper-gpu01
/nethome/pszilard-projects/gromacs/testing/grappa/grappa-180/bonded-test/prof_threadripper-gpu01

i9-7920X + 2x GV100
/nethome/pszilard-projects/gromacs/testing/grappa/grappa-045/bonded-test/test_skylake-x-gpu01
/nethome/pszilard-projects/gromacs/testing/grappa/grappa-180/bonded-test/test_skylake-x-gpu01

#8 Updated by Berk Hess 5 months ago

The change I made tunes to using 8 or fewer OpenMP threads (instead of 6). The limit of 12 threads is only used for warning the user.

#9 Updated by Berk Hess 5 months ago

Without DD, we already have a limit on SMT use of #atoms < 1000*#cores^2.
We don't have that limit with DD; should we?
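(For reference, on the 16-core node from this report that limit comes to 1000*16^2 = 256000 atoms.)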

#10 Updated by Berk Hess 5 months ago

The limit on SMT (without DD) is only with PME on GPU. We should actually have that without PME as well.
With DD we have, by default, PME on the CPU, so we would need a limit for where PME becomes less efficient with SMT.
