Bug #2798

Default MPI rank number fails when there are 16 cores and 3 GPUs

Added by Joe Jordan over 1 year ago. Updated 6 months ago.



Starting mdrun as follows
srun gmx mdrun -s part.tpr
on a node with 3 GPUs gives a domain decomposition error.

GROMACS: gmx mdrun, version 2018.4
Executable: /opt/tcbsys/gromacs/2018.4/AVX2_256/bin/gmx
Data prefix: /opt/tcbsys/gromacs/2018.4/AVX2_256
Working dir: /nethome/jjordan/projects/em-md/glyR/3jaf-apo/production2
Command line:
gmx mdrun -deffnm md.part0001 -s md.part0001.tpr -maxh 23.8

Reading file md.part0001.tpr, VERSION 2018.1 (single precision)

Program: gmx mdrun, version 2018.4
Source file: src/gromacs/domdec/domdec.cpp (line 6594)
MPI rank: 0 (out of 15)

Fatal error:
There is no domain decomposition for 15 ranks that is compatible with the
given box and a minimum cell size of 3.2475 nm
Change the number of ranks or mdrun option -rcon or -dds or your LINCS
Look in the log file for details on the domain decomposition

For more information and tips for troubleshooting, please check the GROMACS
website at
srun: error: gpu13: task 0: Exited with exit code 1

This is fixed if I run
srun mdrun -s part.tpr -gpu_id 12

I think at least the error message should change to reflect that the problem is with the hardware configuration.

md.part0001.log.1 (15.6 KB) Joe Jordan, 12/10/2018 11:58 AM


#1 Updated by Mark Abraham over 1 year ago

15 got chosen because it has a factor of three, but we should probably rework that code to permit three ranks with 5, 5, and 6 OpenMP threads in such cases.
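The heuristic described here can be sketched as follows (illustrative Python, not the actual GROMACS source; the function names and the exact rank-selection rule are guesses based on this comment):

```python
# Sketch (not GROMACS source): a guess at why 15 ranks get picked on
# 16 cores + 3 GPUs, and how an uneven per-rank thread split could
# keep all 16 cores busy with only 3 ranks.

def default_rank_count(ncores: int, ngpus: int) -> int:
    """Largest rank count <= ncores that is divisible by the GPU count,
    so ranks can be spread evenly over GPUs (assumed heuristic)."""
    for n in range(ncores, 0, -1):
        if n % ngpus == 0:
            return n
    return 1

def uneven_thread_split(ncores: int, nranks: int) -> list:
    """Split ncores over nranks as evenly as possible, e.g. 16 over 3 -> 6, 5, 5."""
    base, extra = divmod(ncores, nranks)
    return [base + 1] * extra + [base] * (nranks - extra)

print(default_rank_count(16, 3))   # -> 15, matching the report
print(uneven_thread_split(16, 3))  # -> [6, 5, 5]
```

With the uneven split, 3 ranks map cleanly onto the 3 GPUs while no core sits idle, which is what an uneven-thread scheme would buy.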

#2 Updated by Berk Hess over 1 year ago

We could even use 3 ranks with 5 threads.

#3 Updated by Berk Hess over 1 year ago

  • Category set to mdrun
  • Assignee set to Berk Hess
  • Target version set to 2020

The issue here is that we will not use 3 ranks with 10 threads, because 10 threads is too many for efficiency.
Fixing this requires checking the DD restrictions at the rank/thread decision point. In turn, this requires making DD restriction checking available at that stage, which is currently not technically possible, but which would be really useful.
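What "checking DD restrictions at the rank/thread decision point" could look like can be sketched as follows (illustrative only; the box dimensions are hypothetical, the minimum cell size is taken from the error message above, and the functions are made up, not GROMACS API):

```python
from itertools import product

# Sketch (hypothetical numbers, not GROMACS source): test domain
# decomposition feasibility before committing to a rank count.  A grid
# nx*ny*nz == nranks is only usable if every cell is at least
# min_cell wide in each dimension.

def dd_grid_exists(box, nranks, min_cell):
    """Return True if some integer grid nx*ny*nz == nranks fits the box."""
    for nx, ny, nz in product(range(1, nranks + 1), repeat=3):
        if nx * ny * nz != nranks:
            continue
        if (box[0] / nx >= min_cell and
                box[1] / ny >= min_cell and
                box[2] / nz >= min_cell):
            return True
    return False

def pick_ranks(box, ncores, ngpus, min_cell):
    """Prefer the largest rank count divisible by ngpus that DD can handle."""
    for n in range(ncores, 0, -1):
        if n % ngpus == 0 and dd_grid_exists(box, n, min_cell):
            return n
    return 1

# Hypothetical ~10 nm cubic box with the 3.2475 nm minimum cell from the report:
box = (10.0, 10.0, 10.0)
print(dd_grid_exists(box, 15, 3.2475))   # -> False: every 15-cell grid needs a factor of 5 or 15
print(pick_ranks(box, 16, 3, 3.2475))    # -> 12 (3x2x2 grid fits)
```

The point of the sketch: with the restriction check available up front, the chooser would skip 15 ranks and land on a workable count instead of failing later inside domdec.cpp.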

#4 Updated by Joe Jordan over 1 year ago

I immediately guessed that the issue was the work not being divisible, and was able to make the simulation run by simply using fewer resources. I suspect that not all users will know to do this, so at the very least the error message should be changed to reflect that the problem can be overcome by specifying the number of ranks or GPUs (perhaps in addition to changing the DD options; I did not actually test whether that solves the problem, though I suspect it would not).

#5 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '1' for Issue #2798.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~I7b729a821530c8a0158b8c3fcba0f48bab276e6c

#6 Updated by Szilárd Páll over 1 year ago

Berk Hess wrote:

We could even use 3 ranks with 5 threads.

I don't think that will be a good configuration. Generally, if the system is small it's not worth decomposing with GPUs (and using the multi-GPU code paths). Otherwise, >=2 ranks per GPU are better, and therefore here 8 ranks with 2 cores each, and a 00001122 or similar mapping, would be better. Also note that the GPUs in this particular machine are of vastly different speeds, so an uneven rank mapping would in fact be best -- I know because this is a dev machine set up with intentional heterogeneity.
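For reference, the 8-rank, 2-thread setup with a 00001122 mapping described here would look something like the following in GROMACS 2018-era syntax (an illustrative sketch, not a tested command; `-gputasks` assigns one GPU id per rank, while `-gpu_id` as used in the original report only restricts which GPUs are visible):

```shell
# 8 thread-MPI ranks, 2 OpenMP threads each,
# ranks mapped to GPUs as 0,0,0,0,1,1,2,2
gmx mdrun -s part.tpr -ntmpi 8 -ntomp 2 -gputasks 00001122
```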

#7 Updated by Szilárd Páll over 1 year ago

@Berk: for smaller systems the setup you tested is indeed somewhat better, i.e. 4x8 is faster than 8x4, and in fact for many larger inputs 4 ranks is about the same as 8 ranks. However, as we've seen, the shift is not so clear-cut, and in 12-core cases 2x12 will be a lot worse than both 4x6 and 6x4.

Side-note: with the current defaults, what makes an even larger impact is not using SMT; there is an up to 10% penalty to that (even with a 200k system on 16C + 2 GPUs with DD there is 5-6% penalty).

(benchmark plots attached: Threadripper 1950X + 2x GV100; i9-7920X + 2x GV100)

#8 Updated by Berk Hess over 1 year ago

The change I made tunes towards using 8 or fewer OpenMP threads (instead of 6). The limit of 12 threads is only used for warning the user.

#9 Updated by Berk Hess over 1 year ago

Without DD, we already have a limit on SMT to #atoms < 1000*#core^2.
We don't have that with DD, should we?
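The SMT heuristic quoted above can be sketched as follows (illustrative only; the inequality is taken directly from the comment, and the function name is made up):

```python
def smt_worthwhile(natoms: int, ncores: int) -> bool:
    """Use SMT (hardware threads beyond physical cores) only when the
    system satisfies #atoms < 1000 * #cores^2, per the rule quoted above."""
    return natoms < 1000 * ncores ** 2

print(smt_worthwhile(200_000, 16))  # 200000 < 256000 -> True
print(smt_worthwhile(500_000, 16))  # 500000 >= 256000 -> False
```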

#10 Updated by Berk Hess over 1 year ago

The limit on SMT (without DD) only applies with PME on a GPU. We should actually have it without PME as well.
With DD we have PME on the CPU by default, so we would need a limit for where PME becomes less efficient with SMT.

#11 Updated by Paul Bauer 8 months ago

  • Target version changed from 2020 to 2020.1

I guess this is not important?

#12 Updated by Szilárd Páll 7 months ago

  • Status changed from New to Feedback wanted

Paul Bauer wrote:

I guess this is not important?

No; I am not sure there is anything concrete to fix here, and if there is, it would be best to distill it and file a separate issue.

#13 Updated by Paul Bauer 6 months ago

  • Status changed from Feedback wanted to Rejected

There doesn't seem to be anything to fix here.
