Default MPI rank count fails when there are 16 cores and 3 GPUs
Starting mdrun as follows
srun gmx mdrun -s part.tpr
on a node with 3 GPUs gives a domain decomposition error.
GROMACS: gmx mdrun, version 2018.4
Data prefix: /opt/tcbsys/gromacs/2018.4/AVX2_256
Working dir: /nethome/jjordan/projects/em-md/glyR/3jaf-apo/production2
gmx mdrun -deffnm md.part0001 -s md.part0001.tpr -maxh 23.8
Reading file md.part0001.tpr, VERSION 2018.1 (single precision)
Program: gmx mdrun, version 2018.4
Source file: src/gromacs/domdec/domdec.cpp (line 6594)
MPI rank: 0 (out of 15)
There is no domain decomposition for 15 ranks that is compatible with the
given box and a minimum cell size of 3.2475 nm
Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
Look in the log file for details on the domain decomposition
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
srun: error: gpu13: task 0: Exited with exit code 1
This is fixed if I run
srun gmx mdrun -s part.tpr -gpu_id 12
I think the error message should at least be changed to reflect that the problem is with the hardware configuration.
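Conceptually, the failure is a factorization problem: the rank count must factor into a 3D grid whose cells all stay above the minimum cell size. Here is a minimal sketch of that check; the box dimensions below are hypothetical, and GROMACS's real check also accounts for PME ranks, triclinic boxes, and communication constraints:

```python
def dd_grids(nranks):
    # Enumerate all (nx, ny, nz) with nx * ny * nz == nranks.
    for nx in range(1, nranks + 1):
        if nranks % nx:
            continue
        rest = nranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            yield (nx, ny, rest // ny)

def dd_possible(box, nranks, min_cell):
    """True if some grid splits the box into cells >= min_cell in every dimension."""
    return any(all(b / n >= min_cell for b, n in zip(box, grid))
               for grid in dd_grids(nranks))

box = (10.0, 10.0, 10.0)  # hypothetical cubic box, nm
min_cell = 3.2475         # from the error message
print(dd_possible(box, 15, min_cell))  # False: every factorization of 15 needs a 5 or 15
print(dd_possible(box, 8, min_cell))   # True: 2x2x2 gives 5.0 nm cells
```

This is why dropping from 15 ranks to a rank count with small prime factors (or reducing resources, as above) makes the run start.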
- Category set to mdrun
- Assignee set to Berk Hess
- Target version set to 2020
The issue here is that we will not use 3 ranks with 10 threads, because 10 threads is too many for efficiency.
Fixing this requires checking DD restrictions at the rank/thread decision point. In turn, this requires allowing DD restriction checking, which is currently technically not possible, but which would be really useful.
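To illustrate the decision being described, here is a toy model of the rank/thread split; the efficiency cutoff of 6 threads per rank and the fallback rule are assumptions for illustration, not GROMACS's actual logic. With 16 cores and 3 GPUs it lands on 15 single-threaded ranks, reproducing the failing case in the report:

```python
MAX_EFFICIENT_THREADS = 6  # assumed cutoff, not GROMACS's real value

def pick_ranks(ncores, ngpus, max_threads=MAX_EFFICIENT_THREADS):
    """Toy rank/thread heuristic: prefer a rank count that is a multiple of
    ngpus and divides ncores with few enough threads per rank; otherwise
    fall back to single-threaded ranks, one group per GPU."""
    nranks = ngpus
    while nranks <= ncores:
        if ncores % nranks == 0 and ncores // nranks <= max_threads:
            return nranks, ncores // nranks
        nranks += ngpus
    # Fallback: largest multiple of ngpus that fits, one thread per rank.
    # This is the path that produces 15 ranks here and trips the DD check.
    return (ncores // ngpus) * ngpus, 1

print(pick_ranks(16, 3))  # (15, 1): the failing configuration from the report
print(pick_ranks(12, 3))  # (3, 4): divisible case, no fallback needed
```

The point of the comment above is that the real fix is to consult the DD restrictions at this decision point, so the fallback is never chosen when it cannot decompose.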
#4 Updated by Joe Jordan 3 months ago
I immediately guessed that the issue was the work not being divisible, and I was able to make the simulation run by simply using fewer resources. I suspect that not all users will know to do this, so at the very least the error message should be changed to reflect that the problem can be overcome by specifying the number of ranks or GPUs (perhaps in addition to changing the DD options; I did not actually test whether that solves the problem, though I suspect it would not).
#6 Updated by Szilárd Páll 3 months ago
Berk Hess wrote:
We could even use 3 ranks with 5 threads.
I don't think that will be a good configuration. Generally, if the system is small it's not worth decomposing with GPUs (and the multi-GPU code paths). Otherwise, >=2 ranks per GPU are better, so here 8 ranks with 2 cores each and a 00001122 or similar mapping would be better. Also note that the GPUs in this particular machine are of vastly different speeds, so an uneven rank mapping would in fact be best -- I know because this is a dev machine set up with intentional heterogeneity.
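The 00001122 mapping above is a one-digit-per-rank GPU id string of the kind mdrun's -gpu_id option accepts. A small helper to build such strings, including uneven ones for heterogeneous GPUs (the helper name and uneven split below are illustrative, not from the thread):

```python
def gpu_map(ranks_per_gpu):
    """Build an mdrun-style GPU id string, one digit per PP rank.
    ranks_per_gpu[i] is the number of ranks pinned to GPU i (may be uneven)."""
    return "".join(str(gpu) * n for gpu, n in enumerate(ranks_per_gpu))

print(gpu_map([4, 2, 2]))  # '00001122' -- the mapping suggested above
print(gpu_map([4, 3, 1]))  # a hypothetical uneven mapping for GPUs of different speed
```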
#7 Updated by Szilárd Páll 3 months ago
@Berk: for smaller systems the setup you tested is indeed somewhat better, i.e. 4x8 is faster than 8x4, and in fact for many larger inputs 4 ranks is about the same as 8 ranks. However, as we've seen, the shift is not so clear-cut, and in 12-core cases 2x12 will be a lot worse than either 4x6 or 6x4.
Side note: with the current defaults, what makes an even larger impact is not using SMT; that carries up to a 10% penalty (even with a 200k-atom system on 16 cores + 2 GPUs with DD there is a 5-6% penalty).
Threadripper 1950X + 2x GV100
i9-7920X + 2x GV100