DLB issue with 2D/3D domain decomposition
Using the lignocellulose-rf benchmark (version from PRACE, md5sum 592f8fbcc77e7dfe221d6068b3c96b6b) with 2560 ranks, the 2019 version fails with:
step 200: The domain decomposition grid has shifted too much in the
Z-direction around cell 20 5 1.
This happens with GCC 7.1 or ICC 18u2, on both SKL and KNL. It also happens with GMX_DLB_BASED_ON_FLOPS=1 or -dds .75, and it doesn't matter whether -dd 128 20 1 or -dd 40 8 8 (the default) is used. It is fine with fewer ranks or with 2018.3.
CC=gcc CXX=g++ cmake .. -DGMX_MPI=on -DGMX_SIMD=AVX_512_KNL -DGMX_HWLOC=no
ibrun ~/gromacs/gcc7.3/bin/gmx_mpi mdrun -s lignocellulose-rf.tpr -nsteps 3000 -noconfout
I assume this uses update groups. I haven't changed much in the DLB code between 2018 and 2019; looking at the git log, I actually don't see any functional change at all. So this might be an old bug that gets triggered more easily with update groups, either because they enable more out-of-sync running or because they, I think, change the DLB limits.
You can at least check whether running with GMX_NO_UPDATEGROUPS set makes the issue go away.
Indeed, unit tests would be nice. I always thought it would be hard to test the strange conditions under which things can easily go wrong, but actually it is not so hard: just use a 2D and a 3D decomposition and put nearly all the load in one or two well-chosen cells.
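To make the test idea concrete, here is a rough sketch of the kind of skewed load layout such a test could feed in. It is a toy model in plain Python, not GROMACS code: the names skewed_load and imbalance, the 0.95 hot fraction, and the zero-based cell indices are all assumptions made for illustration. The grid shapes are the two -dd settings from the report.

```python
from itertools import product


def skewed_load(grid, hot_cells, hot_fraction=0.95):
    """Return {cell: load fraction} for a DD grid, with fractions summing to 1.

    Hypothetical helper: the hot cells share hot_fraction of the total
    load equally, and the remainder is spread evenly over all other cells.
    """
    cells = list(product(*(range(n) for n in grid)))
    hot = set(hot_cells)
    if not hot or not hot.issubset(cells) or len(hot) == len(cells):
        raise ValueError("hot_cells must be a non-empty proper subset of the grid")
    hot_share = hot_fraction / len(hot)
    cold_share = (1.0 - hot_fraction) / (len(cells) - len(hot))
    return {c: (hot_share if c in hot else cold_share) for c in cells}


def imbalance(load):
    """Maximum cell load divided by mean cell load; 1.0 is perfectly balanced."""
    mean = sum(load.values()) / len(load)
    return max(load.values()) / mean


# 2D decomposition (128 x 20 x 1) with a single heavily loaded cell.
load_2d = skewed_load((128, 20, 1), hot_cells=[(20, 5, 0)])

# 3D decomposition (40 x 8 x 8) with two neighbouring hot cells along Z.
load_3d = skewed_load((40, 8, 8), hot_cells=[(20, 5, 1), (20, 5, 2)])

print("2D imbalance:", imbalance(load_2d))
print("3D imbalance:", imbalance(load_3d))
```

A real test would pass a layout like this to the DLB code and check that the cell boundaries stay within the allowed limits; how to hook that up is left open here.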