Bug #1024
REMD and Verlet lists
Description
As far as I can see, the DD redistribution required after a successful replica exchange in REMD doesn't work with the new Verlet lists when more than one PP processor is used. The same system runs perfectly if you change the cutoff scheme from Verlet to group, or use Verlet with one PP processor per replica.
A fatal error is generated in make_specat_communication() by both replicas that exchanged, e.g.:
DD cell 1 0 0: Neighboring cells do not have atoms: 2422 2032 2514 2065 2312 270 412 2210 2374 54 2298 2557 2414 2247
DD cell 0 0 0: Neighboring cells do not have atoms: 341 2607 361 512 2461 2574 173 198 87 2032 2005 2348

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-dev-20121012-6a9499a-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 704

Fatal error:
DD cell 0 0 0 could only obtain 71 of the 83 atoms that are connected
via constraints from the neighboring cells. This probably means your
constraint lengths are too long compared to the domain decomposition
cell size. Decrease the number of domain decomposition grid cells or
lincs-order.
For more information and tips for troubleshooting, please check the
GROMACS website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"I Had So Many Problem, and Then I Got Me a Walkman" (F. Black)

Error on node 8, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 8 out of 12

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-dev-20121012-6a9499a-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 704

Fatal error:
DD cell 1 0 0 could only obtain 72 of the 86 atoms that are connected
via constraints from the neighboring cells. This probably means your
constraint lengths are too long compared to the domain decomposition
cell size. Decrease the number of domain decomposition grid cells or
lincs-order.
For more information and tips for troubleshooting, please check the
GROMACS website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
This combination of observations makes me think something is not being triggered to start from a clean slate after an exchange with Verlet lists, but I don't yet know enough to say what. I observed this with patch set 9 of https://gerrit.gromacs.org/#/c/1426/, and also with release-4-6 commit 11b5a654b19d9a3 (except that replica exchange is broken there and replicas always attempt to exchange). That is, the problem is external to the changes of https://gerrit.gromacs.org/#/c/1426/.
History
#1 Updated by Mark Abraham over 8 years ago
- File redmine1246.tbz added
- Description updated (typo: begin => being)
Attached .tpr files. Sample command line:
mpirun -np 12 mdrun_mpi -multi 6 -deffnm spc900-sim -replex 10 -debug 1
#2 Updated by Berk Hess about 8 years ago
Could you try whether adding a conditional on line 9265 of domdec.c helps:
case ecutsVERLET:
    if (!bMasterState)
    {
        nbnxn_get_ncells(fr->nbv->nbs, &ncells_old[XX], &ncells_old[YY]);
    }
    break;
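For readers without the tree at hand: this call sits in a switch over fr->cutoff_scheme inside dd_partition_system(), which caches the size of the previous local state before repartitioning; as far as I recall, the old and new nbnxn grid sizes are later compared to decide how much resorting is needed. A sketch of the surrounding code with the suggested guard applied; the ecutsGROUP branch and any names other than those in the snippet above are from memory, not from this thread, and may differ in detail:

/* Sketch of the surrounding switch in dd_partition_system(); the
 * ecutsGROUP branch is reproduced from memory, not from this thread. */
switch (fr->cutoff_scheme)
{
    case ecutsGROUP:
        /* Size of the old local state for the group scheme */
        ncg_home_old = dd->ncg_home;
        break;
    case ecutsVERLET:
        /* Old nbnxn grid size; with the suggested guard it is only
         * read when we are not rebuilding from the master state. */
        if (!bMasterState)
        {
            nbnxn_get_ncells(fr->nbv->nbs, &ncells_old[XX], &ncells_old[YY]);
        }
        break;
    default:
        break;
}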
#3 Updated by Mark Abraham about 8 years ago
No, it doesn't.
The problem occurs right after replicas exchange, and bMasterState is TRUE at that point (the relevant code path is sketched at the end of this comment). I think something is persisting from the old decomposition, because according to the error message the lists of which cells have which atoms are out of date. From current release-4-6 plus the above patch, on replica 0:
Replica exchange at step 2000 time 4
Repl 0 <-> 1  dE_term = 5.849e-01 (kT)
Repl ex  0 x  1    2    3    4    5
Repl pr   .56       .11       .08

DD  step 1999  load imb.: force  8.3%

Charge group distribution at step 2000: 678 693 680 649

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-beta3-dev-20130111-f3dd8cb-dirty-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 722

Fatal error:
DD cell 0 0 0 could only obtain 90 of the 99 atoms that are connected
via constraints from the neighboring cells. This probably means your
constraint lengths are too long compared to the domain decomposition
cell size. Decrease the number of domain decomposition grid cells or
lincs-order.
For more information and tips for troubleshooting, please check the
GROMACS website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Replica 1 reads about the same. Higher replicas, which did not exchange, have no problem.
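For reference, a paraphrase of the code path in question from the 4.6 main MD loop (src/kernel/md.c); the argument list is written from memory and may not match the tree exactly:

bExchanged = replica_exchange(fplog, cr, repl_ex, state_global,
                              enerd, state, step, t);
if (bExchanged && DOMAINDECOMP(cr))
{
    /* bMasterState = TRUE: the swapped coordinates live in
     * state_global, so the decomposition is rebuilt from the
     * master state instead of being updated incrementally. */
    dd_partition_system(fplog, step, cr, TRUE, 1,
                        state_global, top_global, ir,
                        state, &f, mdatoms, top, fr,
                        vsite, shellfc, constr,
                        nrnb, wcycle, FALSE);
}

Only replicas with bExchanged TRUE take this branch, which matches the observation that the replicas that did not exchange are unaffected.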
#4 Updated by Berk Hess about 8 years ago
- Status changed from New to Closed
fixed resetting states with parallel Verlet scheme
The simulation state might need to be reset, e.g. with REMD
or energy minimization. With the Verlet scheme in parallel,
the contents of an outdated array were used, leading to a
gmx_fatal error.
Fixes #1024
Change-Id: I24d07ea7dfd41e9689dd083cafbf778a7c8033fd
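The patch body is not quoted in this thread. Given comment #3, guarding the nbnxn_get_ncells() call alone was not sufficient, so the fix presumably also invalidates cached data that still described the old decomposition. Purely as an illustration of that pattern, not the actual change, with the -1 sentinels and the reuse of ncg_home_old as assumptions:

/* Illustration only, not the actual commit: on a master-state reset
 * (replica exchange, energy minimization), nothing cached from the
 * previous partitioning is valid, so mark it invalid to force a
 * full redistribution and resort. */
if (bMasterState)
{
    ncg_home_old   = -1;   /* hypothetical "invalid" sentinel */
    ncells_old[XX] = -1;
    ncells_old[YY] = -1;
}
else
{
    nbnxn_get_ncells(fr->nbv->nbs, &ncells_old[XX], &ncells_old[YY]);
}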