Bug #1024

REMD and Verlet lists

Added by Mark Abraham almost 7 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Category:
mdrun
Difficulty:
uncategorized

Description

As far as I can see, the DD redistribution required after a successful replica exchange in REMD doesn't work with the new Verlet lists when there is more than one PP processor per replica. The same system runs perfectly if you change the cutoff scheme from Verlet to group, or use Verlet with one processor per replica.

A fatal error is generated in make_specat_communication() from both replicas that exchanged, e.g.:

DD cell 1 0 0: Neighboring cells do not have atoms: 2422 2032 2514 2065 2312 270 412 2210 2374 54 2298 2557 2414 2247

DD cell 0 0 0: Neighboring cells do not have atoms: 341 2607 361 512 2461 2574 173 198 87 2032 2005 2348

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-dev-20121012-6a9499a-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 704

Fatal error:
DD cell 0 0 0 could only obtain 71 of the 83 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

"I Had So Many Problem, and Then I Got Me a Walkman" (F. Black)

Error on node 8, will try to stop all the nodes
Halting parallel program mdrun_mpi on CPU 8 out of 12

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-dev-20121012-6a9499a-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 704

Fatal error:
DD cell 1 0 0 could only obtain 72 of the 86 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

This combination of observations makes me think something is not being triggered to start from a clean slate after an exchange with Verlet lists, but I don't yet know enough to say what; a rough sketch of the flow I have in mind follows below. I observed this with patch set 9 of https://gerrit.gromacs.org/#/c/1426/, and also with release-4-6 commit 11b5a654b19d9a3 (except that replica exchange is broken there and replicas always attempt to exchange). That is, the problem is external to the changes of https://gerrit.gromacs.org/#/c/1426/.
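Roughly, the flow I have in mind looks like this. This is only an approximate outline of the 4.6 md loop; the function names are real, but the argument lists are abbreviated and the structure is simplified:

/* Approximate outline of the REMD step in the 4.6 md loop; the
 * argument lists are abbreviated and the control flow simplified. */
bExchanged = replica_exchange(fplog, cr, repl_ex, state_global, enerd,
                              state, step, t);
if (bExchanged && DOMAINDECOMP(cr))
{
    /* The exchanged global state has to be redistributed, i.e. the
       domains are rebuilt from the master state. */
    dd_partition_system(fplog, step, cr, TRUE /* bMasterState */,
                        ... /* remaining arguments omitted */);
}
/* The fatal error quoted above is raised further down this path, while
   the constraint communication for the new domains is being set up
   (make_specat_communication() in domdec_con.c). */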

redmine1246.tbz (367 KB), Mark Abraham, 10/12/2012 03:39 AM

Associated revisions

Revision be7e30fa (diff)
Added by Berk Hess over 6 years ago

fixed resetting states with parallel Verlet scheme

The simulation state might need to be reset with e.g. REMD
or energy minimization. With the Verlet scheme in parallel
the contents of one outdated array was used, leading to gmx_fatal.
Fixes #1024

Change-Id: I24d07ea7dfd41e9689dd083cafbf778a7c8033fd
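As an illustration of the pattern the commit message describes, a hedged sketch follows; it is not the actual change in be7e30fa, and both helper names are hypothetical:

/* Hedged sketch of the pattern described above, NOT the actual diff:
 * data cached from the previous domain decomposition must not be
 * consumed when the local state is rebuilt from the master state
 * (e.g. after a replica exchange). Both helpers are hypothetical. */
if (bMasterState)
{
    /* Discard bookkeeping that still refers to the pre-exchange
       decomposition and rebuild it from the redistributed state. */
    rebuild_local_dd_data(dd, state_local);    /* hypothetical */
}
else
{
    /* Only an incremental repartition may reuse the cached data. */
    update_local_dd_data(dd, state_local);     /* hypothetical */
}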

History

#1 Updated by Mark Abraham almost 7 years ago

*begin=>being

Attached .tpr files. Sample command line:

mpirun -np 12 mdrun_mpi -multi 6 -deffnm spc900-sim -replex 10 -debug 1

#2 Updated by Berk Hess over 6 years ago

Could you try whether adding a conditional on line 9265 of domdec.c helps:

case ecutsVERLET:
    /* Only query the old cell grid on an incremental repartition;
       skip it when rebuilding from the master state. */
    if (!bMasterState)
        nbnxn_get_ncells(fr->nbv->nbs, &ncells_old[XX], &ncells_old[YY]);
    break;

#3 Updated by Mark Abraham over 6 years ago

No, it doesn't.

The problem occurs right after the replicas exchange, and bMasterState is true then. I think something is persisting from the old decomposition, because the lists of which cells have which atoms are out of date according to the error message (a sketch of why the suggested guard cannot affect this case follows after the log below). From current release-4-6 plus the above patch, on replica 0:

Replica exchange at step 2000 time 4
Repl 0 <-> 1  dE_term =  5.849e-01 (kT)
Repl ex  0 x  1    2    3    4    5
Repl pr   .56       .11       .08

DD  step 1999 load imb.: force  8.3%

Charge group distribution at step 2000: 678 693 680 649

-------------------------------------------------------
Program mdrun_mpi, VERSION 4.6-beta3-dev-20130111-f3dd8cb-dirty-local
Source code file: /home/224/mxa224/git/r46/src/mdlib/domdec_con.c, line: 722

Fatal error:
DD cell 0 0 0 could only obtain 90 of the 99 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order.
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Replica 1 reads about the same. Higher replicas, which did not exchange, have no problem.
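To spell out the reasoning, here is a sketch using the names from the snippet in comment #2:

/* With the guard from comment #2 in place, the repartition that follows
   an exchange has bMasterState == TRUE, so this call is skipped... */
if (!bMasterState)
{
    nbnxn_get_ncells(fr->nbv->nbs, &ncells_old[XX], &ncells_old[YY]);
}
/* ...yet the fatal error in make_specat_communication() still occurs,
   so the outdated per-domain data must be consumed somewhere else on
   the bMasterState path. */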

#4 Updated by Berk Hess over 6 years ago

  • Status changed from New to Closed
