Bug #317

mdrun -rerun works incorrectly for multiple processors and domain decomposition

Added by Mark Abraham over 10 years ago. Updated over 10 years ago.

Status: Closed
Priority: Normal
Assignee: Erik Lindahl
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Using 4.0.4 to do reruns on the same positions-only NPT peptide+water trajectory with the same run input file:

a) compiled without MPI, a single-processor rerun worked correctly, including "zero" KE and temperature at each frame

b) compiled with MPI, a single-processor rerun worked correctly, including zero KE and temperature, and agreed with a) within machine precision

c) compiled with MPI, a 4-processor rerun worked incorrectly: an approximately correct temperature and a plausible positive KE were reported, all PE terms were identical to within about machine precision to the first step of a) and b), and the reported pressure was different. The KE reported was the same at each step.

d) compiled with MPI, a 4-processor run using particle decomposition worked correctly, agreeing with a).

Similar observations were made on gmx-users recently. See http://www.gromacs.org/pipermail/gmx-users/2009-April/041408.html and http://www.gromacs.org/pipermail/gmx-users/2009-April/041387.html

Thus it seems that a multi-processor mdrun is not updating the structure for subsequent steps in the loop over structures, and/or is picking up some KE from somewhere that a single-processor calculation does not.

From stepping through a run, I think the rerun DD problem arises as follows: in do_md(), a rerun loads the data from the rerun trajectory into rerun_fr and later copies it into state, not into state_global. state_global is initialized from the .tpr file (which has velocities) and is used for the DD initialization, but it is never subsequently updated. So for each rerun step the same .tpr state gets propagated, which leads to all the symptoms I describe above. The KE comes from the velocities in the .tpr file, and is thus constant. Single-processor and PD runs work fine, because they either use only state, or state is equivalent to state_global in those cases.

So, a preliminary work-around is to use mdrun -rerun -pd to get particle decomposition with multiple processors.
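A hedged example of that workaround (the mdrun_mpi binary name, the MPI launcher, and the file names are placeholders, not taken from the actual runs):

mpirun -np 4 mdrun_mpi -s topol.tpr -rerun traj.xtc -pd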

I tried to hack a fix for the DD code. It seemed that using

for (i = 0; i < state_global->natoms; i++)
{
    copy_rvec(rerun_fr.x[i], state_global->x[i]);
}

before about line 1060 of do_md() in src/kernel/md.c should do the trick, since with bMasterState set for a rerun, dd_partition_system() should propagate state_global to the right places. However, I got a segfault in that copy_rvec with i == 0, despite state_global->x being allocated and of the right dimensions according to TotalView's memory debugger. That made no sense :-(

OK, now it's over to someone who understands the DD implementation better!

dd_rerun.patch (2.63 KB): patch to fix DD mdrun -rerun bug. Mark Abraham, 04/25/2009 08:58 AM
dd_rerun.patch (2.64 KB): updated version of patch. Mark Abraham, 04/30/2009 03:17 PM

Related issues

Has duplicate: GROMACS Bug #318: rerun produces inconsistent values for (shifted) virial when external force is used (Closed, 04/27/2009)

History

#1 Updated by Mark Abraham over 10 years ago

Created an attachment (id=364)
patch to fix DD mdrun -rerun bug

#2 Updated by Mark Abraham over 10 years ago

OK, I have a fix.

Under DD, non-DDMASTER processes should skip the step that copies data structures from rerun_fr to state_global. The bounds of the various copy loops should be state_global->natoms, not mdatoms->nr (which is smaller than state_global->natoms under DD, but equal under other conditions). So attached is a patch file to be applied from src/kernel with

patch md.c < dd_rerun.patch
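For readers who just want the gist without applying the patch, here is a minimal sketch of the copy described above; the DOMAINDECOMP()/DDMASTER() guard, the bV/v handling, and the surrounding control flow are my reading of the 4.0-era code, not the literal contents of the patch:

/* Only the DD master (or any rank in a non-DD run, where state and
 * state_global coincide) fills state_global from the rerun frame;
 * dd_partition_system() with bMasterState then redistributes it.
 * Note the loop bound is state_global->natoms, not mdatoms->nr. */
if (!DOMAINDECOMP(cr) || DDMASTER(cr->dd))
{
    for (i = 0; i < state_global->natoms; i++)
    {
        copy_rvec(rerun_fr.x[i], state_global->x[i]);
    }
    if (rerun_fr.bV)
    {
        /* velocities, when present in the rerun frame */
        for (i = 0; i < state_global->natoms; i++)
        {
            copy_rvec(rerun_fr.v[i], state_global->v[i]);
        }
    }
}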

I tested single-processor with and without MPI, four processors with DD and no separate PME nodes, four processors with PD, and 16 processors with DD and 5 PME nodes, and got agreement in the leading significant figures reported in the .log file for all cases compared with single-processor plain 4.0.4.

I had a quick look at CVS releases-4.0-patches and, since this bit of do_md() looks the same, the patch can probably be applied there and/or to the head 4.1 branch painlessly.

#3 Updated by Mark Abraham over 10 years ago

Created an attachment (id=366)
updated version of patch

I removed a C++ style comment in favour of a C one.

#4 Updated by Berk Hess over 10 years ago

I fixed this bug, but in a somewhat different way from the attached patch.
There were some more complicating issues, also related to bugzilla 318,
especially when velocities were present.
I have now turned off the vsite coordinate regeneration, since that is complicated
to do with -rerun under DD. There is a hidden switch, -rerunvsite, to turn it on again.

Berk
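For reference, a hedged usage sketch of that hidden switch (the .tpr/.xtc names are placeholders):

mdrun -s topol.tpr -rerun traj.xtc -rerunvsite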
