Bug #1882

issue with DD missing impossible exclusions

Added by Mark Abraham almost 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

While adding some integration tests for exact restarts, I made an argon NVE setup that would run 10 steps, and then 10 more steps after restart, and exactly reproduce a single run of 20 steps. This worked well with a single rank with md, md-vv, bd and sd integrators. With 2 thread-MPI ranks, all worked well, except that the restarted run with md-vv gave an inexplicable error. I can get the same .tpr and .cpt to reproduce with plain mdrun:

gmx mdrun -s WithVariousIntegrators_MdrunContinuationIsExact_ArgonSimulation_1 -cpi WithVariousIntegrators_MdrunContinuationIsExact_ArgonSimulation_1_firsthalf.cpt -ntmpi 2

...
Reading file WithVariousIntegrators_MdrunContinuationIsExact_ArgonSimulation_1.tpr, VERSION 2016-dev-20151222-af63944 (single precision)
Can not increase nstlist because an NVE ensemble is used

Reading checkpoint file WithVariousIntegrators_MdrunContinuationIsExact_ArgonSimulation_1_firsthalf.cpt generated: Tue Dec 22 14:09:02 2015

GROMACS patchlevel, binary or parallel settings differ from previous run.
Continuation is exact, but not guaranteed to be binary identical.
See the log file for details.

Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

starting mdrun 'Argon'
20 steps,      0.0 ps (continuing from step 10,      0.0 ps).

A list of missing interactions:
          exclusions of      0 missing      1

-------------------------------------------------------
Program:     gmx mdrun, version 2016-dev-20151222-af63944
Source file: ../src/gromacs/domdec/domdec_topology.cpp (line 435)
MPI rank:    0 (out of 2)

Fatal error:
1 of the 0 bonded interactions could not be calculated because some atoms
involved moved further apart than the multi-body cut-off distance (-1 nm) or
the two-body cut-off distance (1.003 nm), see option -rdd, for pairs and
tabulated bonds also see option -ddcheck

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

This is an argon system, so there can never be an exclusion, never mind one to miss. Issue reproduces with both gcc 5.2 and clang 3.7. My first suspect was the recent patch to bump the state rvec arrays by one (somehow) but reverting that patch didn't solve the issue. I haven't got enough gear at the moment to look into the issue, so here's the .tpr and .cpt for when somebody does.
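The restart test described at the top of this report (10 steps, restart, 10 more steps, compared against one 20-step run) can be illustrated with a toy model. This is only a sketch under stated assumptions: the `State` class, the `run` function and the harmonic-oscillator system are all hypothetical stand-ins, not GROMACS code or the actual integration test.

```python
# Toy model of an exact-restart test. Nothing here is GROMACS code:
# State, run() and the 1D harmonic oscillator are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    step: int
    x: float  # position
    v: float  # velocity


def run(state: State, nsteps: int, dt: float = 0.01, k: float = 1.0) -> State:
    """Advance a 1D harmonic oscillator with velocity Verlet for nsteps."""
    step, x, v = state.step, state.x, state.v
    for _ in range(nsteps):
        a = -k * x
        v_half = v + 0.5 * dt * a
        x = x + dt * v_half
        a_new = -k * x
        v = v_half + 0.5 * dt * a_new
        step += 1
    return State(step, x, v)


initial = State(step=0, x=1.0, v=0.0)

# Reference: a single uninterrupted run of 20 steps.
reference = run(initial, 20)

# Restarted run: 10 steps, keep the state as a "checkpoint", then 10 more.
checkpoint = run(initial, 10)
continued = run(checkpoint, 10)

# Exact continuation means an identical final state, not an approximately equal one.
print(continued == reference)
```

The comparison only holds because the checkpoint carries the complete state and the restarted run performs the same sequence of operations; the real test additionally has to cope with domain decomposition and multiple ranks, which this sketch does not model.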

Associated revisions

Revision 8023fcdc
Added by Mark Abraham almost 4 years ago

Fix logic for DD missing-interactions check

The logic for the active part of this check got broken in 90aa9d65e0c3
before v4.5. Historically, there was no significant effect, because
all that patch changed is that we sum a double whose result won't
always be read. For all integrators, there's eventually a call where
CGLO_ENERGY is set and the sum is read, so the check is active and
works correctly. Thus, the worst-case result was that interaction(s)
were missing and not flagged until the next global-energies
communication step.

However, cleanup for #1793 moved the responsibility for the check out
to do_md, exposing the fact that for VV integrators (only),
CGLO_ENERGY isn't set for the "global signalling + leapfrog" call to
compute_globals. Thus the sum wasn't read, so the check failed when
totalNumberOfBondedInteractions still had the value -1 with which it
was initialized. Apparently, the regressiontests don't have enough
coverage of VV integrators to find this.

This also made me realise that this DD check has been inactive for
the calls to compute_globals after initial DD, DD after replica
exchange, and DD during reruns, because none of those cases set
CGLO_ENERGY.

Fixes #1882. Refs #1793.

Change-Id: I0b5cc448175873ec0e5cee3c3d5023654b4f1b27
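
The failure mode described in the commit message above can be sketched as a small model. It assumes only what the message states: the sum is always computed, it is read back only when CGLO_ENERGY is set, and the total starts at -1. The flag values, function names and signatures below are hypothetical and do not mirror the GROMACS source.

```python
# Simplified model of the flag-gated check. All names, flag values and
# signatures are illustrative assumptions, not the GROMACS implementation.
CGLO_ENERGY = 1 << 0  # hypothetical bit values
CGLO_SIGNAL = 1 << 1

UNSET = -1  # value the total is initialised with


def compute_globals(flags, local_counts, total):
    """Sum the per-rank counts, but only read the sum back if CGLO_ENERGY is set."""
    summed = sum(local_counts)  # the reduction itself always happens
    if flags & CGLO_ENERGY:
        return summed  # the result is read
    return total  # the result is dropped; the caller keeps its old value


def interactions_mismatch(total, expected):
    """Caller-side check: flag an error when the total differs from what is expected."""
    return total != expected


local_counts = [0, 0]  # argon on two ranks: nothing bonded, nothing excluded
expected = 0

# Path where a call with CGLO_ENERGY is made: the total is read and the check passes.
total = compute_globals(CGLO_ENERGY | CGLO_SIGNAL, local_counts, UNSET)
mismatch_ok = interactions_mismatch(total, expected)

# VV-like path: the signalling call lacks CGLO_ENERGY, so the total is still -1
# when the caller checks it, and a spurious mismatch is reported.
total_vv = compute_globals(CGLO_SIGNAL, local_counts, UNSET)
mismatch_vv = interactions_mismatch(total_vv, expected)

print(mismatch_ok, mismatch_vv)
```

In this model the spurious failure comes purely from comparing against a sentinel that was never overwritten, which is why a system with no interactions at all can still trip the check.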

Revision efa13a69
Added by Mark Abraham 3 months ago

Add integration tests for exact restarts

These tests demonstrate the extent to which mdrun checkpoint restarts
reproduce the same run that would have taken place without the
restart.

I've been working on these, and the bugs they exposed, for a few
years, but the code has been fixed for a few years now.

The tests don't run with OpenCL because they have caused driver out of
memory issues.

Refs #1137, #1793, #1882, #1883

Change-Id: I8bc441d945f13158bbe10f097e772ea87cc6a559

History

#1 Updated by Mark Abraham almost 4 years ago

This issue does seem to be specific to the md-vv integrator, somehow!

#2 Updated by Mark Abraham almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Mark Abraham

The origins of this issue are a long way in the past, but some of my recent cleanup has exposed it.

#3 Updated by Gerrit Code Review Bot almost 4 years ago

Gerrit received a related patchset '1' for Issue #1882.
Uploader: Mark Abraham
Change-Id: I0b5cc448175873ec0e5cee3c3d5023654b4f1b27
Gerrit URL: https://gerrit.gromacs.org/5525

#4 Updated by Mark Abraham almost 4 years ago

  • Status changed from In Progress to Resolved

#6 Updated by Mark Abraham almost 4 years ago

  • Status changed from Resolved to Closed
