Bug #2316

Previously OK parallel PME run crashes since 69470f

Added by Erik Lindahl almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info: Master branch since 20161012
Affected version:
Difficulty: uncategorized

Description

A ligand-gated ion channel system that ran fine in 2016.3 now crashes after ~180k steps. The crash first appears with Berk's restructuring of the leap-frog integrators in 69470f, and the error message reports particles more than 2/3 out of the cell during PME communication. The commit just before works fine.

Details in the attached log, running CPU-only on dev-purley02 in reproducible mode.

Setting to high priority, since this could be a silent bug when it doesn't crash.

pr.edr (7.61 KB), Erik Lindahl, 11/30/2017 11:24 PM
pr.xtc (590 KB), Erik Lindahl, 11/30/2017 11:24 PM
pr.log (26.8 KB), Erik Lindahl, 11/30/2017 11:24 PM
pr.cpt (3.66 MB), Erik Lindahl, 11/30/2017 11:24 PM
pr.tpr (10.4 MB), Erik Lindahl, 11/30/2017 11:24 PM

Associated revisions

Revision cf5e082b (diff)
Added by Berk Hess almost 2 years ago

Clear vsite velocities for simple integrators

The simple integrator loops (introduced in 69470fc4) do not clear
the velocities of virtual sites. This allows velocities of virtual
sites to slowly increase over time. To prevent this, velocities
of virtual sites are now cleared in a separate loop.

Fixes #2316

Change-Id: I12ff0fae2cd3c45ad4e63bfeccfc8c88505cdb1e

History

#1 Updated by Erik Lindahl almost 2 years ago

Although the crash is reproducible from the start of the run (warnings at step 182490, crash six steps later), it is not reproducible when restarting from the checkpoint, even with the -reprod flag present in both cases.

#2 Updated by Berk Hess almost 2 years ago

With the exact continuation patch I got a checkpoint that quickly crashes. Virtual sites in propofol are flying away. I am now running it on my laptop without DD and see that, just before the crash, several (but not all) virtual-site hydrogens are PBC-shifted by one unit cell in different directions. I don't yet know what causes this, but it is a clear artefact, so I should be able to find the cause.

#3 Updated by Berk Hess almost 2 years ago

The issue seems to be that the constraints code modifies vsite coordinates, which should never happen. I will debug this further later today.

#4 Updated by Erik Lindahl almost 2 years ago

No rush; I need to head out with the kids for a few hours, but tonight I'll be working. I'll also try to get the host with a single AVX-512 FMA up and running to test that code.

#5 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #2316.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2018~I12ff0fae2cd3c45ad4e63bfeccfc8c88505cdb1e
Gerrit URL: https://gerrit.gromacs.org/7267

#6 Updated by Berk Hess almost 2 years ago

  • Status changed from New to Fix uploaded
  • Assignee set to Berk Hess

The issue is that the new simple integrator loops no longer clear the velocities of virtual sites, and they integrate the virtual-site coordinates. This can make the velocities of virtual sites run away until the displacement in one step is the size of the box, at which point the vsite shifts by a unit cell.
Solved by adding a separate clearing loop, for which I measured no overhead.
Note that this likely caused immediate crashes rather than silent errors.
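
As a rough illustration of the mechanism described above, here is a toy Python sketch. It is not GROMACS code: the function names, the 1D two-particle setup, the constant residual force `kick`, and the unit inverse mass are all assumptions made purely for illustration. It only shows how a velocity that is never reset accumulates when the update loop treats every particle alike, and how a separate clearing loop prevents that.

```python
def leapfrog_step(x, v, f, inv_mass, dt):
    """Plain leap-frog update applied to every particle, with no particle-type check."""
    for i in range(len(x)):
        v[i] += f[i] * inv_mass[i] * dt
        x[i] += v[i] * dt


def clear_vsite_velocities(v, is_vsite):
    """Separate loop that zeroes the velocities of virtual sites."""
    for i, flag in enumerate(is_vsite):
        if flag:
            v[i] = 0.0


def run(n_steps, clear_vsites, dt=0.002, kick=1e-3):
    """Return the final velocities of a toy system (all values hypothetical)."""
    x = [0.0, 0.5]            # particle 0: normal atom, particle 1: stand-in vsite
    v = [0.0, 0.0]
    is_vsite = [False, True]
    inv_mass = [1.0, 1.0]     # toy values, chosen so the site responds to f
    f = [0.0, kick]           # assumed constant residual force on the site
    for _ in range(n_steps):
        x[1] = x[0] + 0.5     # the vsite position is rebuilt from its parent atom
        leapfrog_step(x, v, f, inv_mass, dt)
        if clear_vsites:
            clear_vsite_velocities(v, is_vsite)
    return v


v_without = run(1000, clear_vsites=False)
v_with = run(1000, clear_vsites=True)
print("vsite velocity without clearing:", v_without[1])
print("vsite velocity with clearing:   ", v_with[1])
```

In this toy model the uncleared velocity grows linearly with the number of steps, so the per-step displacement `v * dt` eventually becomes arbitrarily large, while the cleared variant keeps the site velocity at zero after every step.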

#7 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Fix uploaded to Resolved

Much appreciated fix, Berk!

#8 Updated by Erik Lindahl almost 2 years ago

  • Status changed from Resolved to Closed
