Previously OK parallel PME run crashes since 69470f
A ligand-gated ion channel system that ran fine in 2016.3 now crashes after ~180k steps. The crash first appears with Berk's restructuring of the leap-frog integrators in 69470f, and the error message reports particles more than 2/3 out of the cell during PME communication. The commit just before 69470f works fine.
Details in the attached log, running CPU-only on dev-purley02 in reproducible mode.
Setting to high priority, since this could be a silent bug in runs where it doesn't crash.
Clear vsite velocities for simple integrators
The simple integrator loops (introduced in 69470fc4) do not clear
the velocities of virtual sites. This allows velocities of virtual
sites to slowly increase over time. To prevent this, velocities
of virtual sites are now cleared in a separate loop.
#2 Updated by Berk Hess over 1 year ago
With the exact continuation patch I got a checkpoint that quickly crashes. Virtual sites in propofol are flying away. I am now running it on my laptop without DD and see that, just before the crash, several (but not all) virtual-site hydrogens are PBC-shifted by one unit cell in different directions. I don't yet know what causes this, but it is a clear artefact, so I should be able to track it down.
#6 Updated by Berk Hess over 1 year ago
- Status changed from New to Fix uploaded
- Assignee set to Berk Hess
The issue is that the new simple integrator loops no longer clear the velocities of virtual sites, and they integrate the vsite coordinates. This can make the velocities of virtual sites run away until the displacement in one step is of the size of the box, at which point the vsite shifts by a unit cell.
Solved by adding a separate clearing loop, for which I measured zero overhead.
Note that this likely did not cause silent errors, but immediate crashes.
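For readers following along, here is a minimal sketch of the pattern described above: a branch-free leap-frog loop that updates every particle, followed by a separate pass that zeroes the velocities of virtual sites. This is not the GROMACS source. `ParticleType`, `Vec3`, `leapFrogSimple` and `clearVSiteVelocities` are hypothetical names chosen for illustration, and the sketch assumes a per-particle type array is available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle tag for this sketch; not the GROMACS enum.
enum class ParticleType { Atom, VSite };

struct Vec3 { double x, y, z; };

// Simple leap-frog update without any per-particle branching.
// Every particle, including virtual sites, is integrated here.
void leapFrogSimple(std::vector<Vec3>& x, std::vector<Vec3>& v,
                    const std::vector<Vec3>& f,
                    const std::vector<double>& invMass, double dt)
{
    assert(x.size() == v.size() && x.size() == f.size() && x.size() == invMass.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        v[i].x += f[i].x * invMass[i] * dt;
        v[i].y += f[i].y * invMass[i] * dt;
        v[i].z += f[i].z * invMass[i] * dt;
        x[i].x += v[i].x * dt;
        x[i].y += v[i].y * dt;
        x[i].z += v[i].z * dt;
    }
}

// Separate pass: reset virtual-site velocities so they cannot accumulate
// from step to step. Keeping this outside the main loop leaves the
// integration loop free of branches.
void clearVSiteVelocities(std::vector<Vec3>& v,
                          const std::vector<ParticleType>& ptype)
{
    assert(v.size() == ptype.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (ptype[i] == ParticleType::VSite) {
            v[i] = Vec3{0.0, 0.0, 0.0};
        }
    }
}
```

Calling `clearVSiteVelocities` after each `leapFrogSimple` step means a virtual site starts every step with zero velocity, so its velocity cannot build up over many steps. Any non-zero force or inverse mass assigned to a virtual site when exercising this sketch is purely illustrative.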