Bug #1060

4.6-beta1 box implodes on GPU

Added by Carsten Kutzner over 6 years ago. Updated over 6 years ago.

Status: Closed
Priority: High
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

This protein in a water box runs fine (at least for >10 ns) when using the group cutoff scheme (see attached .tpr file "group.tpr"). However, when switching to Verlet ("verlet.tpr") and running on a GTX680 GPU, the volume of the box starts to shrink until the simulation finally crashes. See the attached .pdf file: red line is the box volume from "group.tpr", black from "verlet.tpr".
In another setting (no input files provided), where the protein is centered in the box at the beginning, everything runs fine until a part of the protein touches the box boundary; then the shrinking of the box begins.

energy.pdf (29.4 KB) Carsten Kutzner, 12/05/2012 10:45 AM
verlet.tpr (1.83 MB) Carsten Kutzner, 12/05/2012 10:45 AM
group.tpr (1.83 MB) Carsten Kutzner, 12/05/2012 10:45 AM
noPcoupl.tpr (1.83 MB) Carsten Kutzner, 12/05/2012 03:30 PM
noPcouplVrescale.tpr (1.83 MB) Carsten Kutzner, 12/05/2012 03:54 PM

Associated revisions

Revision 005b6a6f (diff)
Added by Berk Hess over 6 years ago

fixed incorrect virial with virtual sites and OpenMP

The virial was incorrect with more than 1 OpenMP thread
and virtual sites crossing a box boundary.
Fixes #1060

Change-Id: Ib659f6eb7719f4808b37e828b96526587f8a0e69
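
To illustrate why a box-boundary crossing can corrupt the virial while leaving the forces intact, here is a toy 1D sketch. It is not the GROMACS code and does not reproduce the actual bug; the function names, positions, box length and force value are all made up for illustration. It only shows that a virial summed over absolute positions needs a periodic shift correction to match the minimum-image pair virial.

```python
def minimum_image(dx, box):
    """Map a 1D separation onto its nearest periodic image."""
    return dx - box * round(dx / box)


def pair_virial(x1, x2, f1, box):
    """Reference virial from the minimum-image separation: -0.5 * dx * f1."""
    return -0.5 * minimum_image(x1 - x2, box) * f1


def single_sum_virial(x1, x2, f1, box, apply_shift):
    """Virial from absolute positions, -0.5 * sum_i x_i * f_i, with f2 = -f1.

    If apply_shift is True, add the correction for the periodic shift
    between the raw separation and its minimum image.
    """
    virial = -0.5 * (x1 * f1 + x2 * (-f1))
    if apply_shift:
        shift = minimum_image(x1 - x2, box) - (x1 - x2)
        virial += -0.5 * shift * f1
    return virial


# Hypothetical values: two particles that interact across the boundary.
box = 10.0
x1, x2, f1 = 0.1, 9.9, 2.0

reference = pair_virial(x1, x2, f1, box)
uncorrected = single_sum_virial(x1, x2, f1, box, apply_shift=False)
corrected = single_sum_virial(x1, x2, f1, box, apply_shift=True)
```

In this toy setup the force on each particle is identical in all three cases; only the virial, and therefore the pressure seen by the barostat, differs when the shift term is left out.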

History

#1 Updated by Carsten Kutzner over 6 years ago

This also happens when switching to Berendsen temperature and pressure coupling.

#2 Updated by Martin Hoefling over 6 years ago

So my previous git bisect had some problems.

5ba1edc6227f073edf2ad27297a8d078dc9a3d6d not crashed AND drift seems ok

while

86125d5e90a3cf3328d72d4e12bbdb03bc3b57d6 not crashed but drift is not ok

I will double check if this is the case...

#3 Updated by Szilárd Páll over 6 years ago

  • Category set to mdrun
  • Priority changed from Normal to High
  • Target version set to 4.6

Reproduced, both with and without PP-PME load balancing.

#4 Updated by Berk Hess over 6 years ago

This also happens with Verlet kernels on the CPU.
Could you attach a tpr without p-coupling, so I can check the pressure?

#6 Updated by Berk Hess over 6 years ago

It seems to run fine with a single thread (tested with only MPI threads). So Szilard's guess could be correct that it's in the bonded reduction. For using valgrind we should dump a checkpoint just before and/or after the jump and restart from there with valgrind.

#7 Updated by Szilárd Páll over 6 years ago

Berk Hess wrote:

It seems to run fine with a single thread (tested with only MPI threads). So Szilard's guess could be correct that it's in the bonded reduction. For using valgrind we should dump a checkpoint just before and/or after the jump and restart from there with valgrind.

As far as I can tell, the issue does not always get triggered at the same step, so restarting from the last checkpoint (I guess that's what you meant) might not trigger the bug immediately after the restart.

#8 Updated by Berk Hess over 6 years ago

Thanks Carsten, but could you change the tpr to V-rescale coupling, so that I can check energy conservation? Now that energy is conserved by default with the Verlet scheme, we can track down bugs much more easily. If the energy is conserved, the forces are fine and it's the virial. If energy is not conserved, the forces are incorrect.
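
The diagnostic rule above can be sketched as a small check on a conserved-energy time series. This is only an illustration: the function names, the drift tolerance and the sample values are hypothetical, and in practice the series would come from the energy file of a run without pressure coupling.

```python
def linear_drift(times, energies):
    """Least-squares slope of energy versus time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_e = sum(energies) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in zip(times, energies))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def suspect(times, conserved_energy, tolerance):
    """Apply the rule from the comment above to a run with a bad pressure.

    Energy conserved     -> forces are fine, suspect the virial.
    Energy not conserved -> suspect the forces.
    The tolerance is a hypothetical, user-chosen threshold.
    """
    drift = linear_drift(times, conserved_energy)
    return "virial" if abs(drift) <= tolerance else "forces"


# Synthetic example data, not taken from the attached runs.
times = [0.0, 1.0, 2.0, 3.0]
flat = [-100.0, -100.0, -100.0, -100.0]
drifting = [-100.0, -99.0, -98.0, -97.0]

flat_verdict = suspect(times, flat, tolerance=0.01)
drifting_verdict = suspect(times, drifting, tolerance=0.01)
```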

#10 Updated by Berk Hess over 6 years ago

Thanks. Energy is conserved, so the forces are fine and there is something wrong with the virial.

#11 Updated by Berk Hess over 6 years ago

It's the threaded virtual sites.
Running with OMP_NUM_THREADS=8 GMX_VSITE_NUM_THREADS=1 gives correct pressures.
But I don't see any obvious virial error there and the forces are correct as energy is conserved.
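
For anyone scripting this comparison, a minimal sketch of setting up the environment for the workaround run follows. The two variable names are the ones quoted above; the helper name and the mdrun command line in the comment are assumptions, not something stated in this report.

```python
import os


def vsite_single_thread_env(omp_threads=8, base_env=None):
    """Return a copy of the environment with virtual sites restricted to one thread.

    OMP_NUM_THREADS and GMX_VSITE_NUM_THREADS are the variables named in
    the comment above; everything else is copied from base_env unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(omp_threads)
    env["GMX_VSITE_NUM_THREADS"] = "1"
    return env


env = vsite_single_thread_env(8, base_env={"PATH": "/usr/bin"})

# Assumed usage (the exact mdrun command line is not given in this report):
#   import subprocess
#   subprocess.run(["mdrun", "-s", "noPcouplVrescale.tpr"], env=env)
```

Comparing the pressure from such a run against one with the default virtual-site threading isolates the threaded virtual-site code path without changing the OpenMP thread count used elsewhere.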

#12 Updated by Martin Hoefling over 6 years ago

So according to my last bisect

3e85f335235083cba75d75d3cbdf1217da5b23f5 not crashed AND drift seems ok
c61e2d0315a3f833376e624793c6ff4ee1907387 not crashed but drift is not ok
678937cb61ab4a8d24e702f00158d2023baee095 not crashed but drift is not ok

so

678937cb61ab4a8d24e702f00158d2023baee095 is the first bad commit
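
The search that git bisect performs can be sketched as a binary search over a linear history. This is a simplified illustration, not git's implementation: the commit labels and the is_bad predicate are hypothetical stand-ins for the real hashes and for the manual "is the drift ok?" check used in this thread.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] known bad, as when starting a bisection.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history in which the regression enters at "c4".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
bad_commits = {"c4", "c5", "c6"}

culprit = first_bad_commit(history, lambda commit: commit in bad_commits)
```

Each test halves the remaining range, which is why a handful of good/bad verdicts, like the three listed above, can be enough to pin down a single commit.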

#13 Updated by Berk Hess over 6 years ago

  • Status changed from New to Feedback wanted
  • Assignee set to Berk Hess

#14 Updated by Berk Hess over 6 years ago

  • Status changed from Feedback wanted to Closed
