Project

General

Profile

Bug #1351

Simulations which are stable on Gromacs 4.5.3 are crashing on Gromacs 4.6.3.

Added by Floris van Eerden about 4 years ago. Updated almost 4 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
mdrun
Target version:
-
Affected version - extra info:
Affected version:
Difficulty:
uncategorized
Close

Description

I have systems which I can run without problem with GROMACS 4.5.5, but crash sooner or later with Gromacs 4.6.3.
The system is a protein monomer (Photosystem II) embedded in a thylakoid membrane, of which some of the lipids contain constraints. Everything is in a hexagonal box.
In this particular simulation all the charges of the protein are set to zero. But the crashes also occur when I try to simulate the dimer, with all charges at their normal value.
I tried running it with domain decomposition, particle decomposition or on one thread, but in all cases i get sooner or later (sometimes the simulation runs quite some time) LINCS erros or a segmentation fault.

Sincerly

Floris

runfiles (9.53 MB) runfiles Floris van Eerden, 10/04/2013 01:38 PM
membrane (2.2 MB) membrane Floris van Eerden, 10/09/2013 08:36 PM
NVT_noMembrane (3.61 MB) NVT_noMembrane Floris van Eerden, 10/11/2013 03:56 PM

Associated revisions

Revision d5af279d (diff)
Added by Berk Hess about 4 years ago

avoid division by zero in SIMD angles and dihedrals

The SIMD accelerated angle and dihedral code did not (correctly)
check for dividing by zero, which can happen with aligned bonds.
Fixes #1351

Change-Id: I326f90fca87ab5cca493204de4a58655465634ca

History

#1 Updated by Floris van Eerden about 4 years ago

PS It is a Coarse Grained Martini system

#2 Updated by Mark Abraham about 4 years ago

  • Status changed from New to Accepted

Indeed, your startmd10tpr.tpr ran for 40ns with 4.5.7 and mdrun -rdd 1.8, and died with release-4-6 branch almost immediately.

I am trying a bisection to narrow down the point in the commit history that the crash first becomes reproducible, but that will take considerable time because success can take hours to measure! That said, there's a half dozen commits that are known offenders for latent bugs, so probably I will identify one of them and further diagnosis will be required.

Simpler examples of simulations that are stable or not (i.e. with respect to segfault) would be very useful - e.g. is your membrane stable (or not) on its own? Are any of the protein sub-units stable in water on their own? Is one of them fully representative of the complexity of the interactions present in any of them? I'm not worried if a protein sub-unit that isn't stable in the folding sense - just whether it is useful for probing what the numerical instability is.

#3 Updated by Floris van Eerden about 4 years ago

Hello,

I am happy to read that you got the same error! So it is not just me ;)

About other systems:
The only other system which I ran on Gromacs 4.6.3 is the PSII dimer (with all charges normal) embedded in the thylakoid membrane (so an even bigger system) and that one was also not stable.

On Gromacs 4.5.5 the PSII dimer is stable at least up to 2 micro seconds, and the membrane itself also. But I do not have any of these simulations with Gromacs 4.6.3. I will try to run a patch of membrane overnight with Gromacs 4.6.3 and give you the result tomorrow.

Floris

#4 Updated by Mark Abraham about 4 years ago

Thanks! I have tentatively identified the commit that introduced the Verlet cut-off scheme as the source of the problem. That commit is one of the serial offenders, and is relatively good news, because it should have had very little impact on your simulation. That might make the next stage easier.

#5 Updated by Floris van Eerden about 4 years ago

I did simulations of a membrane without protein, and also this system crashed really quick, see attachment.
In the coming days I will try simulations of only the protein, without lipids.

#6 Updated by Mark Abraham about 4 years ago

Thanks - that's already a very helpful simplification.

#7 Updated by Mark Abraham about 4 years ago

Good news - running the first simulation with no SIMD at all also ran for 40ns (so the problem is with a SIMD path). That patch was nasty and introduced some SIMD to bonded interactions, as well as the Verlet cut-off scheme.

With non-bonded SIMD only: ran > 2ns.
With SIMD everywhere except proper dihedrals: run > 0.5ns
With SIMD everywhere except force-only proper dihedrals: run > 1ns.
Otherwise, crash.

This suggests that with your large time steps, something like that some atoms are going close enough to co-linear that the SIMD routines are returning a garbage value. We do have some checks for this kind of thing, but it would seem we have missed one.

#8 Updated by Berk Hess about 4 years ago

Mark, it seems you didn't check the SIMD angles, or did you?

My guess was that the problem is in there, since in Martini angles could be 180 degrees.
Could one of you try the change below, which should fix this issue. If it works I will submit a proper, clean fix.

--- a/src/gmxlib/bondfree.c
+++ b/src/gmxlib/bondfree.c
@@ -1141,6 +1141,14 @@ angles_noener_simd(int nbonds,

         cos_S     = gmx_mul_pr(rij_rkj_S, gmx_mul_pr(nrij_1_S, nrkj_1_S));

+        {
+            gmx_mm_pr minone_S;
+
+            minone_S = gmx_set1_pr(-1.0);
+
+            cos_S = gmx_max_pr(cos_S, minone_S);
+        }
+
         theta_S   = gmx_acos_pr(cos_S);

         sin_S     = gmx_invsqrt_pr(gmx_max_pr(gmx_sub_pr(one_S, gmx_mul_pr(cos_S, cos_S)),

#9 Updated by Berk Hess about 4 years ago

The angles are not the problem, but rereading Mark's comment I think he means to say that the issues are in the SIMD dihedral code.

But the dihedral code looks very robust. There are is only one division, which involves the distance between atoms j and k. But in Martini atoms j and k can not end up on top of each other, can they?

#10 Updated by Mark Abraham about 4 years ago

By "SIMD everywhere" I meant including SIMD angles (ie. by hacking the #ifdef SIMD_BONDED suitably), but I will try your hack anyway.

#11 Updated by Mark Abraham about 4 years ago

Yes - my diagnosis so far is that the issue is shown by pdihs_noener_simd(), because if the only change I make to the code is to call its fallback routine pdihs_noener(), then I get the > 1ns run I reported earlier.

#12 Updated by Mark Abraham about 4 years ago

Tried Berk's angle hack, still crashed around 24000 steps.

#13 Updated by Berk Hess about 4 years ago

  • Status changed from Accepted to Resolved

The SIMD dihedral handled 180 degree bond angles incorrectly.I uploaded a fix:
https://gerrit.gromacs.org/#/c/2697/

#14 Updated by Floris van Eerden about 4 years ago

Hello,
I am happy that you found the problem.
As I said I would, I am also doing a simulation without the membrane (the protein has however some co-crystalized lipids which I left), with NVT. And now the simulation already ran for 329800 steps. Would anybody have an idea while it does not crash in this case?

#15 Updated by Berk Hess about 4 years ago

It will only crash when two adjacent bonds in a dihedral are (nearly) co-linear. The chance of this happening is very small and it could be that that only happens in Martini lipids.

#16 Updated by Floris van Eerden about 4 years ago

Thanks for the reply.
It seems indeed the case, because the previous simulation also seems to work well with pressure coupling.

#17 Updated by Mark Abraham about 4 years ago

I have the protein+membrane stable with Berk's fix on gerrit for > 4ns so far. Will run the full 40 ns

#18 Updated by Berk Hess about 4 years ago

  • % Done changed from 0 to 100

#19 Updated by Rossen Apostolov almost 4 years ago

  • Status changed from Resolved to Closed

Also available in: Atom PDF