Bug #1128

SEGV in nonbonded kernels during FEP foreign lambda calculations

Added by Sander Pronk over 7 years ago. Updated about 4 years ago.

Status: Closed
Priority: Normal
Category: mdrun
Target version: -
Affected version - extra info: 4.6.0
Affected version:
Difficulty: uncategorized

Description

During foreign lambda force/potential evaluation, there is an intermittent SEGV in mdrun. This appears to happen with all CPU acceleration settings, and only if there is COM pulling and the energy differences are zero (fep-lambdas is something like "1.0 1.0 1.0 ...").

I believe this is a FEP issue because it happens with all kernel types and is sensitive to the exact FEP settings (any non-zero delta H appears to fix this, but I'm not 100% sure because the crash is not reliably reproducible).

A stack trace is:
#0 0x00007ffff70f5ef3 in nb_kernel_ElecEw_VdwLJ_GeomW3W3_F_c (
nlist=<optimized out>, xx=0x7fffe80c67d0, ff=0x7fffe80a6e40,
fr=<optimized out>, mdatoms=<optimized out>, kernel_data=<optimized out>,
nrnb=0x7fffe801ec60)
at /nethome/sander/git/gromacs-4.6/src/gmxlib/nonbonded/nb_kernel_c/nb_kernel_ElecEw_VdwLJ_GeomW3W3_c.c:870
#1 0x00007ffff704aedd in do_nonbonded (cr=<optimized out>, fr=0x7fffe801eff0,
x=0x7fffe80c67d0, f_shortrange=0x7fffe80a6e40, f_longrange=0x7fffe80a6e40,
mdatoms=0x7fffe80446b0, excl=0x7fffe80a5a48, grppener=0x7fffe80a3bc8,
box_size=0x7ffff2ccbb30, nrnb=0x7fffe801ec60, lambda=0x7fffe8093c10,
dvdl=0x7ffff2ccbad0, nls=-1, eNL=-1, flags=18)
at /nethome/sander/git/gromacs-4.6/src/gmxlib/nonbonded/nonbonded.c:411
#2 0x00007ffff7507492 in do_force_lowlevel (fplog=0x0, step=1172,
fr=0x7fffe801eff0, ir=0x7fffe800ac90, idef=0x7fffe80a51a0,
cr=0x7fffe800ac10, nrnb=0x7fffe801ec60, wcycle=0x7fffe801e7e0,
md=0x7fffe80446b0, opts=0x7fffe800af38, x=0x7fffe80c67d0,
hist=0x7fffe80a5c00, f=0x7fffe80a6e40, f_longrange=0x7fffe80a6e40,
enerd=0x7fffe80a3a70, fcd=0x7fffe801d290, mtop=0x7fffe800b0c0,
top=0x7fffe80a51a0, born=0x0, atype=0x7fffe80a59f8, bBornRadii=0,
box=0x7fffe80a5aa8, fepvals=0x7fffe800b720, lambda=0x7fffe8093c10,
graph=0x0, excl=0x7fffe80a5a48, mu_tot=0x7fffe801f098, flags=243,
cycles_pme=0x7ffff2ccbf2c)
at /nethome/sander/git/gromacs-4.6/src/mdlib/force.c:285
#3 0x00007ffff752e0b5 in do_force_cutsGROUP (fplog=0x0, cr=0x7fffe800ac10,
inputrec=0x7fffe800ac90, step=1172, nrnb=0x7fffe801ec60,
wcycle=0x7fffe801e7e0, top=0x7fffe80a51a0, mtop=0x7fffe800b0c0,
groups=0x7fffe800b180, box=0x7fffe80a5aa8, x=0x7fffe80c67d0,
hist=0x7fffe80a5c00, f=0x7fffe80a6e40, vir_force=0x7ffff2ccc5f0,
mdatoms=0x7fffe80446b0, enerd=0x7fffe80a3a70, fcd=0x7fffe801d290,
lambda=0x7fffe8093c10, graph=0x0, fr=0x7fffe801eff0, vsite=0x0,
mu_tot=0x7ffff2ccc7a0, t=2.3439999999999999, field=0x0, ed=0x0,
bBornRadii=0, flags=243)
at /nethome/sander/git/gromacs-4.6/src/mdlib/sim_util.c:1792
#4 0x00000000004313e7 in do_md (fplog=0x0, cr=0x7fffe800ac10, nfile=36,
fnm=0x7fffe8009400, oenv=0x653640, bVerbose=1, bCompact=1,
nstglobalcomm=10, vsite=0x0, constr=0x7fffe807fd60, stepout=100,
ir=0x7fffe800ac90, top_global=0x7fffe800b0c0, fcd=0x7fffe801d290,
state_global=0x7fffe800b2c0, mdatoms=0x7fffe80446b0, nrnb=0x7fffe801ec60,
wcycle=0x7fffe801e7e0, ed=0x0, fr=0x7fffe801eff0, repl_ex_nst=0,
repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1,
deviceOptions=0x442079 "", Flags=1055744, runtime=0x7ffff2cccbc0)
at /nethome/sander/git/gromacs-4.6/src/kernel/md.c:1177
#5 0x0000000000411016 in mdrunner (hw_opt=<optimized out>, fplog=0x0,
cr=0x7fffe800ac10, nfile=36, fnm=0x7fffe8009400, oenv=0x653640,
bVerbose=1, bCompact=1, nstglobalcomm=-1, ddxyz=0x7ffff2cccdac,
dd_node_order=1, rdd=0, rconstr=0, dddlb_opt=0x442093 "auto",
dlb_scale=0.800000012, ddcsx=0x0, ddcsy=0x0, ddcsz=0x0,
nbpu_opt=<optimized out>, nsteps_cmdline=-2, nstepout=100, resetstep=-1,
nmultisim=0, repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, pforce=0,
cpt_period=15, max_hours=-1, deviceOptions=0x442079 "", Flags=1055744)
at /nethome/sander/git/gromacs-4.6/src/kernel/runner.c:1576
#6 0x00000000004123ea in mdrunner_start_fn (arg=0x66f5b0)
at /nethome/sander/git/gromacs-4.6/src/kernel/runner.c:173
#7 0x00007ffff70237be in tMPI_Thread_starter (arg=<optimized out>)
at /nethome/sander/git/gromacs-4.6/src/gmxlib/thread_mpi/tmpi_init.c:367
#8 0x00007ffff6c77e9a in start_thread (arg=0x7ffff2ccd700)
at pthread_create.c:308
#9 0x00007ffff69a4cbd in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#10 0x0000000000000000 in ?? ()

gmx version:

Gromacs version: VERSION 4.6.1-dev-20130122-a3a100c
GIT SHA1 hash: a3a100c27369906cd6485ecbc9830639707041cb
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: enabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
CPU acceleration: NONE
FFT library: fftw-3.3-sse2
Large file support: enabled
RDTSCP usage: enabled
Built on: Tue Jan 22 21:51:38 CET 2013
Built by: sander@cpc00 [CMAKE]
Build OS/arch: Linux 3.2.0-31-generic x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Build CPU family: 6 Model: 44 Stepping: 2
Build CPU features: apic clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
C compiler: /usr/bin/gcc GNU gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
C compiler flags: -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wall -Wno-unused -Wunused-value -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast -O3 -DNDEBUG -g -Wall -Wno-unused

fe_kernel_crash.tar.gz (101 KB): Input files needed to reproduce. Sander Pronk, 01/22/2013 10:28 PM
fe-kernel-crash.tar.gz (89.9 KB): Sander Pronk, 02/14/2013 08:51 AM

History

#1 Updated by Michael Shirts over 7 years ago

The TPR did not work (wrong version?), but when I regenerated it from the .mdp, I didn't get any crash running on 4 threads for about 150 ps. Is there a more consistent way of triggering the crash?

#2 Updated by Michael Shirts over 7 years ago

Hah, I was running the wrong version; the posted tpr now works. However, still no crash. SSE2 kernels, 4 threads, OS X build with gcc.

#3 Updated by Sander Pronk over 7 years ago

I get a roughly 50% chance of a crash with 8 threads after 2000 steps or so. I think valgrind would be a more useful way to find out what's going on.

#4 Updated by Michael Shirts over 7 years ago

Sander Pronk wrote:

I get a roughly 50% chance of a crash with 8 threads after 2000 steps or so. I think valgrind would be a more useful way to find out what's going on.

I can't get valgrind to work right now; I'm getting the 'cannot execute binary file' message. Perhaps you could run valgrind and let me know if anything promising comes up? I still can't reproduce the error (OS X 10.6, gcc 4.2.1, double precision).

Maybe it fails only in single? I'll check that.

#5 Updated by Sander Pronk over 7 years ago

I took a closer look at it yesterday, and it turned out to be caused by a NaN in gmx_nb_free_energy_kernel() which leads to a NaN in a force calculation, sending a particle off to infinity. The system I sent has sc-alpha=0, but is at lambda=1 which is fully decoupled, so it should work. If I set sc-alpha to 1 the issue goes away, so I guess it's a question of non-canceling terms.

I've attached a new archive with a tpr that reliably crashes at step 2955 with mdrun -reprod -nt 1, with mdrun compiled single precision with no cpu acceleration.

#6 Updated by Michael Shirts over 7 years ago

Sander Pronk wrote:

I took a closer look at it yesterday, and it turned out to be caused by a NaN in gmx_nb_free_energy_kernel() which leads to a NaN in a force calculation, sending a particle off to infinity. The system I sent has sc-alpha=0, but is at lambda=1 which is fully decoupled, so it should work. If I set sc-alpha to 1 the issue goes away, so I guess it's a question of non-canceling terms.

I've attached a new archive with a tpr that reliably crashes at step 2955 with mdrun -reprod -nt 1, with mdrun compiled single precision with no cpu acceleration.

It crashes for me too, but later, at step 5863, and by then everything is already NaN, so it's not clear what's causing the crash. Good to know that it only occurs with sc-alpha=0; that helps identify code paths.

#7 Updated by Sander Pronk over 7 years ago

I think in this case -reprod means reproducible for this binary only. You can enable SIGFPEs to happen on NaNs in Linux by calling this early on in mdrun:

feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

with

#define _GNU_SOURCE
#include <fenv.h>

in that source file.

That will make the program stop as soon as there is a NaN.
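
Since feenableexcept() is a GNU extension, it is not available on every platform. A portable fallback is to scan an array (for example the forces) after each evaluation and stop at the first non-finite entry. The following is only a minimal sketch: the function name and the flat float layout are hypothetical and are not taken from the GROMACS source.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not part of GROMACS): return the index of the first
 * NaN or infinity in a flat array of n floats, or -1 if all are finite. */
long first_nonfinite(const float *values, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (!isfinite(values[i]))
        {
            return (long)i;
        }
    }
    return -1;
}
```

Calling this once per step on the force array would report the step and atom where the first bad value shows up, at the cost of one extra pass over the data.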

#8 Updated by Michael Shirts over 7 years ago

So, it looks like two hydrogens are getting very close to each other - about 0.0075 nm - and this is causing one of the components to blow up to NaN, even though the force itself would be (big*0) = 0. It looks like it's a general numerical issue, not necessarily a specific problem in this case. Since the interactions are turned off, the atoms can go anywhere.

I'll have to think a bit more about how to solve this, since it's not specific to this system. It's a general numerical issue: free-floating atoms can overlap, and their interaction force is calculated before being multiplied by zero, so it can overflow, especially in single precision. We could short-circuit with a conditional when the potential is zero to get around the problem, but that would make the loop even more complicated, so I want to make sure first!

#9 Updated by Mark Abraham over 7 years ago

  • Status changed from New to Accepted
  • Target version deleted (4.6.1)
  • Affected version set to 4.6

#10 Updated by Rossen Apostolov over 6 years ago

  • Target version set to 4.6.x

Michael, do you plan to look at a fix?

#11 Updated by Rossen Apostolov over 6 years ago

@Michael - would you defer this to a later release?

#12 Updated by Mark Abraham about 6 years ago

  • Target version changed from 4.6.x to 5.x

#13 Updated by Erik Lindahl over 5 years ago

Another ping here Michael - it's been waiting for two years, so I have no idea what the status of this is.

#14 Updated by Mark Abraham about 4 years ago

  • Status changed from Accepted to Closed
  • Target version deleted (5.x)

Presumably nothing will ever happen here
