Bug #431

GB simulation explodes or crashes in nonbonded kernel

Added by Szilárd Páll about 9 years ago. Updated about 9 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Erik Lindahl
Category:
mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

A GB simulation of a ~5k-atom system (attached) consistently explodes or crashes. The bug is 100% reproducible under the following conditions:
- git version 5922b72
- gcc 4.1.3, dynamically linked mdrun binary (NOT statically linked)
- optimization levels O3/2/1, but NOT O0
- also debug version

Tested on AMD X6 1090T + Ubuntu 9.04 x86_64 and the same binary on Core i5 750 + Ubuntu 9.10 x86_64.

Crash details (on AMD X6):

======== 1-3 thread(s) ========
step 700, remaining runtime: -15 s
Warning: 1-4 interaction between 3907 and 3912 at distance 7.505 which is larger than the 1-4 table size 2.200 nm
These are ignored for the rest of the simulation
This usually means your system is exploding,
if not, you should increase table-extension in your mdp file
or with user tables increase the table size

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fa0d9d676f0 (LWP 2027)]
nb_kernel410_x86_64_sse (p_nri=<value optimized out>, iinr=0x1f1b2b0, jindex=0x1f56b00, jjnr=0x7fa0d8d8b010,
shift=0x1f42d90, shiftvec=0x1b47490, fshift=0x1b476c0, gid=0x1f2f020, pos=0x1aa4f30, faction=0x1dca520,
charge=0x1f0c3f0, p_facel=0x1b38148, p_krf=0x1b38150, p_crf=0x1b38154, vc=0x1dc9880, type=0x1f11350,
p_ntype=0x1b38328, vdwparam=0x1b478f0, vvdw=0x1dc9670, p_tabscale=0x1d867b8, VFtab=0x0,
invsqrta=0x1ce08c0, dvda=0x1ce5820, p_gbtabscale=0x1b38398, GBtab=0x7fa0d9b91020,
p_nthreads=0x7fff6c3fbdfc, count=0x1d86840, mtx=0x0, outeriter=0x7fff6c3fbdf8, inneriter=0x7fff6c3fbdf4,
work=0x7fff6c3fbde0) at /usr/lib/gcc/x86_64-linux-gnu/4.1.3/include/xmmintrin.h:876
876 return (__m128) *(__v4sf *)__P;

======== >=4 threads ========
Step 950:
The charge group starting at atom 3925 moved than the distance allowed by the domain decomposition (3.826748) in direction X
distance out of cell -15.151535
Old coordinates: 16.482 2.773 3.745
New coordinates: -2.707 -9.060 -36.987
Old cell boundaries in direction X: 12.381 16.151
New cell boundaries in direction X: 12.445 16.271

-------------------------------------------------------
Program mdrun_gcc413_dynamic_debug, VERSION 4.0.99-dev-20100608-5922b72
Source code file: ../../../src/mdlib/domdec.c, line: 4081
[...]

topol.tpr.bz2 (320 KB): spectrin implicit GB system. Szilárd Páll, 06/10/2010 12:04 PM

History

#1 Updated by Szilárd Páll about 9 years ago

Created an attachment (id=465)
spectrin implicit GB system

#2 Updated by Berk Hess about 9 years ago

This is quite probably a gcc 4.1 optimization bug (in this case 4.1.3).
gcc 4.1.3 gives incorrect results with -O2 for the test program at:
https://bugs.launchpad.net/ubuntu/+source/gcc-4.1/+bug/158799

So we are not 100% sure this is not a GB bug, but let's not waste more
time on this.

We will put a warning or error in the Gromacs configure script
when it detects gcc 4.1.x.

Berk
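
For illustration only: the check proposed above would live in the configure script, but the same idea can be sketched in C using the version macros that gcc predefines (`__GNUC__`, `__GNUC_MINOR__`). The helper names `is_gcc_4_1` and `warn_if_gcc_4_1` are hypothetical and not part of Gromacs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if (major, minor) is in the gcc 4.1.x
 * series, 0 otherwise. */
static int is_gcc_4_1(int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    return major == 4 && minor == 1;
}

/* Hypothetical helper: prints a warning to stderr if this file was built
 * with gcc 4.1.x. Returns 1 if a warning was printed, 0 otherwise.
 * __GNUC__ and __GNUC_MINOR__ are predefined by gcc and by compilers that
 * emulate it. */
static int warn_if_gcc_4_1(void)
{
#if defined(__GNUC__) && defined(__GNUC_MINOR__)
    if (is_gcc_4_1(__GNUC__, __GNUC_MINOR__)) {
        fprintf(stderr,
                "Warning: built with gcc %d.%d; the gcc 4.1.x series is "
                "suspected of miscompiling optimized code.\n",
                __GNUC__, __GNUC_MINOR__);
        return 1;
    }
#endif
    return 0;
}
```

Calling `warn_if_gcc_4_1()` once at program start would be enough to surface the warning. A configure-time check has the advantage of stopping the build before any binary is produced.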

#3 Updated by Szilárd Páll about 9 years ago

The GB 1-2, LJ and Coulomb terms are already different at step 0 when running binaries compiled with gcc 4.1.3 -O0 and -O3. This could be due to a bug in the code and not in gcc.

I'll mail Per about this.

Are there any gcc 4.1.x issues without GB?
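
A minimal sketch of how such a step-0 comparison between two builds could be automated, assuming the energy terms from each binary have already been extracted into arrays. The function names and the relative tolerance are assumptions for illustration; nothing here is taken from the Gromacs tools.

```c
#include <assert.h>

/* Absolute value without pulling in the math library. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns 1 if a and b differ by more than rel_tol relative to the larger
 * of their magnitudes, 0 otherwise. Two exact zeros compare as equal. */
static int terms_differ(double a, double b, double rel_tol)
{
    double scale;
    assert(rel_tol >= 0.0);
    scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    if (scale == 0.0) {
        return 0;
    }
    return abs_d(a - b) > rel_tol * scale;
}

/* Counts how many of the n energy terms differ between a reference build
 * (e.g. -O0) and a test build (e.g. -O3). */
static int count_mismatches(const double *ref, const double *test, int n,
                            double rel_tol)
{
    int i;
    int mismatches = 0;
    for (i = 0; i < n; i++) {
        mismatches += terms_differ(ref[i], test[i], rel_tol);
    }
    return mismatches;
}
```

A nonzero count at step 0 means the two builds already disagree before any integration error can accumulate, which is the situation described above.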

#4 Updated by Per Larsson about 9 years ago

I've isolated this to a smaller 23-atom test system on the same machine.
Resolution will follow...

#5 Updated by Per Larsson about 9 years ago

This was indeed a GB bug and not a gcc 4.1.3 bug.

It was caused by a faulty SSE routine that updated two potential values at once.
I have replaced it with a routine that updates a single value, and that routine is now called twice.
Updating both values in one routine should of course also work (I will look into that), but for now the code works again.
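
The workaround pattern described above could look roughly like the following sketch. The faulty routine itself is not shown in this report, so this illustrates only the "update one value, call it twice" approach. The names `update_one_pot` and `update_two_pots` are hypothetical and are not the actual Gromacs kernel helpers. It assumes an x86 target with SSE.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: sums the four lanes of an SSE accumulator and adds
 * the result to a single potential value in memory. The unaligned store
 * avoids any alignment requirement on the temporary. */
static void update_one_pot(__m128 acc, float *pot)
{
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    *pot += (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Hypothetical wrapper: updates two potentials by calling the single-value
 * routine twice instead of combining both reductions in one routine. */
static void update_two_pots(__m128 acc_a, float *pot_a,
                            __m128 acc_b, float *pot_b)
{
    assert(pot_a && pot_b);
    update_one_pot(acc_a, pot_a);
    update_one_pot(acc_b, pot_b);
}
```

Two independent reductions cost a few extra instructions compared with a fused version, but each call is easy to verify on its own, which fits a quick fix.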
