Bug #74

segmentation fault in mdrun (with PME) on Opteron

Added by Diane Fournier over 13 years ago. Updated over 12 years ago.

Status: Closed
Priority: High
Assignee: Erik Lindahl
Category: mdrun
Difficulty: uncategorized

Description

mdrun segfaults every time PME is used in the steepest-descent minimization
step of John Kerrigan's Tutorial for Drug-Enzyme Complex. The .log file ends
without any iterations.

The Altix is running Red Hat Enterprise Linux AS release 3 with the Intel
Math Kernel Library (MKL) v. 8.0.1 as the FFT library, and these compilers:

C++ Version 9   9.0-023 -> 9.0-031
C++ Version 8   8.1-033 -> 8.1-036
Fortran 9       9.0-021 -> 9.0-032
Fortran 8       8.1-029 -> 8.1-033
IPP             4.1 -> 5.0

This machine is used remotely over an ssh connection, with Cygwin/X for
graphics.

The same .tpr file runs fine with mdrun 3.3.1 on a Pentium 4 under Fedora
Core 4 Linux with FFTW3.

The bug happens on a single node (I have not tried MPI mode), with builds
compiled both with and without Fortran. The only difference is that
"segmentation fault" is no longer mentioned in the output; the .log file is
identical.

trp_em.tpr (2.21 MB): tpr file for minimization step in Kerrigan tutorial (Diane Fournier, 05/02/2006 10:29 PM)

History

#1 Updated by Erik Lindahl over 13 years ago

Hi Diane,

Let's try to isolate this a bit.
First, I assume the segfault is happening "immediately" during the first timestep?

Check what happens if you set this environment variable before starting mdrun:

export NOASSEMBLYLOOPS=1

Unfortunately, we have experienced bugs in some versions of the Intel FFT libraries. So if it still crashes
when the ia64-specific kernels are disabled, you should try recompiling with either FFTW or the built-in
FFTPACK for FFTs (--with-fft=fftpack).
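The first check suggested here can be scripted so that both runs are compared side by side. This is a minimal sketch, assuming a GROMACS 3.3-style mdrun that takes -s for the run input file and -g for the log file; the MDRUN and TPR variables, the log file names and the function names are hypothetical. It relies on the shell convention that a child killed by signal N reports exit status 128 + N, so a segfault (SIGSEGV, signal 11 on Linux) appears as 139.

```bash
# Sketch: run one minimization with the default kernels and one with
# NOASSEMBLYLOOPS=1, then report how each run ended.
# MDRUN, TPR and the log file names are assumptions; override as needed.
MDRUN=${MDRUN:-mdrun}
TPR=${TPR:-trp_em.tpr}

describe_status() {
  # A shell reports 128+N for a child killed by signal N;
  # SIGSEGV is signal 11 on Linux, hence 139.
  case "$1" in
    0)   echo "ok" ;;
    139) echo "segfault" ;;
    *)   echo "exit $1" ;;
  esac
}

compare_kernels() {
  local default_status noasm_status
  "$MDRUN" -s "$TPR" -g em_default.log
  default_status=$?
  # The variable is set for this one command only.
  NOASSEMBLYLOOPS=1 "$MDRUN" -s "$TPR" -g em_noasm.log
  noasm_status=$?
  echo "default kernels:   $(describe_status "$default_status")"
  echo "NOASSEMBLYLOOPS=1: $(describe_status "$noasm_status")"
}
```

Source the file, then call compare_kernels in the directory containing trp_em.tpr. A non-zero status other than 139 still deserves a look at the log, since one of the builds in this report crashed without printing "segmentation fault".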

For reference, it would also be nice if you could upload the tpr file as an attachment!

Cheers,

Erik

#2 Updated by Diane Fournier over 13 years ago

Created an attachment (id=39)
tpr file for minimization step in Kerrigan tutorial

#3 Updated by Diane Fournier over 13 years ago

The result is the same (no iterations in the em.log file) when I set the variable.
(This is with a version compiled in single precision without Fortran.) The people
responsible for the system will try different FFT libraries.

#4 Updated by Erik Lindahl over 13 years ago

Great. In the worst case it's an MKL bug, but I can probably work around that, or at least report it to Intel so it
gets fixed.

The time spent on FFTs is only a small fraction of the runtime, so you will get almost exactly the same
performance from using FFTW3.

Cheers,

Erik

#5 Updated by Timm Essigke over 13 years ago

We had the same problems with PME using Intel Cluster MKL 8.0 on Opteron machines.

Using FFTW3 seems to fix the problem, but I had problems compiling it.
With --enable-sse and gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13) I get:

gcc -DHAVE_CONFIG_H -I. -I. -I.. -I../kernel
-I/opt/btbs/gromacs/fftw-3.0.1/install/include/ -O3 -fomit-frame-pointer
-fno-schedule-insns -fstrict-aliasing -mpreferred-stack-boundary=4 -pthread
-msse -MT sse.lo -MD -MP -MF .deps/sse.Tpo -c sse.c -o sse.o
/tmp/ccfieh24.s: Assembler messages:
/tmp/ccfieh24.s:83: Error: suffix or operands invalid for `push'
/tmp/ccfieh24.s:85: Error: suffix or operands invalid for `pop'

With gcc version 4.0.3 20051111 (prerelease) (Debian 4.0.2-4) I get:

/usr/bin/gcc-4.0 -DHAVE_CONFIG_H -I. -I. -I.. -I../kernel
-I/opt/btbs/gromacs/fftw-3.0.1/install/include/ -O3 -fomit-frame-pointer
-fno-schedule-insns -fstrict-aliasing -mpreferred-stack-boundary=4 -pthread
-msse -MT sse.lo -MD -MP -MF .deps/sse.Tpo -c sse.c -o sse.o
In file included from simd.h:22,
from sse.c:24:
simd-sse.h:30: warning: specifying vector types with __attribute__ ((mode)) is
deprecated
simd-sse.h:30: warning: use __attribute__ ((vector_size)) instead
/tmp/ccqIolhZ.s: Assembler messages:
/tmp/ccqIolhZ.s:75: Error: suffix or operands invalid for `push'
/tmp/ccqIolhZ.s:77: Error: suffix or operands invalid for `pop'

Without --enable-sse I can compile it, but I get

/usr/bin/gcc-4.0 -shared .libs/calcmu.o .libs/calcvir.o .libs/constr.o
.libs/coupling.o .libs/ebin.o .libs/edsam.o .libs/ewald.o .libs/force.o
.libs/ghat.o .libs/init.o .libs/mdatom.o .libs/mdebin.o .libs/minimize.o
.libs/ns.o .libs/nsb.o .libs/nsgrid.o .libs/pme.o .libs/pppm.o .libs/fftgrid.o
.libs/pull.o .libs/pullinit.o .libs/pullio.o .libs/pullutil.o .libs/rf_util.o
.libs/shakef.o .libs/sim_util.o .libs/splittop.o .libs/tables.o .libs/tgroup.o
.libs/update.o .libs/vcm.o .libs/vsite.o .libs/wnblist.o .libs/csettle.o
.libs/clincs.o .libs/qmmm.o .libs/gmx_fft.o .libs/gmx_parallel_3dfft.o
.libs/qm_gaussian.o .libs/gmx_fft_fftw3.o
-L/opt/btbs/gromacs/fftw-3.0.1/install/lib/ -L/usr/X11R6/lib -lnsl
/opt/btbs/gromacs/fftw-3.0.1/install.new//lib/libfftw3f.a -lm /usr/lib/libXm.so
-lXt -lSM -lICE -lXext -lXp -lX11 -Wl,-soname -Wl,libmd.so.4 -o
.libs/libmd.so.4.0.0
/usr/bin/ld: /opt/btbs/gromacs/fftw-3.0.1/install.new//lib/libfftw3f.a(alloc.o):
relocation R_X86_64_32 against `a local symbol' can not be used when making a
shared object; recompile with -fPIC
/opt/btbs/gromacs/fftw-3.0.1/install.new//lib/libfftw3f.a: could not read
symbols: Bad value

when linking GROMACS.
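The linker message means libfftw3f.a contains objects compiled without position-independent code. One way to check an archive before linking it into libmd.so is to look for absolute 32-bit x86-64 relocations (R_X86_64_32 or R_X86_64_32S) in the output of readelf -r. This is a sketch assuming GNU binutils readelf is installed; has_abs32_relocs and check_archive are hypothetical helper names.

```bash
# Sketch: detect absolute 32-bit x86-64 relocations, which the linker
# rejects when building a shared library.
# has_abs32_relocs reads `readelf -r` output on stdin (hypothetical helper).
has_abs32_relocs() {
  grep -Eq '[[:space:]]R_X86_64_32S?[[:space:]]'
}

# check_archive is also hypothetical; it returns 1 if the archive
# looks non-PIC and 0 otherwise.
check_archive() {
  if readelf -r "$1" | has_abs32_relocs; then
    echo "$1: absolute 32-bit relocations found (not built as PIC)"
    return 1
  fi
  echo "$1: no absolute 32-bit relocations found"
}
```

For example, check_archive /opt/btbs/gromacs/fftw-3.0.1/install.new/lib/libfftw3f.a would be expected to flag the archive from the failing link above.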

I got a working build of FFTW3 with:
./configure --enable-float --enable-threads \
  --prefix=/opt/btbs/gromacs/fftw-3.0.1/install.new/ --with-pic

I also compiled GROMACS with --with-pic (not sure whether that is necessary):
./configure --prefix=/opt/btbs/gromacs/gromacs-3.3.1/install.serial.fftw3 \
  --with-fft=fftw3 --enable-shared --with-pic
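The working sequence from this comment can be collected into one small script. This is a minimal sketch: the prefixes default to the paths used in this report, the run helper and the DRY_RUN switch are hypothetical conveniences that are not part of FFTW or GROMACS, and passing the FFTW location to the GROMACS configure script through CPPFLAGS and LDFLAGS is an assumption about how your build locates FFTW.

```bash
# Sketch of the PIC build sequence; all names below are assumptions.
FFTW_PREFIX=${FFTW_PREFIX:-/opt/btbs/gromacs/fftw-3.0.1/install.new}
GMX_PREFIX=${GMX_PREFIX:-/opt/btbs/gromacs/gromacs-3.3.1/install.serial.fftw3}

run() {
  # Print each command; execute it only when DRY_RUN is not 1.
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != 1 ]; then
    "$@"
  fi
}

build_fftw() {
  # Run from the unpacked fftw-3.0.1 source tree.
  run ./configure --enable-float --enable-threads \
      --prefix="$FFTW_PREFIX" --with-pic &&
  run make && run make install
}

build_gromacs() {
  # Run from the unpacked gromacs-3.3.1 source tree.
  # Using CPPFLAGS/LDFLAGS to point at FFTW is an assumption.
  run env CPPFLAGS="-I$FFTW_PREFIX/include" LDFLAGS="-L$FFTW_PREFIX/lib" \
      ./configure --prefix="$GMX_PREFIX" --with-fft=fftw3 \
      --enable-shared --with-pic &&
  run make && run make install
}
```

Running DRY_RUN=1 build_fftw prints the three commands without executing them, which is a cheap way to confirm that --with-pic and the intended prefix are in place before a long rebuild.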

Maybe the documentation should be updated to mention the --with-pic issue.

Timm

#6 Updated by Erik Lindahl over 13 years ago

> Using FFTW3 seems to fix the problem, but I had problems to compile it:
> --enable-sse gives with gcc version 3.4.4 20050314 (prerelease) (Debian 3.4.3-13):

SSE did not work for x86-64 with FFTW 3.0.1, unless you hacked the source yourself to replace some
assembly symbols. I think this has been fixed in FFTW 3.1.1.

Cheers,

Erik

#7 Updated by David van der Spoel about 13 years ago

Do we really want to support different FFT libraries? From the different reports
it seems that only FFTW 3 works reliably.

#8 Updated by David van der Spoel over 12 years ago

Can we close this bug?
I'm still not convinced that we would like to support a whole range of FFT
packages if they are buggy.

#9 Updated by David van der Spoel over 12 years ago

Since this is due to libraries outside gromacs I'm closing this bug.
