Bug #2801

MdrunTests segfault with intel compilers+mpi and AVX_512

Added by Mike Nolta about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I'm compiling GROMACS 2018.4 with the Intel compilers + MPI (version 2018u4) on a Skylake Xeon Gold system:

cmake .. \
    -DCMAKE_C_COMPILER=mpiicc \
    -DCMAKE_CXX_COMPILER=mpiicpc \
    -DGMX_FFT_LIBRARY=mkl \
    -DGMX_MPI=ON \
    -DGMX_SIMD=AVX_512

It builds successfully, but when I run 'make check', I get the following error:

      Start 32: MdrunTests
32/33 Test #32: MdrunTests .......................***Exception: SegFault  0.04 sec

      Start 33: MdrunMpiTests
33/33 Test #33: MdrunMpiTests ....................   Passed    1.11 sec

97% tests passed, 1 tests failed out of 33

If I build with AVX2_256, the error goes away.

Full transcript attached.

Associated revisions

Revision a6776ef4 (diff)
Added by Roland Schulz about 1 year ago

Remove unused MdrunComparisonFixture

For 2018 branch it is unused. For 2019 branch it has been
rewritten (c74114338714). Global variable causes SegFault
with ICC and GCC new C++ ABI (_GLIBCXX_USE_CXX11_ABI).

Fixes #2801

Change-Id: Ia41178aaa6e963b05cd4c1e52b8ca0d5946a569c

History

#1 Updated by Szilárd Páll about 1 year ago

Does the crash happen without MPI too? Could you provide the output of manually running the bin/mdrun-test binary in the build tree?

#2 Updated by Mike Nolta about 1 year ago

Yes, it happens without MPI.

$ ./bin/mdrun-test 
Segmentation fault

Here's the gdb backtrace:

#0  0x00000000006111f4 in __intel_skx_avx512_memcpy ()
#1  0x0000000000474837 in std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >* std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_copy<std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_Alloc_node>(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, 
std::_Rb_tree_node_base*, std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_Alloc_node&) ()
#2  0x00000000004753c4 in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >::map(std::initializer_list<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const&, 
std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > const&) ()
#3  0x000000000047224e in __sti___ZN3gmx4test12_GLOBAL__N_122mdpFileValueDatabase_gB5cxx11E ()
#4  0x000000000062015d in __libc_csu_init ()
#5  0x00007fffef792b95 in __libc_start_main () from /lib64/libc.so.6
#6  0x0000000000416f29 in _start ()

Might be an Intel compiler error.

#3 Updated by Szilárd Páll about 1 year ago

Indeed, it might be a compiler bug.

Can you try a couple of things:
- try -O2, -O1, and -O0 to see whether the crash relates to optimization levels
- try building with -DCMAKE_BUILD_TYPE=RelWithDebInfo and get a backtrace.

Thanks!

#4 Updated by Mike Nolta about 1 year ago

Still segfaults with -DCMAKE_BUILD_TYPE=RelWithDebInfo (which sets -O2), but not with -DCMAKE_BUILD_TYPE=ASan (which sets -O1).

Here's the RelWithDebInfo backtrace:

#0  0x000000000060e3f4 in __intel_skx_avx512_memcpy ()
#1  0x0000000000471e97 in copy (__s1=<optimized out>, __s2=<optimized out>, __n=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/char_traits.h:290
#2  _S_copy (__d=<optimized out>, __s=<optimized out>, __n=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.h:300
#3  _S_copy_chars (__p=<optimized out>, __k1=<optimized out>, __k2=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.h:342
#4  _M_construct (this=<optimized out>, __beg=0x92b030 "free-energy       = yes\nsc-alpha          = 0.5\nsc-r-power        = 6\nnstdhdl", ' ' <repeats 11 times>, "= 4\ninit-lambda-state = 3\nfep_lambdas       = 0.00 0.50 1.00 1.00 1.00\nvdw_lambdas       = 0.00 0.00 0.00 0.50 1"..., __end=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.tcc:225
#5  _M_construct_aux (this=<optimized out>, __beg=0x92b030 "free-energy       = yes\nsc-alpha          = 0.5\nsc-r-power        = 6\nnstdhdl", ' ' <repeats 11 times>, "= 4\ninit-lambda-state = 3\nfep_lambdas       = 0.00 0.50 1.00 1.00 1.00\nvdw_lambdas       = 0.00 0.00 0.00 0.50 1"..., __end=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.h:196
#6  _M_construct (this=<optimized out>, __beg=0x92b030 "free-energy       = yes\nsc-alpha          = 0.5\nsc-r-power        = 6\nnstdhdl", ' ' <repeats 11 times>, "= 4\ninit-lambda-state = 3\nfep_lambdas       = 0.00 0.50 1.00 1.00 1.00\nvdw_lambdas       = 0.00 0.00 0.00 0.50 1"..., __end=0x92bbc0 "free-energy       = yes\nsc-alpha          = 0.5\nsc-r-power        = 6\nnstdhdl", ' ' <repeats 11 times>, "= 4\ninit-lambda-state = 3\nfep_lambdas       = 0.00 0.50 1.00 1.00 1.00\nvdw_lambdas       = 0.00 0.00 0.00 0.50 1"...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.h:215
#7  basic_string (this=<optimized out>, __str=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/basic_string.h:400
#8  pair (this=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_pair.h:288
#9  construct (this=<optimized out>, __p=<optimized out>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/ext/new_allocator.h:120
#10 construct (__a=..., __p=0x92bb70, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/alloc_traits.h:475
#11 _M_construct_node (this=<optimized out>, __node=<optimized out>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:543
#12 _M_create_node (this=0x9032cc <__libirc_copy_loop_threshold>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:560
#13 operator() (this=<optimized out>, __arg=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:473
#14 _M_clone_node (this=<optimized out>, __x=<optimized out>, __node_gen=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:583
#15 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_copy<std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_Alloc_node> (this=0x947000, __x=0x946570, __p=0x9032cc <__libirc_copy_loop_threshold>, __node_gen=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:1612
#16 0x000000000047281e in _M_copy (this=<optimized out>, __x=0x92aee0, __p=0x2, __node_gen=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:1606
#17 _M_copy (this=<optimized out>, __x=<optimized out>, __p=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:797
#18 _Rb_tree (this=<optimized out>, __x=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:833
#19 map (this=<optimized out>, __x=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_map.h:186
#20 pair (this=<optimized out>) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_pair.h:288
#21 construct (this=<optimized out>, __p=<optimized out>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/ext/new_allocator.h:120
#22 construct (__a=..., __p=0x2, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/alloc_traits.h:475
#23 _M_construct_node (this=<optimized out>, __node=<optimized out>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:543
#24 _M_create_node (this=<optimized out>, __args=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:560
#25 operator() (this=<optimized out>, __arg=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:473
#26 _M_insert_ (this=<optimized out>, __x=<optimized out>, __p=<optimized out>, __v=..., __node_gen=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:1535
#27 _M_insert_unique_ (this=<optimized out>, __position=..., __v=..., __node_gen=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:2004
#28 _M_insert_unique (this=<optimized out>, __first=<optimized out>, __last=0x946570) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_tree.h:2250
#29 std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > >::map (this=0x947000, __l=..., __comp=..., __a=...) at /gpfs/fs1/scinet/niagara/software/2018a/core/bin/../include/c++/6.4.0/bits/stl_map.h:215
#30 0x000000000046e859 in __sti___ZN3gmx4test12_GLOBAL__N_122mdpFileValueDatabase_gB5cxx11E () at /tmp/tmp.ZSYVcgGLvl/gromacs-2018.4/src/programs/mdrun/tests/mdruncomparisonfixture.cpp:72
#31 0x000000000061cf8d in __libc_csu_init ()
#32 0x00007fffef634b95 in __libc_start_main () from /lib64/libc.so.6
#33 0x00000000004167e9 in _start ()
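
For readers following the backtrace: frame #30 demangles to the dynamic initializer of gmx::test::(anonymous namespace)::mdpFileValueDatabase_g, and the frames above it show a std::map from std::string to another std::map of strings being built from an initializer list before main() runs. A minimal sketch of that shape follows. The names exampleDatabase_g and lookupValue and all entries are hypothetical; this is neither the GROMACS source nor a reproducer for the crash.

```cpp
#include <map>
#include <string>

namespace
{

using FieldValues = std::map<std::string, std::string>;

// Hypothetical stand-in for a namespace-scope test database. Because the
// initializer is not a constant expression, the object is built by a
// dynamic initializer that runs during program startup, before main().
const std::map<std::string, FieldValues> exampleDatabase_g = {
    { "simple", { { "nsteps", "16" }, { "integrator", "md" } } },
    { "free-energy", { { "free-energy", "yes" }, { "sc-alpha", "0.5" } } }
};

} // namespace

// Looks up one value; returns an empty string if either key is missing.
std::string lookupValue(const std::string& simulation, const std::string& field)
{
    const auto sim = exampleDatabase_g.find(simulation);
    if (sim == exampleDatabase_g.end())
    {
        return std::string();
    }
    const auto value = sim->second.find(field);
    if (value == sim->second.end())
    {
        return std::string();
    }
    return value->second;
}
```

Constructing such an object copies every key and value string out of the initializer list, which matches the string copy (and the memcpy underneath it) at the top of the trace.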

#5 Updated by Mike Nolta about 1 year ago

Yep, this is a compiler bug. Sorry for the noise.

#6 Updated by Roland Schulz about 1 year ago

I can reproduce it but only if I use GCC >= 5 for the C++ std library. If I use e.g. 4.8.5 the problem doesn't appear. You can choose which GCC version is used by putting the version in the path (see also install-guide for other options). Also if I compile with a newer GCC but run cmake with "CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 cmake <all other flags>" it also works (prior to GCC 5 this was the default). Could you verify that with older GCC or the old ABI the problem doesn't appear?
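
One way to confirm which std::string ABI a given build actually picked up is to inspect the _GLIBCXX_USE_CXX11_ABI macro, which libstdc++ from GCC 5 onward normally defines once any standard header has been included. A small sketch, assuming libstdc++ is the standard library in use (describeStringAbi is a hypothetical helper name, not part of GROMACS):

```cpp
#include <string>

// Reports which libstdc++ std::string ABI this translation unit sees.
// The macro is set by libstdc++ itself (GCC >= 5) or by the user via
// -D_GLIBCXX_USE_CXX11_ABI=0; older libstdc++ versions do not define it.
std::string describeStringAbi()
{
#if defined(_GLIBCXX_USE_CXX11_ABI)
#    if _GLIBCXX_USE_CXX11_ABI
    return "new (cxx11) ABI";
#    else
    return "old (pre-GCC 5) ABI";
#    endif
#else
    return "unknown: _GLIBCXX_USE_CXX11_ABI is not defined";
#endif
}
```

Compiling this once with and once without -D_GLIBCXX_USE_CXX11_ABI=0 should show the two settings discussed above.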

Did you do any further tests beyond the ones you already reported here to make sure this is a compiler bug rather than a GROMACS or GCC bug? If so, please share those details too. We like to add workarounds to the GROMACS code if possible, and I would like to give as much detail as possible to the compiler team. In case you already filed a bug report yourself, please also let me know what the issue number is. In case you want to send any info by email rather than posting it here, please send it to .

#7 Updated by Mike Nolta about 1 year ago

Yes, adding -D_GLIBCXX_USE_CXX11_ABI=0 makes the segfault go away.

I was able to extract a simple ~100 line testcase which fails with the same error, so this doesn't appear to be a GROMACS bug. Intel ticket number is 03996974.

#8 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2801.
Uploader: Roland Schulz
Change-Id: gromacs~release-2018~Ia41178aaa6e963b05cd4c1e52b8ca0d5946a569c
Gerrit URL: https://gerrit.gromacs.org/8823

#9 Updated by Mike Nolta about 1 year ago

Intel support says this is a tail-call elimination optimization bug, and will be fixed in an upcoming release.

#10 Updated by Roland Schulz about 1 year ago

  • Target version set to 2018.5

The part of the code which causes the issue isn't used at all in GROMACS. It is infrastructure that was intended for new tests, but no new tests were added for 2018, and for 2019 the infrastructure was refactored in a way that avoids the issue. I uploaded a patch which removes it. This should fix the issue for the release-2018 branch and the 2018.5 version.

#11 Updated by Roland Schulz about 1 year ago

  • Status changed from New to Resolved

#12 Updated by Paul Bauer about 1 year ago

  • Status changed from Resolved to Closed
