Bug #2844

Test SEGV with -DGMX_DOUBLE due to LAPACK

Added by David van der Spoel 10 months ago. Updated 10 months ago.

Status:
Closed
Priority:
High
Category:
testing
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

Compiling with -DGMX_DOUBLE=ON leads to SEGV in mdrun-non-integrator-test.

History

#1 Updated by David van der Spoel 10 months ago

Works on macOS Mojave but crashes on Linux with gcc 8.2.

#2 Updated by David van der Spoel 10 months ago

Found the culprit with git bisect

88825ae46b77e3ce510fddd0f205c45c5abe8c73

Added new testing code for normal mode analysis.

(my own patch in other words).

#3 Updated by David van der Spoel 10 months ago

Unfortunately the issue seems to be in the linear algebra code. Commenting out the call to this Fortran function removes the SEGV:

F77_FUNC(dsyevr, DSYEVR)(jobz, "I", "L", &n, a, &n, &vl, &vu, &index_lower, &index_upper,
                         &abstol, &m, eigenvalues, eigenvectors, &n,
                         isuppz, &w0, &lwork, &iw0, &liwork, &info);

#4 Updated by David van der Spoel 10 months ago

To make it even more puzzling, the bug appears when normal mode testing is combined with minimizer testing in one executable. Calling the LAPACK routine seems to corrupt memory, but valgrind does not find anything.

#5 Updated by David van der Spoel 10 months ago

  • Subject changed from Test SEGV with -DGMX_DOUBLE to Test SEGV with -DGMX_DOUBLE due to LAPACK

When turning off the external LAPACK and BLAS the code seems to work fine.

#6 Updated by Roland Schulz 10 months ago

The error doesn't happen in Jenkins but only locally on your machine?
Which external BLAS are you using? OpenBLAS, MKL or other? Which version? If you use a different external BLAS do you still see the error?
Have you tried with ASAN? It can detect e.g. stack memory corruption which valgrind doesn't detect.

#7 Updated by David van der Spoel 10 months ago

Thanks for the tip. The problem shows up as an FPE during printing:

==25345==ERROR: AddressSanitizer: FPE on unknown address 0x003ba3641b20 (pc 0x003ba3641b20 bp 0x7ffc8aaadc00 sp 0x7ffc8aaad988 T0)
#0 0x3ba3641b1f (/lib64/libc.so.6+0x3ba3641b1f)
#1 0x3ba364b82a in __GI___printf_fp (/lib64/libc.so.6+0x3ba364b82a)
#2 0x3ba364570f in _IO_vfprintf (/lib64/libc.so.6+0x3ba364570f)
#3 0x3ba366f5c1 in _IO_vsnprintf (/lib64/libc.so.6+0x3ba366f5c1)
#4 0x2ae5015e49b5 in __interceptor_vsnprintf ../../../../libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1509
#5 0x2ae502c1f9b3 in gmx::formatStringV[abi:cxx11](char const*, __va_list_tag*) /home/spoel/GG/testmaster/gromacs/src/gromacs/utility/stringutil.cpp:158
#6 0x2ae502c1fbcf in gmx::formatString[abi:cxx11](char const*, ...) /home/spoel/GG/testmaster/gromacs/src/gromacs/utility/stringutil.cpp:140
#7 0x2ae5070f0df2 in formatListSetup /home/spoel/GG/testmaster/gromacs/src/gromacs/mdlib/nbnxn_tuning.cpp:476
#8 0x2ae5070f34f2 in setupDynamicPairlistPruning(gmx::MDLogger const&, t_inputrec const*, gmx_mtop_t const*, double (*) [3], int, interaction_const_t const*, NbnxnListParameters*) /home/spoel/GG/testmaster/gromacs/src/gromacs/mdlib/nbnxn_tuning.cpp:566
#9 0x2ae5034615ff in init_nb_verlet /home/spoel/GG/testmaster/gromacs/src/gromacs/mdlib/forcerec.cpp:2150
#10 0x2ae5034615ff in init_forcerec(_IO_FILE*, gmx::MDLogger const&, t_forcerec*, t_fcdata*, t_inputrec const*, gmx_mtop_t const*, t_commrec const*, double (*) [3], char const*, char const*, gmx::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, gmx_hw_info_t const&, gmx_device_info_t const*, bool, bool, double) /home/spoel/GG/testmaster/gromacs/src/gromacs/mdlib/forcerec.cpp:3019
#11 0x2ae50741c19d in gmx::Mdrunner::mdrunner() /home/spoel/GG/testmaster/gromacs/src/gromacs/mdrun/runner.cpp:1218
#12 0x4b9a0b in gmx::gmx_mdrun(int, char**) /home/spoel/GG/testmaster/gromacs/src/programs/mdrun/mdrun.cpp:290
#13 0x477bd6 in gmx::test::SimulationRunner::callMdrun(gmx::test::CommandLine const&) /home/spoel/GG/testmaster/gromacs/src/programs/mdrun/tests/moduletest.cpp:279
#14 0x44c190 in TestBody /home/spoel/GG/testmaster/gromacs/src/programs/mdrun/tests/normalmodes.cpp:134
#15 0x5d9a91 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2402
#16 0x5d9a91 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2438
#17 0x5b8ce4 in testing::Test::Run() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2474
#18 0x5b8fb5 in testing::TestInfo::Run() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2656
#19 0x5b91a8 in testing::TestCase::Run() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2774
#20 0x5ba056 in testing::internal::UnitTestImpl::RunAllTests() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:4649
#21 0x5daa3d in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2402
#22 0x5daa3d in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:2438
#23 0x5ba6a8 in testing::UnitTest::Run() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/src/gtest.cc:4257
#24 0x4608c9 in RUN_ALL_TESTS() /home/spoel/GG/testmaster/gromacs/src/external/googletest/googletest/include/gtest/gtest.h:2233
#25 0x4608c9 in main /home/spoel/GG/testmaster/gromacs/src/testutils/unittest_main.cpp:85
#26 0x3ba361ed1c in __libc_start_main (/lib64/libc.so.6+0x3ba361ed1c)
#27 0x43ba18 (/home/spoel/GG/testmaster/gromacs/build/bin/normal-modes-test+0x43ba18)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: FPE (/lib64/libc.so.6+0x3ba3641b1f)
==25345==ABORTING

#8 Updated by Roland Schulz 10 months ago

ASAN just reported the Segfault. It gives a detailed report just before any invalid memory access.

The loop in formatStringV is suspicious, but I can't see a bug. Maybe we should simplify it nonetheless by passing 0 to vsnprintf as the length and removing the staticBuf. We might need to use _vscprintf on Windows (https://stackoverflow.com/questions/8488671/unix-to-windows-alternative-to-vsnprintf-to-determine-length).
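
For illustration, the simplification proposed here could look roughly like the sketch below. It is not the GROMACS implementation: the names formatStringTwoPass and formatStringSketch are hypothetical, and it assumes a C99/C++11-conforming vsnprintf that returns the required length when given a size of 0 (the case where _vscprintf might be needed on Windows is not handled).

```cpp
#include <cstdarg>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the GROMACS formatStringV.
// Formats in two passes: the first measures, the second writes.
std::string formatStringTwoPass(const char* fmt, va_list ap)
{
    // The argument list is consumed by each vsnprintf call, so measure on a copy.
    va_list apCopy;
    va_copy(apCopy, ap);
    // Size 0 with a null buffer writes nothing and returns the needed length
    // (excluding the terminating NUL) on conforming implementations.
    const int length = std::vsnprintf(nullptr, 0, fmt, apCopy);
    va_end(apCopy);
    if (length < 0)
    {
        throw std::runtime_error("vsnprintf could not format the string");
    }

    // Allocate room for the text plus the terminating NUL, then format for real.
    std::string result(static_cast<std::size_t>(length) + 1, '\0');
    std::vsnprintf(&result[0], result.size(), fmt, ap);
    result.resize(static_cast<std::size_t>(length)); // drop the NUL
    return result;
}

// Hypothetical variadic wrapper around the helper above.
std::string formatStringSketch(const char* fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    std::string result;
    try
    {
        result = formatStringTwoPass(fmt, ap);
    }
    catch (...)
    {
        va_end(ap); // keep va_start/va_end paired even on failure
        throw;
    }
    va_end(ap);
    return result;
}
```

With this shape there is no fixed-size stack buffer and no retry loop, at the cost of always formatting twice. Whether that trade-off is acceptable for frequently called logging code is a separate question.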

#9 Updated by David van der Spoel 10 months ago

In fact it is a floating point error. Experience says that stack problems often show up in printf.

On another note, older versions of the LAPACK code may not be thread-safe. Could that cause issues?

https://stackoverflow.com/questions/18216314/shouldnt-lapacks-dsyevr-function-for-eigenvalues-and-eigenvectors-be-thread-s

#10 Updated by David van der Spoel 10 months ago

Replacing the old RPMs (lapack 3.2.1 and blas 3.2.1) with LAPACK and BLAS 3.8.0 compiled from source fixes the problem.

#11 Updated by Paul Bauer 10 months ago

Can this be closed then, since the issue was in the LAPACK version being used?

#12 Updated by David van der Spoel 10 months ago

  • Status changed from New to Rejected

#13 Updated by David van der Spoel 10 months ago

  • Status changed from Rejected to Closed
