Bug #2813

regressiontests/complex fails on Fedora30 with x86_64, i686 and other archs.

Added by Christoph Junghans 29 days ago. Updated 2 days ago.

Status: Accepted
Priority: Normal
Assignee: -
Category: testing
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I did a test build on version 2019-rc1 on Fedora 30 with the following result:

Thanx for Using GROMACS - Have a Nice Day
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.
Abnormal return value for ' gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
sh: line 1:  4161 Aborted                 (core dumped) gmx mdrun -nb cpu -notunepme > mdrun.out 2>&1
Abnormal return value for ' gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in urea for urea
1 out of 51 complex tests FAILED

Detailed build logs for x86_64 and i686 are attached.
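Since the attached logs are several megabytes each, filtering for the test harness's failure lines is a quick way to find the failing case. The following is a minimal sketch, assuming the failure lines have exactly the form quoted above; `failed_tests` is a hypothetical helper name, not part of the regressiontests package.

```shell
# Hypothetical helper: print the names of failed regression tests from a log on stdin.
# Assumes failure lines look like:
#   FAILED. Check mdrun.out, md.log file(s) in urea for urea
failed_tests() {
    sed -n 's/^FAILED\. Check .* in \([^ ]*\) for .*$/\1/p'
}

# Example using lines quoted in this report; prints: urea
printf '%s\n' \
    "Retrying mdrun with better settings..." \
    "FAILED. Check mdrun.out, md.log file(s) in urea for urea" \
    "1 out of 51 complex tests FAILED" | failed_tests
```

With a real log, the same helper could be fed with `failed_tests < build.log`.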

build.log (7.72 MB) Christoph Junghans, 12/20/2018 06:10 PM
build (1).log (7.72 MB) Christoph Junghans, 12/20/2018 06:11 PM
build.log.txt (7.58 MB) Build log on x86_64 with dd43a2b.diff patch in. Christoph Junghans, 01/03/2019 04:39 PM

Associated revisions

Revision 3911191a (diff)
Added by Berk Hess 15 days ago

Fix segmentation fault in DD code

mdrun could exit with a segmentation fault in DD when DLB was disabled.

Fixes #2813

Change-Id: Ie20ca1995c93fa74f41d8db5becfce1cb20348a3

History

#1 Updated by Paul Bauer 29 days ago

According to the log for x86, the error occurs in the serial build with openblas (noted here so that other people don't need to dig through it):

/usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON
-DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 
-DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_TESTING:BOOL=ON -DCMAKE_SKIP_RPATH:BOOL=ON -DCMAKE_SKIP_BUILD_RPATH:BOOL=ON -DGMX_BLAS_USER=openblas -DGMX_BUILD_UNITTESTS:BOOL=ON -DGMX_EXTERNAL_LMFIT:BOOL=ON 
-DGMX_USE_LMFIT=external -DGMX_EXTERNAL_TNG:BOOL=ON -DGMX_EXTERNAL_TINYXML2:BOOL=OFF -DGMX_LAPACK_USER=openblas -DGMX_USE_RDTSCP=OFF -DGMX_SIMD=SSE2 ' ' 
-DREGRESSIONTEST_PATH=/builddir/build/BUILD/gromacs-2019-rc1/serial/tests -DGMX_GPU:BOOL=ON -DGMX_USE_OPENCL:BOOL=ON ..

-- The C compiler identification is GNU 8.2.1
-- The CXX compiler identification is GNU 8.2.1

Does this also show up in Fedora 29 or before?

#2 Updated by Paul Bauer 28 days ago

I tried reproducing this on my machine, but it doesn't have gcc-8.2.1 and I couldn't trigger the same error.

#3 Updated by Paul Bauer 28 days ago

  • Target version changed from 2019-rc2 to 2019

No second release candidate is planned.

#4 Updated by Christoph Junghans 28 days ago

Paul Bauer wrote:

Does this also show up in Fedora 29 or before?

I didn't see it on Fedora 29, but I also didn't see it with gromacs-2018.4 on Fedora 30.

#5 Updated by Christoph Junghans 28 days ago

To reproduce this (using Fedora):

#6 Updated by Paul Bauer 28 days ago

We would still need to have access to the build host to try to reproduce this. As I said, I can't reproduce it with gcc-8.2.0-12 on Debian Unstable.

#7 Updated by Christoph Junghans 27 days ago

Paul Bauer wrote:

We would still need to have access to the build host to try to reproduce this. As I said, I can't reproduce it with gcc-8.2.0-12 on Debian Unstable.

Accessing koji directly is not possible! Maybe one can reproduce this in the fedora:rawhide docker container?

#8 Updated by Christoph Junghans 26 days ago

I was able to reproduce this in docker starting from the fedora:rawhide container:

docker run -it fedora:rawhide /bin/bash

and then:
dnf install -y rpm-build git bash-completion cmake fftw-devel gcc-c++ gsl-devel hwloc-devel libX11-devel lmfit-devel motif-devel mpich-devel ocl-icd-devel openblas-devel opencl-headers openmpi-devel tng-devel hwloc make spectool
adduser user
su - user
git clone -b v2019 https://src.fedoraproject.org/rpms/gromacs.git
cd gromacs/
spectool -g gromacs.spec
mkdir -p /home/user/rpmbuild/SOURCES
for i in *2019-rc1* *.patch *.fedora; do ln -s $PWD/$i /home/user/rpmbuild/SOURCES; done
. /etc/profile.d/modules.sh
rpmbuild -ba gromacs.spec

#9 Updated by Paul Bauer 22 days ago

  • Status changed from New to Feedback wanted

I tried building a debug version in the docker container:

/home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp: In function ‘void gmx_lmcurve(int, double*, int, const double*, const double*, const double*, double (*)(double, const double*), const lm_control_struct*, lm_status_struct*)’:
/home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp:98:39: error: cannot convert ‘lmcurve_data_struct*’ to ‘void (*)(const double*, int, const void*, double*, int*)’
     lmmin(n_par, par, m_dat, nullptr, &data, lmcurve_evaluate,
                                       ^~~~~
In file included from /home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp:53:
/usr/include/lmmin.h:33:20: note:   initializing argument 5 of ‘void lmmin(int, double*, int, const void*, void (*)(const double*, int, const void*, double*, int*), const lm_control_struct*, lm_status_struct*)’
             void (*evaluate) (
             ~~~~~~~^~~~~~~~~~~
                 const double* par, const int m_dat, const void* data,
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                 double* fvec, int* userbreak),
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#10 Updated by Paul Bauer 22 days ago

  • Status changed from Feedback wanted to Blocked, need info
  • Target version changed from 2019 to 2019.1

I think this might be related to Fedora not having the bundled lmfit?
When building directly from git everything passes, so the error gets introduced in the process of preparing the Fedora version.
I bumped this to 2019.1, because whatever the issue is I don't think it will get fixed before the release.

#11 Updated by Christoph Junghans 21 days ago

I did a test build with the external lmfit here: https://koji.fedoraproject.org/koji/taskinfo?taskID=31709353, let's see what happens.

#12 Updated by Christoph Junghans 18 days ago

  • Status changed from Blocked, need info to Accepted

OK, the problem persists even with the internal lmfit.

#13 Updated by Paul Bauer 17 days ago

  • Status changed from Accepted to Blocked, need info

I just tried it again in the docker container. Building straight from gromacs.git with the internal lmfit works perfectly fine.

docker run -it fedora:rawhide /bin/bash

dnf install -y rpm-build git bash-completion cmake fftw-devel gcc-c++ gsl-devel hwloc-devel libX11-devel lmfit-devel motif-devel mpich-devel ocl-icd-devel openblas-devel opencl-headers openmpi-devel tng-devel hwloc make spectool
adduser user
su - user
git clone https://github.com/gromacs/gromacs.git -b release-2019
git clone https://github.com/gromacs/regressiontests.git -b release-2019
cd gromacs
mkdir build
cd build
/usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_TESTING:BOOL=ON -DCMAKE_SKIP_RPATH:BOOL=ON -DCMAKE_SKIP_BUILD_RPATH:BOOL=ON -DGMX_BLAS_USER=openblas -DGMX_BUILD_UNITTESTS:BOOL=ON \
-DGMX_EXTERNAL_TNG:BOOL=ON -DGMX_EXTERNAL_TINYXML2:BOOL=OFF -DGMX_LAPACK_USER=openblas -DGMX_USE_RDTSCP=OFF -DGMX_SIMD=SSE2 ' ' \
-DREGRESSIONTEST_PATH=/home/user/regressiontests -DGMX_GPU:BOOL=ON -DGMX_USE_OPENCL:BOOL=ON ..
make gmx && make tests
exit
ln -s /home/user/gromacs/build/lib/libgromacs.so.4 /lib64
su - user
cd gromacs/build
make check

What are the changes applied to the code in order to prepare the Fedora package?

#14 Updated by Mark Abraham 17 days ago

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

#15 Updated by Christoph Junghans 16 days ago

Mark Abraham wrote:

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

I think we had that discussion before: Fedora has a policy against bundled libraries (https://fedoraproject.org/wiki/Bundled_Libraries?rd=Packaging:Bundled_Libraries). The main problem is that libgromacs provides the same lmfit symbols as liblmfit from the lmfit package itself, which could lead to unwanted effects if an executable links both.

Also, I am not sure what you mean by reasonable packaging: lmfit uses autotools and doesn't use anything fancy like MPI, hence it has the simplest packaging process ever: https://src.fedoraproject.org/rpms/lmfit/blob/master/f/lmfit.spec.

Paul's build issue with lmfit above, which doesn't show up in my original build ('Found lmfit: /usr/include (found version "6.4")'), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

#16 Updated by Paul Bauer 16 days ago

GROMACS depends on lmfit 7.0, so it can't work with the lower versions (see cmake/gmxManageLmfit.cmake). So the bug might be that it didn't pick up the requirements for lmfit correctly.
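The requirement can be illustrated with a small version comparison. This is only a sketch and not the logic in cmake/gmxManageLmfit.cmake; `version_at_least` is a hypothetical helper, and it assumes GNU `sort` with the `-V` (version sort) option is available.

```shell
# Hypothetical helper: succeed if version $1 is at least the required version $2.
# Relies on GNU sort's -V (version sort): the smaller version sorts first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# The original build reported 'found version "6.4"', while 7.0 is required
if version_at_least 6.4 7.0; then
    echo "lmfit is new enough"
else
    echo "lmfit is too old"
fi
```

A configure step that applied a check of this kind would have rejected lmfit 6.4 instead of building against it.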

#17 Updated by Christoph Junghans 16 days ago

Paul Bauer wrote:

What are the changes applied to the code in order to prepare the Fedora package?

In the spec (see https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs.spec#_234) we do:

%patch0 -p1
%if 0%{?fedora} <= 29
%patch1 -p1
%endif
rm -r src/external/{fftpack,tng_io,lmfit}

Patch0 changes the path of dssp: https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs-dssp-path.patch
Patch1, which only gets applied for Fedora 29 and below (so not in the original build posted here): https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs-issue-2366.patch, disables a piece of a test on aarch64 only. See #2366, which seems to have been an issue in earlier versions of hwloc.

#18 Updated by Christoph Junghans 16 days ago

At the end of the day, I could simply disable the regressiontests in the rpm build, but I don't want to deploy a possibly broken binary to all of Fedora's user base.

#19 Updated by Christoph Junghans 16 days ago

Paul Bauer wrote:

GROMACS depends on lmfit 7.0, so it can't work with the lower versions (see cmake/gmxManageLmfit.cmake). So the bug might be that it didn't pick up the requirements for lmfit correctly.

It built with lmfit-6.4; I am not sure if this was intended. In any case, this should be discussed in a different issue, as the test still fails with the internal lmfit library as well.

#20 Updated by Paul Bauer 16 days ago

I saw that the build fails in debug mode for the packaged version as well, likely related to the wrong version of lmfit.
I have now built the packaged version with the modifications to the gromacs.spec file needed to turn off external lmfit.
I managed to get a backtrace for the bug:

==7680== Process terminating with default action of signal 6 (SIGABRT): dumping core
==7680==    at 0x5B7E00F: raise (in /usr/lib64/libc-2.28.9000.so)
==7680==    by 0x5B68894: abort (in /usr/lib64/libc-2.28.9000.so)
==7680==    by 0x49F1F17: std::__replacement_assert(char const*, int, char const*, char const*) (c++config.h:2391)
==7680==    by 0x4A4515E: operator[] (stl_vector.h:932)
==7680==    by 0x4A4515E: get_load_distribution(gmx_domdec_t*, gmx_wallcycle*) (partition.cpp:882)
==7680==    by 0x4A4DBD5: dd_partition_system(_IO_FILE*, gmx::MDLogger const&, long, t_commrec const*, bool, int, t_state*, gmx_mtop_t const*, t_inputrec const*, t_state*, gmx::PaddedVector<gmx::BasicVector<float>, gmx::Allocator<gmx::BasicVector<float>, gmx::AlignedAllocationPolicy> >*, gmx::MDAtoms*, gmx_localtop_t*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*, bool) (partition.cpp:3104)
==7680==    by 0x52A512B: em_dd_partition_system(_IO_FILE*, gmx::MDLogger const&, int, t_commrec const*, gmx_mtop_t*, t_inputrec*, em_state_t*, gmx_localtop_t*, gmx::MDAtoms*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*) (minimize.cpp:738)
==7680==    by 0x52B52D6: gmx::Integrator::do_steep() (minimize.cpp:2610)
==7680==    by 0x529D814: gmx::Integrator::run(unsigned int, bool) (integrator.cpp:86)
==7680==    by 0x52CA708: gmx::Mdrunner::mdrunner() (runner.cpp:1434)
==7680==    by 0x52CC307: gmx::mdrunner_start_fn(void const*) (runner.cpp:219)
==7680==    by 0x5334985: tMPI_Thread_starter(void*) (tmpi_init.cpp:399)
==7680==    by 0x82F2582: start_thread (in /usr/lib64/libpthread-2.28.9000.so)
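
The `std::__replacement_assert` frame directly under `operator[]` suggests that the abort comes from a libstdc++ bounds assertion, which is only compiled in when `_GLIBCXX_ASSERTIONS` (or the libstdc++ debug mode) is defined. That could explain why builds with other distributions' default flags did not reproduce the failure, although this is an inference and not something stated in the thread. A minimal sketch for checking a flag string follows; `has_glibcxx_assertions` is a hypothetical helper, and the example flag string is illustrative rather than copied from the build log.

```shell
# Hypothetical helper: succeed if a compiler-flag string enables libstdc++ assertions.
has_glibcxx_assertions() {
    case " $1 " in
        *-D_GLIBCXX_ASSERTIONS*) return 0 ;;
        *) return 1 ;;
    esac
}

# On an RPM-based system the distribution's default flags can be printed with:
#   rpm --eval '%{optflags}'
# Illustrative flag string (an assumption, not taken from the attached logs):
flags="-O2 -g -Wp,-D_GLIBCXX_ASSERTIONS"
if has_glibcxx_assertions "$flags"; then
    echo "libstdc++ assertions enabled"
fi
```

If the packaged build does carry this define, adding `-D_GLIBCXX_ASSERTIONS` to the C++ flags on another distribution might be enough to trigger the same abort there.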

#21 Updated by Paul Bauer 16 days ago

  • Status changed from Blocked, need info to Accepted

#22 Updated by Mark Abraham 16 days ago

Christoph Junghans wrote:

Mark Abraham wrote:

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

I think we had that discussion before: Fedora has a policy against bundled libraries (https://fedoraproject.org/wiki/Bundled_Libraries?rd=Packaging:Bundled_Libraries). The main problem is that libgromacs provides the same lmfit symbols as liblmfit from the lmfit package itself, which could lead to unwanted effects if an executable links both.

Sure. lmfit is not very easy to write a cmake find_package for, because there's no explicit way to discover the version (and the only implicit way is to see if test or real code compiles and runs).

Also, I am not sure what you mean by reasonable packaging: lmfit uses autotools and doesn't use anything fancy like MPI, hence it has the simplest packaging process ever: https://src.fedoraproject.org/rpms/lmfit/blob/master/f/lmfit.spec.

See f3410c301f5af. They made breaking API changes and have provided no way for anyone to discover the version of the code in liblmfit.so or its headers. So we can't spend the effort to support more than one version of it, and have to bundle that one, and provide for distros to be able to provide their own version of it.

Paul's build issue with lmfit above, which doesn't show up in my original build ('Found lmfit: /usr/include (found version "6.4")'), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

"GROMACS 2019 requires lmfit 7" is one of our boundary conditions :-)

#23 Updated by Mark Abraham 16 days ago

  • Related to Bug #2584: regressiontests/complex fails on i686 added

#24 Updated by Christoph Junghans 16 days ago

Mark Abraham wrote:

Christoph Junghans wrote:

Paul's build issue with lmfit above, which doesn't show up in my original build ('Found lmfit: /usr/include (found version "6.4")'), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

"GROMACS 2019 requires lmfit 7" is one of our boundary conditions :-)

The problem of lmfit-6.4 not giving an error is addressed here: https://gerrit.gromacs.org/#/c/8916/

#25 Updated by Gerrit Code Review Bot 15 days ago

Gerrit received a related patchset '1' for Issue #2813.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~Ie20ca1995c93fa74f41d8db5becfce1cb20348a3
Gerrit URL: https://gerrit.gromacs.org/8917

#26 Updated by Berk Hess 15 days ago

  • Status changed from Accepted to Fix uploaded
  • Assignee changed from Paul Bauer to Berk Hess

#27 Updated by Szilárd Páll 15 days ago

  • Related to deleted (Bug #2584: regressiontests/complex fails on i686)

#28 Updated by Szilárd Páll 15 days ago

Not related to #2584, that has been shown to be a 32-bit only issue in AWH.

#29 Updated by Christoph Junghans 15 days ago

I patched https://gerrit.gromacs.org/8917 in, but it didn't help, build log attached.

#30 Updated by Berk Hess 14 days ago

  • Status changed from Fix uploaded to Resolved

#31 Updated by Christoph Junghans 14 days ago

  • Status changed from Resolved to Accepted

Problem persists.

#32 Updated by Mark Abraham 14 days ago

I can't reproduce this with SSE2+OpenCL build with gcc 8.2 on ubuntu 18.02

#33 Updated by Christoph Junghans 14 days ago

Mark Abraham wrote:

I can't reproduce this with SSE2+OpenCL build with gcc 8.2 on ubuntu 18.02

The gcc version basically doesn't mean anything as Ubuntu and Fedora both patch their gcc heavily.

#34 Updated by Mark Abraham 7 days ago

  • Assignee deleted (Berk Hess)

#35 Updated by Mark Abraham 4 days ago

We might come back to this in a week or two, no assignee for now

#36 Updated by Mark Abraham 2 days ago

In master, we just converted this test to run with the Verlet scheme. It used no constraints and steepest descent. We observed TSAN finding a race with steep+Verlet, and the test wasn't stable when converted to leapfrog+Verlet, so we switched it to no temperature coupling and a small timestep, and now it seems stable.

So my guess is that if we run urea on x86 with a TSAN build, we will find a race.
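
A configure line for such a build might look like the sketch below. It assumes that the GROMACS build system accepts `CMAKE_BUILD_TYPE=TSAN`; the helper name `tsan_configure_cmd` is hypothetical, and it only prints the command instead of running it.

```shell
# Hypothetical helper: print (not run) a configure line for a ThreadSanitizer build.
# Assumption: the GROMACS build system accepts CMAKE_BUILD_TYPE=TSAN.
tsan_configure_cmd() {
    printf 'cmake -DCMAKE_BUILD_TYPE=TSAN -DGMX_SIMD=SSE2 -DREGRESSIONTEST_PATH=%s ..\n' "$1"
}

# Regressiontests path taken from the reproduction steps earlier in this thread
tsan_configure_cmd /home/user/regressiontests
```

After configuring and building, `make check` would run the regression tests, including urea, under the sanitizer.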
