Bug #2813

regressiontests/complex fails on Fedora30 with x86_64, i686 and other archs.

Added by Christoph Junghans 12 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: testing
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

I did a test build on version 2019-rc1 on Fedora 30 with the following result:

Thanx for Using GROMACS - Have a Nice Day
Mdrun cannot use the requested (or automatic) number of ranks, retrying with 8.
Abnormal return value for ' gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was 1
Retrying mdrun with better settings...
sh: line 1:  4161 Aborted                 (core dumped) gmx mdrun -nb cpu -notunepme > mdrun.out 2>&1
Abnormal return value for ' gmx mdrun    -nb cpu   -notunepme >mdrun.out 2>&1' was -1
FAILED. Check mdrun.out, md.log file(s) in urea for urea
1 out of 51 complex tests FAILED

Detailed build logs for x86 and i686 are attached.

build.log (7.72 MB) Christoph Junghans, 12/20/2018 06:10 PM
build (1).log (7.72 MB) Christoph Junghans, 12/20/2018 06:11 PM
build.log.txt (7.58 MB) Build log on x86_64 with dd43a2b.diff patch in. Christoph Junghans, 01/03/2019 04:39 PM
build.log.txt (7.76 MB) Christoph Junghans, 02/06/2019 09:21 PM

Associated revisions

Revision 3911191a (diff)
Added by Berk Hess 11 months ago

Fix segmentation fault in DD code

mdrun could exit with a segmentation fault in DD when DLB was disabled.

Fixes #2813

Change-Id: Ie20ca1995c93fa74f41d8db5becfce1cb20348a3

Revision c5c3743d (diff)
Added by Berk Hess 10 months ago

Fix segfault with EM, DD and group scheme

Resetting to an old DD state during EM would leave the cg sorting array
used with the group scheme in an invalid state. This could cause
out of bounds vector access one DD step after rejecting an EM step.

When merging this into master branch, prefer to change to use ssize
rather than the static_cast.

Fixes #2813

Change-Id: I7f13b46d7ff5352ce41838b813c46f2e90c93b1c
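
The kind of index-versus-size comparison this commit message alludes to can be sketched as follows. This is a minimal illustration and not the GROMACS code: signedSize, indexIsInRange and elementAt are hypothetical names, and signedSize merely stands in for the ssize helper mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a signed-size helper (the "ssize" named in the
// commit message): returns the container size as a signed integer.
template<typename Container>
std::ptrdiff_t signedSize(const Container& c)
{
    return static_cast<std::ptrdiff_t>(c.size());
}

// Returns true only if index can safely be passed to buffer[index].
// Comparing against a signed size avoids silently converting a negative
// index into a very large unsigned value.
bool indexIsInRange(const std::vector<int>& buffer, int index)
{
    return index >= 0 && index < signedSize(buffer);
}

// Hypothetical accessor: checks the index before using the unchecked
// operator[]. Note that assert() is compiled out when NDEBUG is defined.
int elementAt(const std::vector<int>& buffer, int index)
{
    assert(indexIsInRange(buffer, index));
    return buffer[static_cast<std::size_t>(index)];
}
```

With a signed size, a stale index left over from an earlier state compares as out of range instead of wrapping around.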

History

#1 Updated by Paul Bauer 12 months ago

According to the log for x86, the error occurs in the serial build with openblas (noted here so that other people don't need to dig through it):

/usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON
-DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 
-DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_TESTING:BOOL=ON -DCMAKE_SKIP_RPATH:BOOL=ON -DCMAKE_SKIP_BUILD_RPATH:BOOL=ON -DGMX_BLAS_USER=openblas -DGMX_BUILD_UNITTESTS:BOOL=ON -DGMX_EXTERNAL_LMFIT:BOOL=ON 
-DGMX_USE_LMFIT=external -DGMX_EXTERNAL_TNG:BOOL=ON -DGMX_EXTERNAL_TINYXML2:BOOL=OFF -DGMX_LAPACK_USER=openblas -DGMX_USE_RDTSCP=OFF -DGMX_SIMD=SSE2 ' ' 
-DREGRESSIONTEST_PATH=/builddir/build/BUILD/gromacs-2019-rc1/serial/tests -DGMX_GPU:BOOL=ON -DGMX_USE_OPENCL:BOOL=ON ..

-- The C compiler identification is GNU 8.2.1
-- The CXX compiler identification is GNU 8.2.1

Does this also show up in Fedora 29 or before?

#2 Updated by Paul Bauer 12 months ago

I tried reproducing this on my machine, but it doesn't have gcc-8.2.1 and I couldn't trigger the same error.

#3 Updated by Paul Bauer 12 months ago

  • Target version changed from 2019-rc2 to 2019

No second release candidate is planned.

#4 Updated by Christoph Junghans 12 months ago

Paul Bauer wrote:

Does this also show up in Fedora 29 or before?

I didn't see it on Fedora 29, but I also didn't see it with gromacs-2018.4 on Fedora 30.

#5 Updated by Christoph Junghans 12 months ago

To reproduce this (using Fedora):

#6 Updated by Paul Bauer 12 months ago

We would still need to have access to the build host to try to reproduce this. As I said, I can't reproduce it with gcc-8.2.0-12 on Debian Unstable.

#7 Updated by Christoph Junghans 12 months ago

Paul Bauer wrote:

We would still need to have access to the build host to try to reproduce this. As I said, I can't reproduce it with gcc-8.2.0-12 on Debian Unstable.

Accessing koji directly is not possible! Maybe one can reproduce this in the fedora:rawhide docker container?

#8 Updated by Christoph Junghans 12 months ago

I was able to reproduce this in docker starting from the fedora:rawhide container:

docker run -it fedora:rawhide /bin/bash

and then:
dnf install -y rpm-build git bash-completion cmake fftw-devel gcc-c++ gsl-devel hwloc-devel libX11-devel lmfit-devel motif-devel mpich-devel ocl-icd-devel openblas-devel opencl-headers openmpi-devel tng-devel hwloc make spectool
adduser user
su - user
git clone -b v2019 https://src.fedoraproject.org/rpms/gromacs.git
cd gromacs/
spectool -g gromacs.spec
mkdir -p /home/user/rpmbuild/SOURCES
for i in *2019-rc1* *.patch *.fedora; do ln -s $PWD/$i /home/user/rpmbuild/SOURCES; done
. /etc/profile.d/modules.sh
rpmbuild -ba gromacs.spec

#9 Updated by Paul Bauer 12 months ago

  • Status changed from New to Feedback wanted

I tried building a debug version in the docker container:

/home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp: In function ‘void gmx_lmcurve(int, double*, int, const double*, const double*, const double*, double (*)(double, const double*), const lm_control_struct*, lm_status_struct*)’:
/home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp:98:39: error: cannot convert ‘lmcurve_data_struct*’ to ‘void (*)(const double*, int, const void*, double*, int*)’
     lmmin(n_par, par, m_dat, nullptr, &data, lmcurve_evaluate,
                                       ^~~~~
In file included from /home/user/rpmbuild/BUILD/gromacs-2019-rc1/src/gromacs/correlationfunctions/gmx_lmcurve.cpp:53:
/usr/include/lmmin.h:33:20: note:   initializing argument 5 of ‘void lmmin(int, double*, int, const void*, void (*)(const double*, int, const void*, double*, int*), const lm_control_struct*, lm_status_struct*)’
             void (*evaluate) (
             ~~~~~~~^~~~~~~~~~~
                 const double* par, const int m_dat, const void* data,
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                 double* fvec, int* userbreak),
                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#10 Updated by Paul Bauer 12 months ago

  • Status changed from Feedback wanted to Blocked, need info
  • Target version changed from 2019 to 2019.1

I think this might be related to Fedora not having the bundled lmfit?
When building directly from git everything passes, so the error gets introduced in the process of preparing the Fedora version.
I bumped this to 2019.1, because whatever the issue is I don't think it will get fixed before the release.

#11 Updated by Christoph Junghans 12 months ago

I did a test build with the external lmfit here: https://koji.fedoraproject.org/koji/taskinfo?taskID=31709353, let's see what happens.

#12 Updated by Christoph Junghans 11 months ago

  • Status changed from Blocked, need info to Accepted

OK, the problem persists even with the internal lmfit.

#13 Updated by Paul Bauer 11 months ago

  • Status changed from Accepted to Blocked, need info

I just tried it again in the docker container. Building straight from gromacs.git with the internal lmfit works perfectly fine.

docker run -it fedora:rawhide /bin/bash

dnf install -y rpm-build git bash-completion cmake fftw-devel gcc-c++ gsl-devel hwloc-devel libX11-devel lmfit-devel motif-devel mpich-devel ocl-icd-devel openblas-devel opencl-headers openmpi-devel tng-devel hwloc make spectool
adduser user
su - user
git clone https://github.com/gromacs/gromacs.git -b release-2019
git clone https://github.com/gromacs/regressiontests.git -b release-2019
cd gromacs
mkdir build
cd build
/usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON \
-DBUILD_SHARED_LIBS:BOOL=ON -DBUILD_TESTING:BOOL=ON -DCMAKE_SKIP_RPATH:BOOL=ON -DCMAKE_SKIP_BUILD_RPATH:BOOL=ON -DGMX_BLAS_USER=openblas -DGMX_BUILD_UNITTESTS:BOOL=ON \
-DGMX_EXTERNAL_TNG:BOOL=ON -DGMX_EXTERNAL_TINYXML2:BOOL=OFF -DGMX_LAPACK_USER=openblas -DGMX_USE_RDTSCP=OFF -DGMX_SIMD=SSE2 ' ' \
-DREGRESSIONTEST_PATH=/home/user/regressiontests -DGMX_GPU:BOOL=ON -DGMX_USE_OPENCL:BOOL=ON ..
make gmx && make tests
exit
ln -s /home/user/gromacs/build/lib/libgromacs.so.4 /lib64
su - user
cd gromacs/build
make check

What are the changes applied to the code in order to prepare the Fedora package?

#14 Updated by Mark Abraham 11 months ago

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

#15 Updated by Christoph Junghans 11 months ago

Mark Abraham wrote:

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

I think we had that discussion before: Fedora has a policy against bundled libraries (https://fedoraproject.org/wiki/Bundled_Libraries?rd=Packaging:Bundled_Libraries). The main problem is that libgromacs provides the same lmfit symbols as liblmfit from the lmfit package itself, which could lead to unwanted effects if an executable links both.

Also, I am not sure what you mean by reasonable packaging: lmfit uses autotools and doesn't use anything fancy like MPI, so it has about the simplest packaging process there is: https://src.fedoraproject.org/rpms/lmfit/blob/master/f/lmfit.spec.

Paul's build issue with lmfit above, which doesn't show up in my original build ("Found lmfit: /usr/include (found version "6.4")"), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

#16 Updated by Paul Bauer 11 months ago

GROMACS depends on lmfit 7.0, so it can't work with the lower versions (see cmake/gmxManageLmfit.cmake). So the bug might be that it didn't pick up the requirements for lmfit correctly.

#17 Updated by Christoph Junghans 11 months ago

Paul Bauer wrote:

What are the changes applied to the code in order to prepare the Fedora package?

In the spec file (see https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs.spec#_234) we do:

%patch0 -p1
%if 0%{?fedora} <= 29
%patch1 -p1
%endif
rm -r src/external/{fftpack,tng_io,lmfit}

Patch0 changes the path of dssp: https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs-dssp-path.patch
Patch1, which only gets applied for Fedora 29 and below (so not in the original build posted here): https://src.fedoraproject.org/rpms/gromacs/blob/v2019/f/gromacs-issue-2366.patch, disables a piece of a test on aarch64 only. See #2366, which seems to have been an issue in earlier versions of hwloc.

#18 Updated by Christoph Junghans 11 months ago

At the end of the day, I could simply disable the regressiontests in the rpm build, but I don't want to deploy a possibly broken binary to all of Fedora's user base.

#19 Updated by Christoph Junghans 11 months ago

Paul Bauer wrote:

GROMACS depends on lmfit 7.0, so it can't work with the lower versions (see cmake/gmxManageLmfit.cmake). So the bug might be that it didn't pick up the requirements for lmfit correctly.

It built with lmfit-6.4; I am not sure if this was intended. Anyhow, this should be discussed in a different issue, as the tests still fail with the internal lmfit library as well.

#20 Updated by Paul Bauer 11 months ago

I saw that the build fails in debug mode for the packaged version as well, likely related to the wrong version of lmfit.
I have just built the packaged version with the modifications to the gromacs.spec file needed to turn off external lmfit.
I managed to get a backtrace for the bug:

==7680== Process terminating with default action of signal 6 (SIGABRT): dumping core
==7680==    at 0x5B7E00F: raise (in /usr/lib64/libc-2.28.9000.so)
==7680==    by 0x5B68894: abort (in /usr/lib64/libc-2.28.9000.so)
==7680==    by 0x49F1F17: std::__replacement_assert(char const*, int, char const*, char const*) (c++config.h:2391)
==7680==    by 0x4A4515E: operator[] (stl_vector.h:932)
==7680==    by 0x4A4515E: get_load_distribution(gmx_domdec_t*, gmx_wallcycle*) (partition.cpp:882)
==7680==    by 0x4A4DBD5: dd_partition_system(_IO_FILE*, gmx::MDLogger const&, long, t_commrec const*, bool, int, t_state*, gmx_mtop_t const*, t_inputrec const*, t_state*, gmx::PaddedVector<gmx::BasicVector<float>, gmx::Allocator<gmx::BasicVector<float>, gmx::AlignedAllocationPolicy> >*, gmx::MDAtoms*, gmx_localtop_t*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*, bool) (partition.cpp:3104)
==7680==    by 0x52A512B: em_dd_partition_system(_IO_FILE*, gmx::MDLogger const&, int, t_commrec const*, gmx_mtop_t*, t_inputrec*, em_state_t*, gmx_localtop_t*, gmx::MDAtoms*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*) (minimize.cpp:738)
==7680==    by 0x52B52D6: gmx::Integrator::do_steep() (minimize.cpp:2610)
==7680==    by 0x529D814: gmx::Integrator::run(unsigned int, bool) (integrator.cpp:86)
==7680==    by 0x52CA708: gmx::Mdrunner::mdrunner() (runner.cpp:1434)
==7680==    by 0x52CC307: gmx::mdrunner_start_fn(void const*) (runner.cpp:219)
==7680==    by 0x5334985: tMPI_Thread_starter(void*) (tmpi_init.cpp:399)
==7680==    by 0x82F2582: start_thread (in /usr/lib64/libpthread-2.28.9000.so)

#21 Updated by Paul Bauer 11 months ago

  • Status changed from Blocked, need info to Accepted

#22 Updated by Mark Abraham 11 months ago

Christoph Junghans wrote:

Mark Abraham wrote:

Note that the lmfit package did not have reasonable packaging or versioning last time I looked at it, so it may not be reasonable to try to support it in a distro.

I think we had that discussion before: Fedora has a policy against bundled libraries (https://fedoraproject.org/wiki/Bundled_Libraries?rd=Packaging:Bundled_Libraries). The main problem is that libgromacs provides the same lmfit symbols as liblmfit from the lmfit package itself, which could lead to unwanted effects if an executable links both.

Sure. lmfit is not very easy to write a cmake find_package for, because there's no explicit way to discover the version (and the only implicit way is to see if test or real code compiles and runs).

Also, I am not sure what you mean by reasonable packaging: lmfit uses autotools and doesn't use anything fancy like MPI, so it has about the simplest packaging process there is: https://src.fedoraproject.org/rpms/lmfit/blob/master/f/lmfit.spec.

See f3410c301f5af. They made breaking API changes and have provided no way for anyone to discover the version of the code in liblmfit.so or its headers. So we can't spend the effort to support more than one version of it, and have to bundle that one, and provide for distros to be able to provide their own version of it.

Paul's build issue with lmfit above, which doesn't show up in my original build ("Found lmfit: /usr/include (found version "6.4")"), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

"GROMACS 2019 requires lmfit 7" is one of our boundary conditions :-)

#23 Updated by Mark Abraham 11 months ago

  • Related to Bug #2584: regressiontests/complex fails on i686 added

#24 Updated by Christoph Junghans 11 months ago

Mark Abraham wrote:

Christoph Junghans wrote:

Paul's build issue with lmfit above, which doesn't show up in my original build ("Found lmfit: /usr/include (found version "6.4")"), is still unclear to me, but an lmfit-7 update is on the way, too: https://src.fedoraproject.org/rpms/lmfit/pull-request/2

"GROMACS 2019 requires lmfit 7" is one of our boundary conditions :-)

The problem of lmfit-6.4 not giving an error is being addressed here: https://gerrit.gromacs.org/#/c/8916/

#25 Updated by Gerrit Code Review Bot 11 months ago

Gerrit received a related patchset '1' for Issue #2813.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~Ie20ca1995c93fa74f41d8db5becfce1cb20348a3
Gerrit URL: https://gerrit.gromacs.org/8917

#26 Updated by Berk Hess 11 months ago

  • Status changed from Accepted to Fix uploaded
  • Assignee changed from Paul Bauer to Berk Hess

#27 Updated by Szilárd Páll 11 months ago

  • Related to deleted (Bug #2584: regressiontests/complex fails on i686)

#28 Updated by Szilárd Páll 11 months ago

Not related to #2584, that has been shown to be a 32-bit only issue in AWH.

#29 Updated by Christoph Junghans 11 months ago

I patched https://gerrit.gromacs.org/8917 in, but it didn't help, build log attached.

#30 Updated by Berk Hess 11 months ago

  • Status changed from Fix uploaded to Resolved

#31 Updated by Christoph Junghans 11 months ago

  • Status changed from Resolved to Accepted

Problem persists.

#32 Updated by Mark Abraham 11 months ago

I can't reproduce this with SSE2+OpenCL build with gcc 8.2 on ubuntu 18.02

#33 Updated by Christoph Junghans 11 months ago

Mark Abraham wrote:

I can't reproduce this with SSE2+OpenCL build with gcc 8.2 on ubuntu 18.02

The gcc version basically doesn't mean anything as Ubuntu and Fedora both patch their gcc heavily.

#34 Updated by Mark Abraham 11 months ago

  • Assignee deleted (Berk Hess)

#35 Updated by Mark Abraham 11 months ago

We might come back to this in a week or two, no assignee for now

#36 Updated by Mark Abraham 11 months ago

In master, we just converted this test to run with the Verlet scheme. It used no constraints and steepest descent. We observed TSAN find a race with steep+Verlet, and it wasn't stable when converted to leapfrog+Verlet, so we switched it to no temperature coupling and a small timestep, and now it seems stable.

So my guess is that if we run urea on x86 with a TSAN build, then we will find a race.

#37 Updated by Paul Bauer 10 months ago

  • Related to Bug #2858: Group scheme C kernels fail in complext tests added

#38 Updated by Paul Bauer 10 months ago

So, I just tried reproducing this again in the docker container.
The error again did not show up when building GROMACS directly from source, but I got this error while running the rpm build script:

/home/user/rpmbuild/BUILD/gromacs-2019/src/gromacs/gpu_utils/gpu_utils_ocl.cpp:279:107: error: format not a string literal and no format arguments [-Werror=format-security]
  279 |         gmx_warning((formatString("While sanity checking device #%zu, ", deviceId) + errorMessage).c_str());
      |                                                                                                           ^
cc1plus: some warnings being treated as errors

It does not show up in a different build under Debian either, so I'm not sure where this one comes from now.
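
For context, GCC's -Werror=format-security rejects a call that passes a run-time string as the format argument with no further arguments, which is what the diagnostic above reports. A common remedy is to route the text through a literal "%s" format. The following is a minimal sketch of that pattern only: formatWarning and warnAboutDevice are hypothetical stand-ins (not gmx_warning), and this thread does not show whether the actual fix took this form.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Let GCC and Clang check calls the same way they check printf.
#if defined(__GNUC__) || defined(__clang__)
#    define PRINTF_LIKE __attribute__((format(printf, 1, 2)))
#else
#    define PRINTF_LIKE
#endif

// Hypothetical printf-style helper that returns the formatted text.
std::string formatWarning(const char* fmt, ...) PRINTF_LIKE;

std::string formatWarning(const char* fmt, ...)
{
    assert(fmt != nullptr);
    char    buffer[256]; // longer messages are truncated in this sketch
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buffer, sizeof(buffer), fmt, args);
    va_end(args);
    return std::string(buffer);
}

// Hypothetical caller holding a message that was assembled at run time.
std::string warnAboutDevice(const std::string& errorMessage)
{
    // Flagged by -Wformat-security: formatWarning(errorMessage.c_str());
    // Passing the message as data keeps any '%' in it from being
    // interpreted as a conversion specifier.
    return formatWarning("%s", errorMessage.c_str());
}
```

Because the message is passed as an argument and not as the format, a stray percent sign in errorMessage is printed verbatim.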

#39 Updated by Szilárd Páll 10 months ago

That's fixed, see dabd3b9d

#40 Updated by Paul Bauer 10 months ago

Thanks for pointing it out!

#41 Updated by Paul Bauer 10 months ago

So, after patching the issue above, I now get this error message when running gmx mdrun on its own:

/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = DDCellsizesWithDlb; _Alloc = std::allocator<DDCellsizesWithDlb>; std::vector<_Tp, _Alloc>::reference = DDCellsizesWithDlb&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = DDCellsizesWithDlb; _Alloc = std::allocator<DDCellsizesWithDlb>; std::vector<_Tp, _Alloc>::reference = DDCellsizesWithDlb&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = DDCellsizesWithDlb; _Alloc = std::allocator<DDCellsizesWithDlb>; std::vector<_Tp, _Alloc>::reference = DDCellsizesWithDlb&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = DDCellsizesWithDlb; _Alloc = std::allocator<DDCellsizesWithDlb>; std::vector<_Tp, _Alloc>::reference = DDCellsizesWithDlb&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.

And this when running under valgrind:
==7689== 
==7689== Process terminating with default action of signal 6 (SIGABRT): dumping core
==7689==    at 0x5E100F5: raise (in /usr/lib64/libc-2.29.so)
==7689==    by 0x5DFA95D: abort (in /usr/lib64/libc-2.29.so)
==7689==    by 0x4A116E7: std::__replacement_assert(char const*, int, char const*, char const*) (c++config.h:2493)
==7689==    by 0x4A6298E: operator[] (stl_vector.h:1009)
==7689==    by 0x4A6298E: get_load_distribution(gmx_domdec_t*, gmx_wallcycle*) (partition.cpp:882)
==7689==    by 0x4A6B719: dd_partition_system(_IO_FILE*, gmx::MDLogger const&, long, t_commrec const*, bool, int, t_state*, gmx_mtop_t const*, t_inputrec const*, t_state*, gmx::PaddedVector<gmx::BasicVector<float>, gmx::Allocator<gmx::BasicVector<float>, gmx::AlignedAllocationPolicy> >*, gmx::MDAtoms*, gmx_localtop_t*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*, bool) (partition.cpp:3104)
==7689==    by 0x527DF5B: em_dd_partition_system(_IO_FILE*, gmx::MDLogger const&, int, t_commrec const*, gmx_mtop_t*, t_inputrec*, em_state_t*, gmx_localtop_t*, gmx::MDAtoms*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*) (minimize.cpp:738)
==7689==    by 0x528DD5E: gmx::Integrator::do_steep() (minimize.cpp:2610)
==7689==    by 0x5276E64: gmx::Integrator::run(unsigned int, bool) (integrator.cpp:86)
==7689==    by 0x52A1C6B: gmx::Mdrunner::mdrunner() (runner.cpp:1438)
==7689==    by 0x52A3207: gmx::mdrunner_start_fn(void const*) (runner.cpp:219)
==7689==    by 0x5309595: tMPI_Thread_starter(void*) (tmpi_init.cpp:399)
==7689==    by 0x85755A1: start_thread (in /usr/lib64/libpthread-2.29.so)
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = DDCellsizesWithDlb; _Alloc = std::allocator<DDCellsizesWithDlb>; std::vector<_Tp, _Alloc>::reference = DDCellsizesWithDlb&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.

#42 Updated by Paul Bauer 10 months ago

My fault: I had not applied the other patch yet. Please ignore my last message.

#43 Updated by Paul Bauer 10 months ago

  • Related to deleted (Bug #2858: Group scheme C kernels fail in complext tests)

#44 Updated by Paul Bauer 10 months ago

So, now I managed to trigger it again:

rt_t>; std::vector<_Tp, _Alloc>::reference = gmx_cgsort_t&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = gmx_cgsort_t; _Alloc = std::allocator<gmx_cgsort_t>; std::vector<_Tp, _Alloc>::reference = gmx_cgsort_t&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
==12805== 
==12805== Process terminating with default action of signal 6 (SIGABRT): dumping core
==12805==    at 0x5E100F5: raise (in /usr/lib64/libc-2.29.so)
==12805==    by 0x5DFA95D: abort (in /usr/lib64/libc-2.29.so)
==12805==    by 0x4A116E7: std::__replacement_assert(char const*, int, char const*, char const*) (c++config.h:2493)
==12805==    by 0x4A6B1A1: operator[] (stl_vector.h:1009)
==12805==    by 0x4A6B1A1: dd_sort_order (partition.cpp:2735)
==12805==    by 0x4A6B1A1: dd_sort_state(gmx_domdec_t*, float (*) [3], t_forcerec*, t_state*, int) (partition.cpp:2819)
==12805==    by 0x4A6CD97: dd_partition_system(_IO_FILE*, gmx::MDLogger const&, long, t_commrec const*, bool, int, t_state*, gmx_mtop_t const*, t_inputrec const*, t_state*, gmx::PaddedVector<gmx::BasicVector<float>, gmx::Allocator<gmx::BasicVector<float>, gmx::AlignedAllocationPolicy> >*, gmx::MDAtoms*, gmx_localtop_t*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*, bool) (partition.cpp:3458)
==12805==    by 0x527DF6B: em_dd_partition_system(_IO_FILE*, gmx::MDLogger const&, int, t_commrec const*, gmx_mtop_t*, t_inputrec*, em_state_t*, gmx_localtop_t*, gmx::MDAtoms*, t_forcerec*, gmx_vsite_t*, gmx::Constraints*, t_nrnb*, gmx_wallcycle*) (minimize.cpp:738)
==12805==    by 0x527F5AB: (anonymous namespace)::EnergyEvaluator::run(em_state_t*, float*, float (*) [3], float (*) [3], long, bool) (minimize.cpp:860)
==12805==    by 0x528D89F: gmx::Integrator::do_steep() (minimize.cpp:2525)
==12805==    by 0x5276E74: gmx::Integrator::run(unsigned int, bool) (integrator.cpp:86)
==12805==    by 0x52A1C7B: gmx::Mdrunner::mdrunner() (runner.cpp:1438)
==12805==    by 0x52A3217: gmx::mdrunner_start_fn(void const*) (runner.cpp:219)
==12805==    by 0x53095A5: tMPI_Thread_starter(void*) (tmpi_init.cpp:399)
/usr/include/c++/9/bits/stl_vector.h:1009: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = gmx_cgsort_t; _Alloc = std::allocator<gmx_cgsort_t>; std::vector<_Tp, _Alloc>::reference = gmx_cgsort_t&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.

#45 Updated by Paul Bauer 10 months ago

Checking the different variables, it looks like the issue is that sort->sorted[i] in line 2735 gets accessed with an invalid value for i, because the previous check of i >= ncg_home_old gets optimized away.

#46 Updated by Berk Hess 10 months ago

  • Assignee set to Berk Hess

What system and setting did you use to trigger this?
I can't reproduce it. I added an assertion which should fail if your conclusion is correct:

diff --git a/src/gromacs/domdec/partition.cpp b/src/gromacs/domdec/partition.cpp
index 96ae04536..353a5686a 100644
--- a/src/gromacs/domdec/partition.cpp
+++ b/src/gromacs/domdec/partition.cpp
@@ -2712,6 +2712,9 @@ static void dd_sort_order(const gmx_domdec_t *dd,
 
     if (ncg_home_old >= 0)
     {
+        GMX_RELEASE_ASSERT(sort->sorted.size() >= static_cast<size_t>(ncg_home_old),
+                           "The sorting buffer should contain the old home charge group indices");
+
         std::vector<gmx_cgsort_t> &stationary = sort->stationary;
         std::vector<gmx_cgsort_t> &moved = sort->moved;
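
The idea behind such a check can be illustrated outside GROMACS. The build in this report passes -DNDEBUG, which compiles the standard assert() away, so a check that should fire in a release build has to be implemented separately. A minimal sketch, assuming a hypothetical releaseAssert helper that throws (the real GMX_RELEASE_ASSERT macro may behave differently) and a hypothetical checkSortBuffer function that mirrors the condition in the diff:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical release-mode check: unlike assert(), it stays active when
// NDEBUG is defined. It throws so that a caller or test can observe the failure.
inline void releaseAssert(bool condition, const char* message)
{
    assert(message != nullptr); // debug-only sanity check, removed by -DNDEBUG
    if (!condition)
    {
        throw std::logic_error(message);
    }
}

// Hypothetical stand-in for the guarded section of the sorting routine:
// the buffer must hold at least as many entries as there were old home
// charge groups, otherwise later indexing would run out of bounds.
void checkSortBuffer(const std::vector<int>& sorted, int numOldHome)
{
    if (numOldHome >= 0)
    {
        releaseAssert(sorted.size() >= static_cast<std::size_t>(numOldHome),
                      "The sorting buffer should contain the old home charge group indices");
    }
}
```

If the buffer is shorter than the old home count, the check fails before any element is read, which separates a genuine logic error from a miscompiled comparison.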

#47 Updated by Paul Bauer 10 months ago

I was only able to trigger this from within the fedora rawhide docker container when running rpmbuild

#48 Updated by Berk Hess 10 months ago

Can you try with this assertion then?
To me this starts looking like a compiler bug. I would think the assertion would fail if there is an actual bug in my code.

#49 Updated by Paul Bauer 10 months ago

Just reset the container :)
Need some time to set it up again, but I'll try it as soon as possible.

#50 Updated by Gerrit Code Review Bot 10 months ago

Gerrit received a related patchset '1' for Issue #2813.
Uploader: Berk Hess ()
Change-Id: gromacs~release-2019~I7f13b46d7ff5352ce41838b813c46f2e90c93b1c
Gerrit URL: https://gerrit.gromacs.org/9097

#51 Updated by Berk Hess 10 months ago

  • Status changed from Accepted to Fix uploaded

#52 Updated by Berk Hess 10 months ago

  • Status changed from Fix uploaded to Resolved

#53 Updated by Christoph Junghans 10 months ago

I patched c5c3743.diff into v2019, but it still fails, see attached build log.

But I am not sure if I am missing another patch here, so I will retest after the 2019.1 release.

#54 Updated by Paul Bauer 10 months ago

  • Status changed from Resolved to Accepted

It is missing this one as well

Szilárd Páll wrote:

That's fixed, see dabd3b9d

#55 Updated by Paul Bauer 10 months ago

  • Status changed from Accepted to Resolved

#56 Updated by Christoph Junghans 10 months ago

Paul Bauer wrote:

It is missing this one as well

Yeah, let me rerun on top of 2019.1!

#57 Updated by Mark Abraham 10 months ago

  • Status changed from Resolved to Feedback wanted

#58 Updated by Paul Bauer 10 months ago

Should we bump this now or leave it to be fixed later?

#59 Updated by Mark Abraham 10 months ago

  • Target version changed from 2019.1 to 2019.2

Christoph will get us a new round of feedback after 2019.1.

#60 Updated by Christoph Junghans 10 months ago

Fix confirmed in 2019.1.

#61 Updated by Mark Abraham 10 months ago

  • Target version changed from 2019.2 to 2019.1

#62 Updated by Mark Abraham 10 months ago

  • Status changed from Feedback wanted to Resolved

#63 Updated by Mark Abraham 10 months ago

  • Status changed from Resolved to Closed

#64 Updated by Mark Abraham 9 months ago

How do things look with 2019.1 please, Christoph?
