Bug #2404

Enabling floating point exceptions makes some tests fail

Added by Aleksei Iupinov over 1 year ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: testing
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The issues are exposed by https://gerrit.gromacs.org/7544, which enables some debug floating point exceptions for all modules, not just mdrun,
and enables them as early as possible, not after some of the initial setup.
(Group scheme and TPI are excepted from exceptions :-) ).

1) grompp failures on various complex LJPME tests:

complex.ljpme_LB:

Working dir:  /mnt/workspace/Matrix_PreSubmit_master/ea542155/regressiontests/complex/ljpme_LB
Command line:
  gmx grompp -f ./grompp.mdp -c ./conf -r ./conf -p ./topol -maxwarn 10
... 

Generated 231 of the 231 non-bonded parameter combinations
Generating 1-4 interactions: fudge = 0.5
Generated 231 of the 231 1-4 parameter combinations
Excluding 3 bonded neighbours molecule type 'POPC'
Excluding 2 bonded neighbours molecule type 'SOL'
Number of degrees of freedom in T-Coupling group System is 108.00
ASAN:DEADLYSIGNAL
=================================================================
==29709==ERROR: AddressSanitizer: FPE on unknown address 0x7fbdd7512f1c (pc 0x7fbdd7512f1c bp 0x7ffce879ecf0 sp 0x7ffce879eb20 T0)
    #0 0x7fbdd7512f1b in check_combination_rule_differences(gmx_mtop_t const*, int, int*, int*, int*) /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/gromacs/gmxpreprocess/readir.cpp
    #1 0x7fbdd75105ff in check_combination_rules(t_inputrec const*, gmx_mtop_t const*, warninp*) /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/gromacs/gmxpreprocess/readir.cpp:3990:5
    #2 0x7fbdd750e4a7 in triple_check(char const*, t_inputrec*, gmx_mtop_t*, warninp*) /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/gromacs/gmxpreprocess/readir.cpp:4173:9
    #3 0x7fbdd74b18e2 in gmx_grompp(int, char**) /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/gromacs/gmxpreprocess/grompp.cpp:2154:5
    #4 0x7fbdd6f4602f in gmx::CommandLineModuleManager::run(int, char**) /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:591:22
    #5 0x506d6d in main /home/jenkins/workspace/Matrix_PreSubmit_master/ea542155/gromacs/src/programs/gmx.cpp:60:26
    #6 0x7fbdd55b1f44 in __libc_start_main /build/eglibc-ripdx6/eglibc-2.19/csu/libc-start.c:287
    #7 0x437efb in _start (/mnt/workspace/Matrix_PreSubmit_master/ea542155/gromacs/bin/gmx+0x437efb)

This happens with clang 5.0 at any optimization level from -O1 upward. The solution would be to avoid FPEs on clang for now.

2) Some interesting multi-rank GPU failures (segfaulting at the beginning of the run):

Seen failing with CUDA:
complex.nbnxn_vsite
complex.swap_x
complex.swap_y
complex.swap_z
complex.nbnxn_pme_order6

Seen failing with OpenCL:
complex.swap_y
complex.nbnxn_pme
complex.nbnxn_pme_order5
complex.dd_121

complex.nbnxn_vsite with CUDA 8:

Working dir:  /mnt/workspace/Matrix_PreSubmit_master/663a195d/regressiontests/complex/nbnxn_vsite
Command line:
  gmx mdrun -ntmpi 2 -ntomp 2 -notunepme

Reading file topol.tpr, VERSION 2019-dev-20180208-0dbca14 (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

On host bs-nix1 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Protein'
20 steps,      0.1 ps.
Floating point exception (core dumped)

complex.swap_y with OpenCL:

Working dir:  /mnt/workspace/Matrix_PreSubmit_master/4bead6f3/regressiontests/complex/swap_y
Command line:
  gmx mdrun -ntmpi 2 -ntomp 2 -notunepme

Reading file topol.tpr, VERSION 2019-dev-20180208-0dbca14 (single precision)
Changing nstlist from 10 to 50, rlist from 1.011 to 1.137

Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

On host bs-nix-amd-gpu 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1
SWAP: Determining initial numbers of ions per compartment.
SWAP: Setting pointers for checkpoint writing
SWAP: Channel 0 flux history for ion type NA+ (charge 1): 0 molecules
SWAP: Channel 1 flux history for ion type NA+ (charge 1): 0 molecules
SWAP: Channel 0 flux history for ion type CL- (charge -1): 0 molecules
SWAP: Channel 1 flux history for ion type CL- (charge -1): 0 molecules
starting mdrun 'Channel_coco in octane membrane'
2 steps,      0.0 ps.
Segmentation fault (core dumped)

complex.nbnxn_pme with OpenCL:

Working dir:  /mnt/workspace/Matrix_PreSubmit_master/4bead6f3/regressiontests/complex/nbnxn_pme
Command line:
  gmx mdrun -ntmpi 2 -ntomp 2 -notunepme

Reading file topol.tpr, VERSION 2019-dev-20180208-0dbca14 (single precision)
Changing nstlist from 10 to 100, rlist from 0.9 to 0.999

Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

On host bs-nix-amd-gpu 2 GPUs auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:0,PP:1
starting mdrun 'water+hexane'
20 steps,      0.0 ps.
Segmentation fault (core dumped)

Hopefully this might be related to that OpenCL failure so elusive that I can't even find its redmine issue.

3) Essential dynamics regression test dying in double precision on those 2 slaves that use it:

In the main build log:

13:47:20 FAILED: Can not read eigenval.xvg at gmxtest.pl line 1064.

To add: log file output.

Interestingly, that fails on build slaves with clang-3.4 and gcc-7.1, and I can't reproduce it on my machine with gcc-7.1.

4) complex/walls:

On build gcc-7 double mpi simd=avx_128_fma host=bs_nix1404,bs_nix1404

Reading file topol.tpr, VERSION 2019-dev-20180209-9adf3bd (double precision)
Changing nstlist from 10 to 20, rlist from 0.907 to 0.948

Using 2 MPI processes
Using 4 OpenMP threads per MPI process

starting mdrun 'Water'
20 steps,      0.0 ps.
[bs-nix1404:28570] *** Process received signal ***
[bs-nix1404:28570] Signal: Floating point exception (8)
[bs-nix1404:28570] Signal code:  (7)
[bs-nix1404:28570] Failing at address: 0x7fe76fa83cef
[bs-nix1404:28570] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7fe76cd28330]
[bs-nix1404:28570] [ 1] /mnt/workspace/Matrix_PreSubmit_master/edf4b4b1/gromacs/bin/../lib/libgromacs.so.4(+0x1683cef) [0x7fe76fa83cef]
[bs-nix1404:28570] [ 2] /mnt/workspace/Matrix_PreSubmit_master/edf4b4b1/gromacs/bin/../lib/libgromacs.so.4(+0x168c450) [0x7fe76fa8c450]
[bs-nix1404:28570] [ 3] /opt/gcc/7.1-disable-futex/bin/../lib64/libgomp.so.1(+0x153e6) [0x7fe76d1653e6]
[bs-nix1404:28570] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7fe76cd20184]
[bs-nix1404:28570] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fe76ca4603d]
[bs-nix1404:28570] *** End of error message ***


Related issues

Related to GROMACS - Bug #1677: floating-point exceptions found (Closed, 2015-01-22)
Related to GROMACS - Bug #2642: mdrun with SIMD triggers floating point exceptions (Closed)

Associated revisions

Revision ab4d87ef (diff)
Added by Aleksei Iupinov over 1 year ago

Revert "Enable debug FP exceptions without TPI, not with TPI"

This reverts commit 9aca3e2865d0cae8283df00f20fbe4b33927d854 until
most issues at #2404 are presumed to be fixed.

Change-Id: I5c3b248f2f265794837a6cf980ed4917108a54bb

Revision cbaeb20b (diff)
Added by Roland Schulz 11 months ago

Enable debug FP exceptions without TPI, not with TPI

This is a second attempt at debugging FP exceptions. It reverts the
revert ab4d87ef1b255504d23b619e57d2e98374ba2b3e.

Remove work-around in tpitest.

Fixes #2404

Change-Id: Ib2f8b2e68eef31eeeaefc4c69cee5edf76e7508c

Revision d805669d (diff)
Added by Roland Schulz 11 months ago

Disable FPEs at end of runner

In mdrun unit tests the exceptions enabled for one test were
still active in the next. This caused issues with group
kernel which is known to throw FPEs.

Related #2404

Change-Id: Ie347a29f25ed16836a3164b61c9fca87ca66fc44
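
The leak described in this commit message (traps enabled for one test staying active in the next) is the kind of problem a scope guard can also prevent. The following is only a minimal sketch, assuming Linux with glibc, where feenableexcept, fedisableexcept, and fegetexcept are available as GNU extensions. The names ScopedFpeTraps and trapsScopedCorrectly are hypothetical; this is not the merged change, which disables FPEs at the end of the runner.

```cpp
#include <cassert>
#include <fenv.h> // feenableexcept/fedisableexcept/fegetexcept: glibc extensions (assumption: Linux + glibc)

// Hypothetical RAII guard: enables the requested FP traps for the lifetime of
// the object and restores the previously enabled set on destruction.
class ScopedFpeTraps
{
public:
    explicit ScopedFpeTraps(int excepts) : previous_(fegetexcept())
    {
        assert((excepts & ~FE_ALL_EXCEPT) == 0); // only FE_* bits are meaningful
        // Clear stale status flags first, so that unmasking cannot trap on an
        // exception that was raised before the guard existed.
        feclearexcept(FE_ALL_EXCEPT);
        feenableexcept(excepts);
    }
    ~ScopedFpeTraps()
    {
        feclearexcept(FE_ALL_EXCEPT);
        fedisableexcept(FE_ALL_EXCEPT);
        feenableexcept(previous_); // re-enable whatever was active before
    }
    ScopedFpeTraps(const ScopedFpeTraps&) = delete;
    ScopedFpeTraps& operator=(const ScopedFpeTraps&) = delete;

private:
    int previous_;
};

// Illustrative check: the traps are active inside the scope, and the
// enabled-trap mask after the scope equals the mask before it.
bool trapsScopedCorrectly(int excepts)
{
    const int before        = fegetexcept();
    bool      enabledInside = false;
    {
        ScopedFpeTraps guard(excepts);
        enabledInside = (fegetexcept() & excepts) == excepts;
    }
    return enabledInside && fegetexcept() == before;
}
```

With a guard of this kind, code that is known to raise FPEs (such as the group kernel mentioned above) would simply run outside the guarded scope.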

Revision 8293e951 (diff)
Added by Roland Schulz 11 months ago

Disable FPE for GCC 7.* with optimizations

Only GCC 7 with RelWithAsserts has issues. And debugging
has been unsuccessful.

Related #2404

Change-Id: I0ed3161c12cc8c33af3280f30a97a25c5d38cf1b

Revision 85139de3 (diff)
Added by Mark Abraham 9 months ago

Disable FPE explicitly for TPI force calculation

Our init function for GoogleTest binaries turns on FPE detection,
which is good for the unit tests. But for TPI tests, it is better to
manage the behaviour more explicitly.

Refs #2404

Change-Id: I7b6753f9289a98e154158e7e70aa4019ebfb0d44

History

#1 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#2 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#3 Updated by Aleksei Iupinov over 1 year ago

Oh cool, swap_y is already failing the same way with CUDA in post-submit after https://gerrit.gromacs.org/7542 was merged:
http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/442/OPTIONS=gcc-5%20gpu%20nranks=4%20gpu_id=1%20cuda-8.0%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/testReport/(root)/complex/swap_y/
I've checked that RelWithAssert is also treated as Debug (does not define NDEBUG), and FP exceptions are already enabled in mdrun in master, so it must be related.

#4 Updated by Aleksei Iupinov over 1 year ago

  • Related to Bug #1677: floating-point exceptions found added

#5 Updated by Mark Abraham over 1 year ago

Edited: Rotation code has been known to divide by zero (#1431 used to be reproducibly broken on bluegene, because the run was reproducible enough there to take the cross product of parallel vectors, while everywhere else we tried it took a different numerical path).

#6 Updated by Carsten Kutzner over 1 year ago

Mark Abraham wrote:

Swap code has been known to divide by zero (used to be reproducibly broken on bluegene, because the run was reproducible enough there to take the cross product of parallel vectors, and a different numerical path everywhere else we tried).

Are you sure you didn't mean the ED code here instead of swap?

#7 Updated by Aleksei Iupinov over 1 year ago

OK, not sure how to group/split the guys best, but here's swap_x and nbnxn_pme_order6 failing with CUDA

http://jenkins.gromacs.org/view/Gerrt%20on-demand/job/Matrix_OnDemand/318/OPTIONS=gcc-5%20gpu%20nranks=4%20gpu_id=1%20cuda-8.0%20release-with-assert%20host=bs_nix1204,label=bs_nix1204/testReport/

Maybe I should just make CUDA/OpenCL lists of tests failing, and cross out, since they all pretty much segfault at the beginning.

#8 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#9 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#10 Updated by Mark Abraham over 1 year ago

Carsten Kutzner wrote:

Mark Abraham wrote:

Swap code has been known to divide by zero (used to be reproducibly broken on bluegene, because the run was reproducible enough there to take the cross product of parallel vectors, and a different numerical path everywhere else we tried).

Are you sure you didn't mean the ED code here instead of swap?

I meant rotation #1431. More recently, ED found a bug in the double-precision SIMD math something something.
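
To illustrate the rotation failure mode mentioned here: the cross product of two parallel vectors is the zero vector, so normalizing it divides by zero. A minimal sketch of a guarded normalization follows. The names Vec3, unitNormal, and hasUnitNormal and the threshold minNorm are hypothetical and are not taken from the GROMACS rotation code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

Vec3 cross(const Vec3& a, const Vec3& b)
{
    return Vec3{ a[1] * b[2] - a[2] * b[1],
                 a[2] * b[0] - a[0] * b[2],
                 a[0] * b[1] - a[1] * b[0] };
}

// Writes the unit vector along a x b to *out and returns true, or returns
// false without dividing when a and b are (numerically) parallel.
// minNorm is an arbitrary illustrative threshold, not a recommended value.
bool unitNormal(const Vec3& a, const Vec3& b, Vec3* out, double minNorm = 1e-12)
{
    assert(out != nullptr);
    const Vec3   c    = cross(a, b);
    const double norm = std::sqrt(c[0] * c[0] + c[1] * c[1] + c[2] * c[2]);
    if (!(norm > minNorm)) // written this way so that a NaN norm is rejected too
    {
        return false;
    }
    *out = Vec3{ c[0] / norm, c[1] / norm, c[2] / norm };
    return true;
}

// Convenience wrapper for callers that only need to know whether a normal exists.
bool hasUnitNormal(const Vec3& a, const Vec3& b)
{
    Vec3 n{};
    return unitNormal(a, b, &n);
}
```

For example, (1, 2, 3) and (2, 4, 6) are parallel, so their cross product is exactly zero and the function reports failure instead of dividing.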

#11 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#12 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#13 Updated by Mark Abraham over 1 year ago

  • Description updated (diff)

#14 Updated by Mark Abraham over 1 year ago

Since fixing all this will probably take some time, I propose we revert the FPE enablement until we think we've got things covered. Otherwise Jenkins is going to be even more likely to make spurious failures unrelated to the change under test.

#15 Updated by Gerrit Code Review Bot over 1 year ago

Gerrit received a related patchset '2' for Issue #2404.
Uploader: Aleksei Iupinov ()
Change-Id: gromacs~master~I5c3b248f2f265794837a6cf980ed4917108a54bb
Gerrit URL: https://gerrit.gromacs.org/7560

#16 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#17 Updated by Aleksei Iupinov over 1 year ago

Trying to reproduce failures from (2), running master with FP exceptions on, on dev-piledriver-gpu01 with AMD FX-8350.

cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_COMPILER=g++-5 '-DCMAKE_CXX_LINK_FLAGS=-Wl,-rpath,/usr/bin/../lib64 -L/usr/bin/../lib64' -DCMAKE_C_COMPILER=gcc-5 -DGMX_COMPILER_WARNINGS=ON -DGMX_GPU=ON -DGMX_SIMD=AVX_128_FMA -DGMX_USE_OPENCL=ON -DGMX_USE_RDTSCP=DETECT -DGMX_BUILD_OWN_FFTW=ON ..

running in complex/nbnxn_vsite

valgrind ../../../../bin/gmx mdrun -nsteps 10 -ntmpi 2

==20709== Process terminating with default action of signal 4 (SIGILL)
==20709==  Illegal opcode at address 0x6596833
==20709==    at 0x6596833: _mm_macc_ps (fma4intrin.h:44)
==20709==    by 0x6596833: gmx::fma(gmx::SimdFloat, gmx::SimdFloat, gmx::SimdFloat) (impl_x86_avx_128_fma_simd_float.h:62)
==20709==    by 0x6596C5F: gmx::rsqrtIter(gmx::SimdFloat, gmx::SimdFloat) (simd_math.h:129)
==20709==    by 0x6596CC3: gmx::invsqrt(gmx::SimdFloat) (simd_math.h:152)
==20709==    by 0x6599222: calc_dr_x_xp_simd(int, int, int const*, float const (*) [3], float const (*) [3], float const*, float const*, float const*, float (*) [3], float*, float*) (clincs.cpp:730)
==20709==    by 0x6599F6B: do_lincs(float (*) [3], float (*) [3], float (*) [3], t_pbc*, gmx_lincsdata*, int, float const*, t_commrec*, int, float, int*, float, float (*) [3], int, float (*) [3]) (clincs.cpp:937)
==20709==    by 0x65A0531: constrain_lincs._omp_fn.3 (clincs.cpp:2438)
==20709==    by 0x9127CBE: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==20709==    by 0x659E980: constrain_lincs (clincs.cpp:2425)
==20709==    by 0x6532ACC: constrain (constr.cpp:395)
==20709==    by 0x657C201: do_constrain_first(_IO_FILE*, gmx_constr*, t_inputrec*, t_mdatoms*, t_state*, t_commrec*, t_nrnb*, t_forcerec*, gmx_localtop_t*) (sim_util.cpp:2159)
==20709==    by 0x43AAA5: gmx::do_md(_IO_FILE*, t_commrec*, gmx::MDLogger const&, int, t_filenm const*, gmx_output_env_t const*, MdrunOptions const&, gmx_vsite_t*, gmx_constr*, gmx::IMDOutputProvider*, t_inputrec*, gmx_mtop_t*, t_fcdata*, t_state*, ObservablesHistory*, gmx::MDAtoms*, t_nrnb*, gmx_wallcycle*, t_forcerec*, ReplicaExchangeParameters const&, gmx_membed_t*, gmx_walltime_accounting*) (md.cpp:681)
==20709==    by 0x447A2B: gmx::Mdrunner::mdrunner() (runner.cpp:1331)

Without valgrind, the run seems successful.

I assume it means there is a relevant issue to fix here.

On second thought, maybe this is just valgrind being valgrind and not recognizing intrinsics?

#18 Updated by Roland Schulz over 1 year ago

valgrind accepts some intrinsics, but which ones depends on the valgrind version. Starting with 3.12 it should have FMA4: http://valgrind.org/docs/manual/dist.news.html.

#19 Updated by Aleksei Iupinov over 1 year ago

Thanks for the advice, Roland! Anyway, that was valgrind 3.11, and 3.13, the latest, terminates on

==18443== Process terminating with default action of signal 4 (SIGILL)
==18443==  Illegal opcode at address 0x532A857
==18443==    at 0x532A857: _mm_cmpeq_epi32 (emmintrin.h:1306)
==18443==    by 0x532A857: gmx::operator==(gmx::SimdFInt32, gmx::SimdFInt32) (impl_x86_sse2_simd_float.h:612)
==18443==    by 0x532B52E: gmx::sincos(gmx::SimdFloat, gmx::SimdFloat*, gmx::SimdFloat*) (simd_math.h:898)
==18443==    by 0x5336FB1: pdihs_noener_simd(int, int const*, t_iparams const*, float const (*) [3], float (*) [4], t_pbc const*, t_graph const*, float, t_mdatoms const*, t_fcdata*, int*) (bonded.cpp:2074)
==18443==    by 0x53422DC: (anonymous namespace)::calc_one_bond(int, int, t_idef const*, float const (*) [3], float (*) [4], float (*) [3], t_forcerec const*, t_pbc const*, t_graph const*, gmx_grppairener_t*, t_nrnb*, float const*, float*, t_mdatoms const*, t_fcdata*, int, int*) (listed-forces.cpp:348)
==18443==    by 0x5343AF2: calcBondedForces(t_idef const*, float const (*) [3], t_forcerec const*, t_pbc const*, t_graph const*, gmx_enerdata_t*, t_nrnb*, float const*, float*, t_mdatoms const*, t_fcdata*, int, int*) [clone ._omp_fn.1] (listed-forces.cpp:470)

That still doesn't seem terribly helpful.

#20 Updated by Aleksei Iupinov over 1 year ago

Reproduced the first issue with clang asan. It seems to break down with FPEs enabled on the line

c12i = mtop->ffparams.iparams[(ntypes + 1) * tpi].lj.c12;

where the value is 0, the left-hand side is double, and the right-hand side is real (float).

Still don't see what's the deal here. There are some further divisions and roots, but before that there is a non-zero check.

Also, mtop is optimized out, and when compiled with O0 instead of default O1, there is no error anymore.

Sure seems like a false positive... There are a lot of union accesses here though.

#21 Updated by Roland Schulz over 1 year ago

The FPE goes away at -O0 independent of whether you use ASAN or not? I doubt that the FPE itself is falsely emitted by the HW. You can look at the asm, and I bet you will see that there is an FPE-causing calculation. If it goes away at -O0, it could be that the compiler does an invalid optimization.

#22 Updated by Aleksei Iupinov over 1 year ago

It has only happened with ASAN and O1.

#23 Updated by Roland Schulz over 1 year ago

How did you manage to reproduce it?

For me I don't see errors:
clang 5.0.1
git checkout ab6e08fa58b468
git revert ab4d87ef1b255504d23
CXX=clang++ cmake3 .. -GNinja -DCMAKE_BUILD_TYPE=ASAN

C++ compiler flags: -march=core-avx2 -std=c++11 -Wdeprecated -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wmissing-prototypes -Wall -O1 -g -fsanitize=address -fno-omit-frame-pointer

#24 Updated by Aleksei Iupinov over 1 year ago

Ah, perhaps I should have clarified. We reverted enabling FP exceptions by default, but the hidden option is still there.
So I built ASAN with clang 5.0.0 and ran "gmx grompp -fpexcept yes" in regressiontests/complex/ljpme_LB.

#25 Updated by Roland Schulz over 1 year ago

I thought that because I reverted ab4d87ef1b25 (the revert commit) I didn't need any flag. I do get an error with the command line flag. Thanks.

#26 Updated by Roland Schulz over 1 year ago

I think this is a problem with clang not having proper fp-exception support. If a quick look at the asm isn't misleading me, then the division inside the if statement is moved outside it, so the zero check doesn't help. Multiple patches have been proposed to fix this (e.g. search for HonorFPExceptions), but it seems that they aren't included yet. I suspect that for now we should assume that fp-exceptions are only correct for clang with -O0. With the other compilers it should also be fine with optimization.
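
One way to make a guarded division robust against this kind of hoisting is to ensure the divisor is never zero in the first place, so it does not matter whether the optimizer evaluates the division before the check. This is only a sketch of that idea. The function ratioOrFallback is hypothetical, it is not the code in readir.cpp, and it is not the fix adopted in this issue, where FPEs were avoided for the affected compiler configurations instead.

```cpp
#include <cassert>

// Returns num / den, or `fallback` when den is zero. The division always sees
// a non-zero divisor, so it cannot raise divide-by-zero even if the compiler
// evaluates it before (or independently of) the zero check.
// This only addresses division by zero; overflow with a tiny den is still possible.
double ratioOrFallback(double num, double den, double fallback = 0.0)
{
    const bool   denIsZero = (den == 0.0);
    const double safeDen   = denIsZero ? 1.0 : den; // substitute a harmless divisor
    const double quotient  = num / safeDen;
    return denIsZero ? fallback : quotient;
}

// Usage: the ordinary case and the zero-divisor case.
void ratioExamples()
{
    assert(ratioOrFallback(6.0, 3.0) == 2.0);
    assert(ratioOrFallback(1.0, 0.0) == 0.0);
}
```

The cost is one extra select per division, which is usually acceptable in setup code such as parameter checks.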

#27 Updated by Aleksei Iupinov over 1 year ago

Interesting, thanks. Then the later TODO to resolve (1) is to not enable FPEs with clang, I think - we wouldn't easily be able to get at the optimization level in the preprocessor...
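
For illustration only, the compiler-dependent conclusions in this thread (debug-style builds only, clang trusted only without optimization, GCC 7 excluded when optimizing) could be collected into one compile-time predicate. This sketch assumes that the predefined macros __clang__, __GNUC__, and __OPTIMIZE__ are an acceptable way to detect the compiler and whether optimization is on. GCC and clang define __OPTIMIZE__ when optimizing, other compilers may not, and some non-GCC compilers also define __GNUC__. The names BuildTraits, fpeTrapsAdvisable, and currentBuild are hypothetical and do not reflect how GROMACS implements this.

```cpp
#include <cassert>

struct BuildTraits
{
    bool assertsEnabled; // NDEBUG not defined (Debug and RelWithAssert builds)
    bool isClang;
    int  gccMajor;       // 0 when the compiler is not treated as GCC
    bool optimized;
};

// Hypothetical policy distilled from the discussion in this issue.
constexpr bool fpeTrapsAdvisable(BuildTraits b)
{
    return b.assertsEnabled
           && !(b.isClang && b.optimized)        // clang: only trusted at -O0
           && !(b.gccMajor == 7 && b.optimized); // GCC 7 with optimizations: disabled
}

// Describes the build that is compiling this translation unit.
constexpr BuildTraits currentBuild()
{
    return BuildTraits{
#ifdef NDEBUG
        false,
#else
        true,
#endif
#if defined(__clang__)
        true, 0,
#elif defined(__GNUC__)
        false, __GNUC__,
#else
        false, 0,
#endif
#ifdef __OPTIMIZE__
        true
#else
        false
#endif
    };
}

// Usage: a few configurations that appear in this thread.
void policyExamples()
{
    assert(fpeTrapsAdvisable(BuildTraits{ true, false, 5, true }));  // gcc-5, optimized
    assert(!fpeTrapsAdvisable(BuildTraits{ true, false, 7, true })); // gcc-7, optimized
    assert(!fpeTrapsAdvisable(BuildTraits{ true, true, 0, true }));  // clang, optimized
}
```

Keeping the policy separate from the macro probing makes it possible to check each configuration without building under that compiler. Disabling FPEs for clang altogether, as suggested above, would correspond to dropping the optimized condition from the clang term.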

#28 Updated by Aleksei Iupinov over 1 year ago

Confirmed with an ordinary clang 5.0 build that the problem disappears only with no optimizations.

#29 Updated by Aleksei Iupinov over 1 year ago

  • Description updated (diff)

#31 Updated by Gerrit Code Review Bot 12 months ago

Gerrit received a related patchset '2' for Issue #2404.
Uploader: Roland Schulz ()
Change-Id: gromacs~master~Ib2f8b2e68eef31eeeaefc4c69cee5edf76e7508c
Gerrit URL: https://gerrit.gromacs.org/8058

#32 Updated by Roland Schulz 11 months ago

  • Status changed from New to Resolved

#33 Updated by Mark Abraham 11 months ago

  • Category set to testing
  • Target version set to 2019

#34 Updated by Mark Abraham 11 months ago

  • Status changed from Resolved to Closed

#35 Updated by Mark Abraham 11 months ago

  • Status changed from Closed to Accepted

There still seem to be some transient issues, e.g. https://gerrit.gromacs.org/#/c/8093/ doesn't affect anything relevant, but an MdrunTest (MdrunCanWrite/Trajectories.ThatDifferInNstxout) two tests after the TPI test found a numerical issue.

#37 Updated by Gerrit Code Review Bot 11 months ago

Gerrit received a related patchset '2' for Issue #2404.
Uploader: Roland Schulz ()
Change-Id: gromacs~master~Ie347a29f25ed16836a3164b61c9fca87ca66fc44
Gerrit URL: https://gerrit.gromacs.org/8107

#38 Updated by Mark Abraham 11 months ago

After Roland's latest effort, we have seen

http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/4894/OPTIONS=gcc-7%20gpu%20gpu_id=1%20cuda-9.2%20thread-mpi%20openmp%20cmake-3.6.1%20release-with-assert%20simd=avx2_256%20host=bs_nix1204,label=bs_nix1204/console fail with

00:24:21.846 Abnormal return value for ' gmx mdrun -ntmpi 2   -gpu_id 1 -ntomp 2 -ei sam.edi -eo radfix.xvg >mdrun.out 2>&1' was -1
00:24:25.106 Essential dynamics tests FAILED with 1 errors!
00:24:25.108 (exited with code 1)
00:24:25.108 Traceback (most recent call last):
00:24:25.108   File "/mnt/workspace/Matrix_PreSubmit_master/51663854/releng/releng/__init__.py", line 46, in run_build
00:24:25.108     BuildContext._run_build(factory, build, job_type, opts)
00:24:25.108   File "/mnt/workspace/Matrix_PreSubmit_master/51663854/releng/releng/context.py", line 464, in _run_build
00:24:25.108     script.do_build(context, factory.cwd)
00:24:25.108   File "/mnt/workspace/Matrix_PreSubmit_master/51663854/releng/releng/script.py", line 76, in do_build
00:24:25.108     self._do_build(context)
00:24:25.108   File "/home/jenkins/workspace/Matrix_PreSubmit_master/51663854/gromacs/admin/builds/gromacs.py", line 229, in do_build
00:24:25.108     context.run_cmd(cmd, shell=True, failure_message='Regression tests failed to execute')
00:24:25.108   File "/mnt/workspace/Matrix_PreSubmit_master/51663854/releng/releng/context.py", line 105, in run_cmd
00:24:25.108     raise BuildError(failure_message)
00:24:25.108 BuildError: Regression tests failed to execute

and

http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/4893/OPTIONS=gcc-7%20gpu%20gpu_id=1%20cuda-9.2%20thread-mpi%20openmp%20cmake-3.6.1%20release-with-assert%20simd=avx2_256%20host=bs_nix1204,label=bs_nix1204/testReport/junit/(root)/complex/tip4p_continue/ fail with

GROMACS:      gmx mdrun, version 2019-dev-20180730-9811b1f
Executable:   /home/jenkins/workspace/Matrix_PreSubmit_master/51663854/gromacs/bin/gmx
Data prefix:  /home/jenkins/workspace/Matrix_PreSubmit_master/51663854/gromacs (source tree)
Working dir:  /mnt/workspace/Matrix_PreSubmit_master/51663854/regressiontests/complex/tip4p_continue
Command line:
  gmx mdrun -ntmpi 2 -gpu_id 1 -ntomp 2 -notunepme -cpi ./continue -noappend

The current CPU can measure timings more accurately than the code in
gmx mdrun was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding gmx mdrun with the GMX_USE_RDTSCP=ON CMake option.
Reading file topol.tpr, VERSION 2019-dev-20180730-9811b1f (single precision)
Can not increase nstlist because verlet-buffer-tolerance is not set or used

Using 2 MPI threads
Using 2 OpenMP threads per tMPI thread

On host bs-nix1 1 GPU auto-selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 2 ranks on this node:
  PP:1,PP:1

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin threads to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).
starting mdrun 'Water'
20 steps,      0.0 ps (continuing from step 5,      0.0 ps).
Floating point exception (core dumped)

(and also fail
complex.nbnxn-ljpme-geometric)

#39 Updated by Mark Abraham 11 months ago

and complex.pull_constraint on the same configuration

#40 Updated by Gerrit Code Review Bot 11 months ago

Gerrit received a related patchset '1' for Issue #2404.
Uploader: Roland Schulz ()
Change-Id: gromacs~master~I0ed3161c12cc8c33af3280f30a97a25c5d38cf1b
Gerrit URL: https://gerrit.gromacs.org/8136

#41 Updated by Mark Abraham 11 months ago

It appears that multiple simultaneous Jenkins verifications make this issue much more likely to appear.

We did have an issue in the past that only appeared under such load when DLB completely emptied a domain. But our mdrun integration test that intends to have an empty domain is not reporting issues.

@Szilard has been trying to reproduce this, but his latest status report is somewhere else. Can you summarize/link here, please Szilard?

Otherwise, we could probably profit generally from having sanitizers+openmp+gpu build available. For now, I can offer to implement that on a container on the new container machine so that we can see if that finds issues, but I'm still mostly on holidays and can't prioritize this.

#42 Updated by Mark Abraham 10 months ago

On master, swap_x on the AMD OpenCL verification still segfaults http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/5293/OPTIONS=gcc-5%20openmp%20simd=avx_128_fma%20opencl%20clFFT-2.14%20amdappsdk-3.0%20host=bs_nix-amd_gpu,label=bs_nix-amd_gpu/testReport/junit/(root)/complex/swap_x/ but at least now it's known to be happening after writing final coordinates... but perhaps https://gerrit.gromacs.org/c/8162/ on release-2018 will address this properly.

#43 Updated by Roland Schulz 10 months ago

Any idea why this is happening often again with the GPU builds? Anyone working on it?

#44 Updated by Mark Abraham 10 months ago

Roland Schulz wrote:

Any idea why this is happening often again with the GPU builds? Anyone working on it?

We assume it's a memory misuse. I have been working on a docker image able to do things like msan plus a GPU build and run, but haven't got as far as running tests.

#45 Updated by Roland Schulz 10 months ago

Has someone run ASAN/valgrind on a GPU build? What configurations are known to have an issue? I just checked the last 60 builds (I ignored all builds which had other issues) and could find 2:
http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/5832/
http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/5829/

Those two fail for gcc-6 openmp opencl clFFT-2.14 cuda-7.5 mpi simd=avx2_256 host=bs_nix1310. Have we had errors for any configuration other than Nvidia OpenCL?

#46 Updated by Berk Hess 10 months ago

  • Related to Bug #2642: mdrun with SIMD triggers floating point exceptions added

#47 Updated by Berk Hess 10 months ago

I filed #2642 which explains the source of the mdrun floating point exceptions. I don't know if that explains why it happens more often with GPU builds. It depends on the state of memory the state vectors are allocated in.

#48 Updated by Mark Abraham 9 months ago

Roland Schulz wrote:

Has someone run ASAN/valgrind on a GPU build? What configurations are known to have an issue? I just checked the last 60 builds (I ignored all builds which had other issues) and could find 2:
http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/5832/
http://jenkins.gromacs.org/job/Matrix_PreSubmit_master/5829/

Those two fail for gcc-6 openmp opencl clFFT-2.14 cuda-7.5 mpi simd=avx2_256 host=bs_nix1310. Have we had errors for any configuration other than Nvidia OpenCL?

I have recently been seeing MdrunTest fail on a cuda build (probably gcc-7 gpu gpu_id=1 cuda-9.2 thread-mpi openmp cmake-3.6.1 release-with-assert simd=avx2_256 host=bs_nix1204) but there's none currently archived. However there are reports from that config above.

#49 Updated by Gerrit Code Review Bot 9 months ago

Gerrit received a related patchset '4' for Issue #2404.
Uploader: Mark Abraham ()
Change-Id: gromacs~master~I7b6753f9289a98e154158e7e70aa4019ebfb0d44
Gerrit URL: https://gerrit.gromacs.org/8474

#50 Updated by Mark Abraham 9 months ago

  • Status changed from Accepted to Fix uploaded

#51 Updated by Mark Abraham 8 months ago

  • Status changed from Fix uploaded to Resolved

#52 Updated by Mark Abraham 8 months ago

  • Status changed from Resolved to Closed
