Bug #1677

floating-point exceptions found

Added by Mark Abraham almost 5 years ago. Updated over 4 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
testing
Target version:
Affected version - extra info:
Affected version:
Difficulty:
uncategorized

Description

These happen irregularly (else we'd have seen them before merging the floating-point exception checking for Debug builds). Known occurrences:

Fixing these would be nice, but there's no evidence that the problems are serious (other than irritating Jenkins failures).


Related issues

Related to GROMACS - Bug #1693: Jenkins Tests seldomly failing (Closed)
Related to GROMACS - Bug #2404: Enabling floating point exceptions makes some tests fail (Closed)

Associated revisions

Revision 19f442ab (diff)
Added by Roland Schulz almost 5 years ago

Fix FP exception in FE

Part of #1677

Change-Id: I212d52a232763e3f3b3652ea488ba5e05683aab1

Revision 2d9d899e (diff)
Added by Mark Abraham over 4 years ago

Avoid possible floating-point exception

Refs #1677

Change-Id: I7cc73d50a64ecfd05120ecfd464a47665f1e9df3

Revision 7ca8a04c (diff)
Added by Teemu Murtola over 4 years ago

Fix potential FPE in selection tests

Always initialize dynamic evaluation results to zero during the static
analysis pass. The old code tried to do that, but missed a few cases
where the same expression was evaluated more than once. Even during
static analysis, some dynamic arithmetic expressions are actually
evaluated, which could lead to FPE if the memory was uninitialized, even
though the results of the evaluation were never used.

Part of #1677.

Change-Id: I0cf73eb656d58f7ad08f6e12705c6d8a273adf84

Revision f38571f9 (diff)
Added by Mark Abraham over 4 years ago

Handle timing better in Jenkins

Sometimes we get floating-point exceptions in Jenkins when timers get
strange values and e.g. the code divides by zero. Added several checks
for such behaviour, including a new early exit from the timing routine
when a non-positive number of CPU cycles were counted. Removed a
recently added check that is now redundant because of the early exit.

Added several assertions about numbers of threads and ranks, which
might help reducing future mysterious FP exceptions if things are
broken elsewhere.

Fixes #1677

Change-Id: I532e164fc13e91f5f109dd63bb99c1569bdc70cd
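
The early exit described in revision f38571f9 can be illustrated with a minimal sketch. This is not the GROMACS timing routine; the helper name cyclePercentage and its signature are hypothetical. It only shows the idea of refusing to divide when a non-positive cycle total was counted.

```cpp
#include <cstdint>

// Hypothetical helper (not GROMACS code): computes the share of `part`
// in `total` as a percentage. Returns false and leaves *percent untouched
// when the total is not positive, so the caller can exit early instead
// of dividing by zero.
bool cyclePercentage(std::int64_t part, std::int64_t total, double* percent)
{
    if (total <= 0)
    {
        // A zero or negative total means the timer returned a strange
        // value; no meaningful percentage can be reported.
        return false;
    }
    *percent = 100.0 * static_cast<double>(part) / static_cast<double>(total);
    return true;
}
```

A caller would skip printing the timing table whenever the function returns false.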

Revision 033fdfcc (diff)
Added by Berk Hess over 4 years ago

Added missing DD cycle counting

Cycle counting was missing for DD repartitioning after replica
exchange or coord swap. Removed DD cycle counting for initial DD.
Moved DD cycle counting into dd_partition_system and added
subtract_cycles function with assertion to detect cycle wrapping.

Fixes #1677.

Change-Id: I7f1b19397b36456f1d120dbc0080146a384def5a

History

#1 Updated by Teemu Murtola almost 5 years ago

In the selection tests the likely reason is that some code operates on an uninitialized buffer of reals, and then occasionally these happen to be NaN. The most likely place for this to happen is during selection compilation, where the coordinates of atoms are not yet known. But the result of the computation isn't used, either. This is very difficult to find, though, since there's quite a bit of code...
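
To make the mechanism concrete, here is a small sketch of how arithmetic on arbitrary bit patterns can set floating-point exception flags even when the result is discarded. The function names are made up for illustration, and the sketch only inspects the sticky flags from <cfenv>; whether a real trap (SIGFPE) fires depends on which exceptions the build has enabled.

```cpp
#include <cfenv>
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit pattern as a float, standing in for whatever
// happens to be in an uninitialized buffer. (Illustrative helper.)
float floatFromBits(std::uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Returns true if adding the two values sets the invalid or overflow flag.
// The volatile variables keep the compiler from folding the addition away.
bool additionRaisesFpFlag(float a, float b)
{
    volatile float lhs = a;
    volatile float rhs = b;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float sum = lhs + rhs;
    (void)sum; // result is never used, just like in the static analysis pass
    return std::fetestexcept(FE_INVALID | FE_OVERFLOW) != 0;
}
```

Zero-initialized operands stay silent, while patterns such as two copies of the largest finite float (0x7F7FFFFF) overflow on addition. That matches the observation that the failure only shows up occasionally, depending on what the memory contains.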

#2 Updated by Mark Abraham almost 5 years ago

  • Description updated (diff)

#3 Updated by Mark Abraham almost 5 years ago

I have started working on a MemorySanitizer build, which needs a full custom-built dependency stack in order to work well. But it should find some of the issues we are observing here.

#4 Updated by Mark Abraham almost 5 years ago

  • Description updated (diff)

#5 Updated by Roland Schulz almost 5 years ago

I was able to produce a core file for the free-energy issue. I ran it 100 times after setting "ulimit -c unlimited". It hit the FP exception once and produced the core file. The core file is on bs_centos63 at /home/jenkins/testing/regressiontests/freeenergy/transformAtoB/core.885. The corresponding binary is at /home/jenkins/testing/gromacs/build.debug/bin/gmx. It crashes at pairs.cpp:323 because dvdl2 is NaN. I uploaded a patch which I think should fix the problem. The problem should have been found by MSAN. Maybe we should add that as a configuration (even though it is a bit of a pain to set up). Also, we might want to make Jenkins generate core files (ulimit -c unlimited) and archive the core files and the binary. Then one can look at the core file in those cases without having to reproduce it. We could keep only the last ~20 or so, so that they don't take too much space.

#6 Updated by Gerrit Code Review Bot almost 5 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Roland Schulz ()
Change-Id: I212d52a232763e3f3b3652ea488ba5e05683aab1
Gerrit URL: https://gerrit.gromacs.org/4420

#7 Updated by Roland Schulz almost 5 years ago

The non-critical FP exceptions are not found by MSAN because MSAN only flags things as uninitialized read if it affects program execution. And because the uninitialized value in the free-energy case was a dummy variable it wasn't affecting the rest of the program and thus wasn't flagged.

#8 Updated by Roland Schulz over 4 years ago

  • Related to Bug #1693: Jenkins Tests seldomly failing added

#10 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Mark Abraham ()
Change-Id: I7cc73d50a64ecfd05120ecfd464a47665f1e9df3
Gerrit URL: https://gerrit.gromacs.org/4658

#11 Updated by Roland Schulz over 4 years ago

The numerical exception in the selection test happened again and I examined the core file:

* thread #1: tid = 0x0000, 0x000000010596fb03 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 835, stop reason = signal SIGSTOP
    frame #0: 0x000000010596fb03 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 835
libgromacs.1.0.0.dylib`_gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 835:
-> 0x10596fb03:  addss  %xmm3, %xmm1
   0x10596fb07:  jmpq   0x10596fa07               ; _gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 583
   0x10596fb0c:  nopl   (%rax)
   0x10596fb10:  unpcklps %xmm3, %xmm3

(lldb) bt all
* thread #1: tid = 0x0000, 0x000000010596fb03 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 835, stop reason = signal SIGSTOP
  * frame #0: 0x000000010596fb03 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_arithmetic(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 835
    frame #1: 0x000000010596aa41 libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 1345
    frame #2: 0x00000001059707dd libgromacs.1.0.0.dylib`_gmx_sel_evaluate_subexpr(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 605
    frame #3: 0x000000010596b2d7 libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 3543
    frame #4: 0x0000000105971a02 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_subexprref(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 66
    frame #5: 0x000000010596a8ea libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 1002
    frame #6: 0x000000010596feec libgromacs.1.0.0.dylib`_gmx_sel_evaluate_method_params(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 124
    frame #7: 0x000000010596aade libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 1502
    frame #8: 0x000000010596eb75 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_and(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 549
    frame #9: 0x000000010596b6eb libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 4587
    frame #10: 0x000000010596de85 libgromacs.1.0.0.dylib`_gmx_sel_evaluate_subexpr_simple(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 37
    frame #11: 0x000000010596a7c8 libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 712
    frame #12: 0x000000010596aaaf libgromacs.1.0.0.dylib`analyze_static(gmx_sel_evaluate_t*, boost::shared_ptr<gmx::SelectionTreeElement> const&, gmx_ana_index_t*) + 1455
    frame #13: 0x000000010596d186 libgromacs.1.0.0.dylib`gmx::SelectionCompiler::compile(gmx::SelectionCollection*) + 2166
    frame #14: 0x00000001059a6437 libgromacs.1.0.0.dylib`gmx::SelectionCollection::compile() + 87
    frame #15: 0x00000001052db409 selection-test`(anonymous namespace)::SelectionCollectionDataTest::runCompiler() + 57
    frame #16: 0x00000001052e1b95 selection-test`(anonymous namespace)::SelectionCollectionDataTest::runTest(char const*, gmx::ConstArrayRef<char const*> const&) + 645
    frame #17: 0x00000001052e2995 selection-test`(anonymous namespace)::SelectionCollectionDataTest_HandlesComplexNumericVariables_Test::TestBody() + 53
    frame #18: 0x0000000105355a93 selection-test`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 51
    frame #19: 0x0000000105349278 selection-test`testing::Test::Run() + 536
    frame #20: 0x0000000105349558 selection-test`testing::TestInfo::Run() + 728
    frame #21: 0x0000000105349638 selection-test`testing::TestCase::Run() + 216
    frame #22: 0x000000010534b9d0 selection-test`testing::internal::UnitTestImpl::RunAllTests() (.part.534) + 1248
    frame #23: 0x000000010534be29 selection-test`testing::UnitTest::Run() + 105
    frame #24: 0x000000010535a145 selection-test`main + 53

(lldb) print $xmm3
(unsigned char __attribute__((ext_vector_type(16)))) $0 = {
  [0] = '\0'
  [1] = '\0'
  [2] = '\0'
  [3] = '\0'
  [4] = '\0'
  [5] = '\0'
  [6] = '\0'
  [7] = '\0'
  [8] = '\0'
  [9] = '\0'
  [10] = '\0'
  [11] = '\0'
  [12] = '\0'
  [13] = '\0'
  [14] = '\0'
  [15] = '\0'
}
(lldb) print $xmm1
(unsigned char __attribute__((ext_vector_type(16)))) $1 = {
  [0] = '\0'
  [1] = '\0'
  [2] = '\0'
  [3] = '\0'
  [4] = '\0'
  [5] = '\0'
  [6] = '\0'
  [7] = '\0'
  [8] = '\0'
  [9] = '\0'
  [10] = '\0'
  [11] = '\0'
  [12] = '\0'
  [13] = '\0'
  [14] = '\0'
  [15] = '\0'
}

I don't understand why it ended with SIGSTOP; I thought it should be SIGFPE. I also don't understand why it would have stopped on that addss if xmm3 and xmm1 are both 0. Both could be explained if SIGFPE is caught by the OS and the process is then killed by it. But why would that happen? And is it possible to disable that, to get the exact line? My Google search for SIGFPE on Mac turned up extremely little.

#12 Updated by Teemu Murtola over 4 years ago

I have a guess at what could be the cause, but it might take a bit of time to come up with a fix.

#13 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Teemu Murtola ()
Change-Id: I0cf73eb656d58f7ad08f6e12705c6d8a273adf84
Gerrit URL: https://gerrit.gromacs.org/4678

#14 Updated by Teemu Murtola over 4 years ago

  • Status changed from New to In Progress
  • Target version set to 5.1

Are all three of the issues now fixed?

#15 Updated by Erik Lindahl over 4 years ago

  • Status changed from In Progress to Resolved

I would assume so. The commits are all in the repo, I haven't seen Jenkins act up lately, and nobody has said anything else for a week. Let's close it for now!

#17 Updated by Mark Abraham over 4 years ago

Teemu Murtola wrote:

http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/9134/OPTIONS=Compiler=gcc%20CompilerVersion=4.7%20GMX_GPU=OFF%20CUDA=4.2%20GMX_DOUBLE=ON%20GMX_MPI=OFF%20GMX_SIMD=SSE4.1%20host=bs_nix1204,label=bs_nix1204/testReport/junit/(root)/complex/swap_z/

still fails with an fp exception (but this was not in the initial list of cases found).

That's similar to a recent issue. Heavyweight fix incoming.

#18 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Mark Abraham ()
Change-Id: I532e164fc13e91f5f109dd63bb99c1569bdc70cd
Gerrit URL: https://gerrit.gromacs.org/4698

#19 Updated by Erik Lindahl over 4 years ago

  • Status changed from Resolved to Closed

#20 Updated by Mark Abraham over 4 years ago

  • Status changed from Closed to Accepted

Found another one at http://jenkins.gromacs.org/job/Gromacs_Gerrit_master/9185/OPTIONS=Compiler=gcc%20CompilerVersion=4.4%20GMX_GPU=OFF%20GMX_DOUBLE=ON%20GMX_MPI=ON%20GMX_SIMD=AVX_256%20host=bs_nix1310,label=bs_nix1310/testReport/ (which has the latest patch 4698 above in its ancestry).

Looks like the offending FP operation is at line 920 (though without a stackdump, my gdb-fu may be off base here). Since tot is > 0 from patch 4698, cyc_sum[ewcNS] must be garbage?

#21 Updated by Mark Abraham over 4 years ago

Aha, found the core file.

(gdb) print tot
$10 = 24551493408
(gdb) print ewcNS
$11 = ewcNS
(gdb) print cyc_sum[ewcNS]
$12 = 2159767016
(gdb) print cyc_sum[ewcNS]/tot
$13 = 0.087968865278730823
(gdb) print 100*cyc_sum[ewcNS]/tot
$14 = 8.796886527873081
(gdb) bt
#0  wallcycle_print (fplog=0x17b1fe0, nnodes=2, npme=0, realtime=0.90258157253265381, wc=0x18d22d0, gpu_t=0x0)
    at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/gromacs/timing/wallcycle.c:924
#1  0x00007fbac7e1c711 in finish_run (fplog=0x17b1fe0, cr=0x17a8a50, inputrec=0x17a3040, nrnb=0x18d2890, wcycle=0x18d22d0, walltime_accounting=0x1f02be0, nbv=0x1c394c0, bWriteStat=1)
    at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/gromacs/mdlib/sim_util.cpp:2629
#2  0x000000000041956d in mdrunner (hw_opt=0x7ffe4f716960, fplog=0x17b1fe0, cr=0x17a8a50, nfile=35, fnm=0x7ffe4f715900, oenv=0x17b4520, bVerbose=0, bCompact=1, nstglobalcomm=-1, ddxyz=0x7ffe4f716a90, 
    dd_node_order=1, rdd=0, rconstr=0, dddlb_opt=0x42bbaa "auto", dlb_scale=0.80000000000000004, ddcsx=0x0, ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x42bbaa "auto", nstlist_cmdline=0, nsteps_cmdline=-2, 
    nstepout=100, resetstep=-1, nmultisim=0, repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1, imdport=8888, Flags=7168)
    at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/programs/mdrun/runner.cpp:1326
#3  0x000000000041b731 in gmx_mdrun (argc=2, argv=0x7ffe4f717e50) at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/programs/mdrun/mdrun.cpp:600
#4  0x000000000040fe89 in (anonymous namespace)::NoNiceModule::run (this=0x179a5d0, argc=2, argv=0x7ffe4f717e50)
    at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/programs/legacymodules.cpp:156
#5  0x00007fbac67d7ab0 in gmx::CommandLineModuleManager::run (this=0x7ffe4f717d30, argc=2, argv=0x7ffe4f717e50)
    at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:554
#6  0x000000000040dae3 in main (argc=3, argv=0x7ffe4f717e48) at /mnt/workspace/Gromacs_Gerrit_master@2/bf0014eb/gromacs/src/programs/gmx.cpp:60

This seems quite sane. What am I missing?

#22 Updated by Erik Lindahl over 4 years ago

Tried to log in and run on bs_nix1310, but I couldn't reproduce it. Where is the core file?

A couple of random ideas:

1) If the error occurs on line 924, isn't that the next md_print_warn() call? Then we should check cyc_sum[ewcDOMDEC] too.

2) I'm not sure if things could be delayed a few lines, so we might want to check other FP numbers in the file too?

#23 Updated by Mark Abraham over 4 years ago

Erik Lindahl wrote:

Tried to log in and run on bs_nix1310, but I couldn't reproduce it. Where is the core file?

Sorry didn't think to save it. Roland has set up a core-saving mechanism, but it didn't seem to work this time. Not sure why.

For the record, the core file got dumped in ~jenkins/workspace/path/shown/top/of/console/log/regressiontests/complex/swap_y

A couple of random ideas:

1) If the error occurs on line 924, isn't that the next md_print_warn() call? Then we should check cyc_sum[ewcDOMDEC] too.

2) I'm not sure if things could be delayed a few lines, so we might want to check other FP numbers in the file too?

#25 Updated by Erik Lindahl over 4 years ago

(removed duplicate link)

#26 Updated by Roland Schulz over 4 years ago

(gdb) print 100*cyc_sum[ewcDOMDEC]/tot+0.5
$7 = 105834145302.19266
(gdb) p cyc_sum[ewcDOMDEC]
$3 = 1.8446744073720431e+19
=> 0x00007ff2cc6ae609 <+4493>:  cvttsd2si %xmm0,%edx
(gdb) print $xmm0
$6 = {v4_float = {-1.77316193e+23, 46.1603546, 0, 0}, v2_double = {105834145302.19266, 0}, v16_int8 = {82, 
    49, 22, -26, 52, -92, 56, 66, 0, 0, 0, 0, 0, 0, 0, 0}, v8_int16 = {12626, -6634, -23500, 16952, 0, 0, 0, 
    0}, v4_int32 = {-434753198, 1111008308, 0, 0}, v2_int64 = {4771744352304509266, 0}, 
  uint128 = 4771744352304509266}

It seems the problem is the double->int conversion, which overflows. The core file is at bs-nix1204:~/testing/roland if someone wants to look at why cyc_sum[ewcDOMDEC] is so big.
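
As a sketch of the failure mode: the value in xmm0 (about 1.06e11) is far outside the range of a 32-bit int, so the truncating conversion cannot produce a valid result and raises an FP exception when those are enabled. A range check before the cast avoids this. The helper below is hypothetical and not the fix that was applied in GROMACS.

```cpp
#include <climits>

// Hypothetical helper: converts `value` to int by truncation only when the
// truncated result is representable. Returns false (leaving *result
// untouched) for out-of-range values and for NaN.
bool doubleToIntChecked(double value, int* result)
{
    const double lower = static_cast<double>(INT_MIN) - 1.0;
    const double upper = static_cast<double>(INT_MAX) + 1.0;
    // Written so that NaN fails the test: comparisons with NaN are false.
    if (!(value > lower && value < upper))
    {
        return false;
    }
    *result = static_cast<int>(value);
    return true;
}
```

With the numbers from the gdb sessions in this thread, 105834145302.19266 is rejected, while a sane percentage such as 8.796886527873081 + 0.5 converts to 9.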

#27 Updated by Mark Abraham over 4 years ago

Roland Schulz wrote:

[...]

It seems the problem is the double->int conversion, which overflows. The core file is at bs-nix1204:~/testing/roland if someone wants to look at why cyc_sum[ewcDOMDEC] is so big.

The values in the core file are insane, so there must be some garbage value going into the wallcycle summation, or similar.

#28 Updated by Erik Lindahl over 4 years ago

Hi,

I checked the raw values of the timestep counters on bs-nix1204, and they are around 1.5E+15. In other words, it's not merely an error of dumping a raw cycle counter (instead of a difference) there.

#29 Updated by Erik Lindahl over 4 years ago

Can't find that core file either... Roland/Mark: if it's still around (or next time it happens), could you see if you can check the values of the raw counters? I.e., the list of 64-bit integers in wc->wcc[] ?

#30 Updated by Erik Lindahl over 4 years ago

Another occurrence:

http://jenkins.gromacs.org/view/Gerrit%20master/job/Gromacs_Gerrit_master/9237/OPTIONS=Compiler=gcc%20CompilerVersion=4.4%20GMX_GPU=OFF%20GMX_DOUBLE=ON%20GMX_MPI=ON%20GMX_SIMD=AVX_256%20host=bs_nix1310,label=bs_nix1310/testReport/junit/(root)/complex/swap_x/

Here I managed to copy the entire build tree to localadmin's homedir on bs-nix1310 (bf0014eb), including the source, binaries and core file.

One observation I made is that a bunch of the files in gromacs/src/testutils/tests/refdata produced errors like

ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesPresenceChecks.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesFloatingPointData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesSequenceItemIndices.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesMissingData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesStringBlockData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/..: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesSpecialCharactersInStrings.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesSimpleData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesMultipleNullIds.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesSequenceData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesVectorData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/.: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesIncorrectData.xml: Permission denied
ls: cannot access /tmp/bf0014eb/gromacs/src/testutils/tests/refdata/ReferenceDataTest_HandlesMultipleChecksAgainstSameData.xml: Permission denied
total 0
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesVectorData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesStringBlockData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesSpecialCharactersInStrings.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesSimpleData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesSequenceItemIndices.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesSequenceData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesPresenceChecks.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesMultipleNullIds.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesMultipleChecksAgainstSameData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesMissingData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesIncorrectData.xml
-????????? ? ? ? ?            ? ReferenceDataTest_HandlesFloatingPointData.xml
d????????? ? ? ? ?            ? ..
d????????? ? ? ? ?            ? .

This is for my temporary location in /tmp, but they looked the same in the workspace. Could there be something wrong with our filesystem that contributes to these errors?

#31 Updated by Erik Lindahl over 4 years ago

Only cyc_sum[ewcDOMDEC] seems to be screwed up.

The raw cycle counters on this process (which we have access to) also seem to look fine, so something appears to go wrong in the second MPI process.

#32 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Erik Lindahl ()
Change-Id: I4b9b452ac8079e0586ba6249b36ff2bba5077161
Gerrit URL: https://gerrit.gromacs.org/4722

#33 Updated by Erik Lindahl over 4 years ago

I think I found it. The double data was cast back to (int) after MPI summation, which will overflow. Changed to gmx_cycles_t instead. Knock on wood...

#34 Updated by Erik Lindahl over 4 years ago

  • Status changed from Accepted to Fix uploaded
  • Assignee set to Erik Lindahl

#35 Updated by Erik Lindahl over 4 years ago

  • Status changed from Fix uploaded to Accepted

Didn't fix it.

#36 Updated by Berk Hess over 4 years ago

I looked at all uses of ewcDOMDEC and those look fine. There are corrections applied in wallcycle.c:
wcc[ewcDOMDEC].c -= wcc[ewcDDCOMMLOAD].c;
wcc[ewcDOMDEC].c -= wcc[ewcDDCOMMBOUND].c;
but also the use of the other two looks ok.
I have difficulties accessing the files of the crashes. Is one of the other two counters misbehaving?

#37 Updated by Erik Lindahl over 4 years ago

We can only access the core file for the first MPI rank (presumably because only that process crashes).

There, cyc_sum[ewcDOMDEC] is screwed up while the entries for ewcDDCOMMLOAD and ewcDDCOMMBOUND look fine. The raw cycle counters are also fine on this rank. The only conclusion I can come up with is that something goes bad on MPI rank 1.

I haven't been able to reproduce anything with interactive logins either, despite testing ~50 executions. If we can't solve it any other way, perhaps we should add code to send over the raw cycles from the other MPI processes too, so we can check the contents the next time we have a random exception?

It might also be a clue that it always seems to happen for the swap_x or swap_y tests.

#38 Updated by Erik Lindahl over 4 years ago

Memory sanitizer builds don't find anything either.

#39 Updated by Gerrit Code Review Bot over 4 years ago

Gerrit received a related patchset '1' for Issue #1677.
Uploader: Berk Hess ()
Change-Id: I7f1b19397b36456f1d120dbc0080146a384def5a
Gerrit URL: https://gerrit.gromacs.org/4723

#40 Updated by Berk Hess over 4 years ago

I uploaded a patch that might fix this.
My guess is that due to the extra dd_partition_system call with swap (and repl ex), which is not timed, we can sometimes accumulate more cycles in the two DDCOMM counters than in ewcDOMDEC itself, so subtracting leads to a negative number. This is just a guess though.

#41 Updated by Erik Lindahl over 4 years ago

Yes, I think that's it. I looked into the numbers a bit more after writing the comment on your patch, and the value above (~1.84E+19) is within roughly 10,000,000 cycles of 2^64.

Since we typically use an unsigned long long for gmx_cycles, this means things will wrap around rather than generate a negative value in the subtraction. And, it also explains why we only saw it very infrequently - in most cases we likely get the right sign.
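
The wraparound can be reproduced in a few lines. The sketch below uses a stand-in type cycles_t and hypothetical function names; it is not the subtraction helper added in the GROMACS patch, only an illustration of why an unsigned subtraction lands just below 2^64 and how a check can detect it.

```cpp
#include <cstdint>

// Stand-in for an unsigned 64-bit cycle counter type (assumption).
typedef std::uint64_t cycles_t;

// Plain unsigned subtraction: when sub > total, the result wraps
// modulo 2^64 instead of becoming negative.
cycles_t subtractWrapping(cycles_t total, cycles_t sub)
{
    return total - sub;
}

// Checked variant: refuses to subtract when the result would wrap.
// Returns false and leaves *result untouched in that case.
bool subtractChecked(cycles_t total, cycles_t sub, cycles_t* result)
{
    if (sub > total)
    {
        return false;
    }
    *result = total - sub;
    return true;
}
```

For example, subtractWrapping(5, 10) yields 2^64 - 5, which as a double is about 1.84E+19, the same magnitude as the cyc_sum[ewcDOMDEC] value seen in the core file.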

#42 Updated by Erik Lindahl over 4 years ago

  • Status changed from Accepted to Fix uploaded

#43 Updated by Erik Lindahl over 4 years ago

  • Status changed from Fix uploaded to Resolved

#44 Updated by Erik Lindahl over 4 years ago

  • Status changed from Resolved to Closed

#45 Updated by Aleksei Iupinov almost 2 years ago

  • Related to Bug #2404: Enabling floating point exceptions makes some tests fail added
