Bug #2776

FP exception in pullAllReduce

Added by Szilárd Páll almost 2 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
mdrun
Target version:
-
Affected version - extra info:
2019-beta3-dev-20181122-a59efce
Affected version:
Difficulty:
uncategorized

Description

complex/pull_geometry_angle triggers an FP exception as shown here:
http://jenkins.gromacs.org/job/Matrix_PreSubmit_2019/OPTIONS=gcc-8%20openmp%20simd=avx2_256%20gpuhw=amd%20opencl-1.2%20clFFT-2.14%20host=bs_gpu01,label=bs_gpu01/251/testReport/junit/(root)/complex/pull_geometry_angle/

The above case has been rerun and is shown to trigger the exception here:

#0  0x00007f22a291b11a in tMPI_DOUBLE_sum (dest=0x11caef0, src_a=0x11caef0, src_b=0x7f228c7a1f70, count=36)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/external/thread_mpi/src/tmpi_ops.h:99
#1  0x00007f22a2913c29 in tMPI_Reduce_run_op (dest=0x11caef0, src_a=0x11caef0, src_b=0x7f228c7a1f70, datatype=0x7f22a381d080 <tmpi_double>,
    count=36, op=TMPI_SUM, comm=0xe09f50)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/external/thread_mpi/src/reduce.cpp:71
#2  0x00007f22a2913f4d in tMPI_Reduce_fast (sendbuf=0x11caef0, recvbuf=0x11caef0, count=36, datatype=0x7f22a381d080 <tmpi_double>, op=TMPI_SUM,
    root=0, comm=0xe09f50) at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/external/thread_mpi/src/reduce.cpp:176
#3  0x00007f22a29142fb in tMPI_Allreduce (sendbuf=0x11caef0, recvbuf=0x11caef0, count=36, datatype=0x7f22a381d080 <tmpi_double>, op=TMPI_SUM,
    comm=0xe09f50) at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/external/thread_mpi/src/reduce.cpp:312
#4  0x00007f22a1ecc954 in gmx_sumd (nr=36, r=0x11caef0, cr=0xe0a340)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/gmxlib/network.cpp:305
#5  0x00007f22a1bd7452 in gmxAllReduce (n=36, data=0x11caef0, cr=0xe0a340)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/pulling/pullutil.cpp:91
#6  0x00007f22a1bdb6e0 in pullAllReduce<double> (cr=0xe0a340, comm=0x122b900, n=36, data=0x11caef0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/pulling/pullutil.cpp:106
#7  0x00007f22a1bd991e in pull_calc_coms (cr=0xe0a340, pull=0x122b850, md=0xe5e670, pbc=0x7fff11dd9d00, t=0, x=0x1902a80, xp=0x0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/pulling/pullutil.cpp:716
#8  0x00007f22a1bf1ab6 in pull_potential (pull=0x122b850, md=0xe5e670, pbc=0x7fff11dd9d00, cr=0xe0a340, t=0, lambda=0, x=0x1902a80,
    force=0x7fff11dda080, dvdlambda=0x7fff11dd9cfc)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/pulling/pull.cpp:1527
#9  0x00007f22a27f6c7c in pull_potential_wrapper (cr=0xe0a340, ir=0x7fff11ddb6d0, box=0x18ff3d4, x=..., force=0x7fff11dda080, mdatoms=0xe5e670,
    enerd=0x18ffb30, lambda=0x18ff3b8, t=0, wcycle=0xdf04c0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdlib/sim_util.cpp:281
#10 0x00007f22a27f816d in computeSpecialForces (fplog=0xdf4670, cr=0xe0a340, inputrec=0x7fff11ddb6d0, awh=0x0, enforcedRotation=0x0, step=0,
    t=0, wcycle=0xdf04c0, forceProviders=0xdf0510, box=0x18ff3d4, x=..., mdatoms=0xe5e670, lambda=0x18ff3b8, forceFlags=981,
    forceWithVirial=0x7fff11dda080, enerd=0x18ffb30, ed=0x0, bNS=true)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdlib/sim_util.cpp:844
#11 0x00007f22a27fa210 in do_force_cutsVERLET (fplog=0xdf4670, cr=0xe0a340, ms=0x0, inputrec=0x7fff11ddb6d0, awh=0x0, enforcedRotation=0x0,
    step=0, nrnb=0xe78600, wcycle=0xdf04c0, top=0x1355140, box=0x18ff3d4, x=..., hist=0x18ff638, force=..., vir_force=0x7fff11ddaae0,
    mdatoms=0xe5e670, enerd=0x18ffb30, fcd=0xdfac10, lambda=0x18ff3b8, graph=0x0, fr=0xe789a0, ic=0xe78c20, vsite=0x0, mu_tot=0x7fff11ddaa14,
    t=0, ed=0x0, flags=981, ddOpenBalanceRegion=DdOpenBalanceRegionBeforeForceComputation::yes,
    ddCloseBalanceRegion=DdCloseBalanceRegionAfterForceComputation::yes)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdlib/sim_util.cpp:1572
#12 0x00007f22a27fbe2e in do_force (fplog=0xdf4670, cr=0xe0a340, ms=0x0, inputrec=0x7fff11ddb6d0, awh=0x0, enforcedRotation=0x0, step=0,
    nrnb=0xe78600, wcycle=0xdf04c0, top=0x1355140, groups=0x7fff11ddb560, box=0x18ff3d4, x=..., hist=0x18ff638, force=...,
    vir_force=0x7fff11ddaae0, mdatoms=0xe5e670, enerd=0x18ffb30, fcd=0xdfac10, lambda=..., graph=0x0, fr=0xe789a0, vsite=0x0,
    mu_tot=0x7fff11ddaa14, t=0, ed=0x0, flags=981, ddOpenBalanceRegion=DdOpenBalanceRegionBeforeForceComputation::yes,
    ddCloseBalanceRegion=DdCloseBalanceRegionAfterForceComputation::yes)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdlib/sim_util.cpp:2151
#13 0x00007f22a2879d12 in gmx::Integrator::do_md (this=0x7fff11ddb1c0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdrun/md.cpp:884
#14 0x00007f22a2874ccd in gmx::Integrator::run (this=0x7fff11ddb1c0, ei=0, doRerun=false)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdrun/integrator.cpp:72
#15 0x00007f22a2899bfd in gmx::Mdrunner::mdrunner (this=0x7fff11ddbf70)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/mdrun/runner.cpp:1425
#16 0x000000000040d819 in gmx::gmx_mdrun (argc=4, argv=0x7fff11ddd0d0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/programs/mdrun/mdrun.cpp:290
#17 0x00007f22a1a716e7 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0xdde0b0, argc=4, argv=0x7fff11ddd0d0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#18 0x00007f22a1a731e3 in gmx::CommandLineModuleManager::run (this=0x7fff11ddcfa8, argc=4, argv=0x7fff11ddd0d0)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#19 0x000000000040b0e8 in main (argc=5, argv=0x7fff11ddd0c8)
    at /home/jenkins/workspace/Matrix_PreSubmit_2019/cf01250f/gromacs/src/programs/gmx.cpp:60


Related issues

Related to GROMACS - Bug #2642: mdrun with SIMD triggers floating point exceptions (Closed)

Associated revisions

Revision 27afba40 (diff)
Added by Berk Hess almost 2 years ago

Fix FP exception in pull code

There was a harmless floating point exception in the pull COM parallel
reduction when a COM was not needed for a certain pull group. Now the
(unused) data is cleared.
Also made naming of the buffer consistent and introduced a constant
for the buffer stride instead of misusing DIM.

Fixes #2776

Change-Id: I4ef532f9a1ce6439eac94fa9b789d5259df38d88
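
The idea described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the revision: the struct `PullGroupSketch`, the function `fillComBuffer` and the stride value in `c_comBufferStride` are hypothetical names and values chosen for the example. It assumes a reduction buffer that is reused between steps, so slots belonging to groups that need no COM would otherwise keep whatever they held before.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stride: number of buffer slots reserved per pull group.
// A named constant is used here instead of reusing an unrelated one.
constexpr std::size_t c_comBufferStride = 3;

// Hypothetical description of one pull group (not a GROMACS type).
struct PullGroupSketch
{
    bool   needsCom;                // whether a COM is needed for this group
    double sums[c_comBufferStride]; // local partial sums, valid only if needsCom
};

// Fills the buffer that is later summed across ranks.
// Slots of groups that need no COM are set to zero, so the element-wise
// sum never operates on stale or uninitialized values.
void fillComBuffer(const std::vector<PullGroupSketch>& groups, std::vector<double>* buffer)
{
    // resize() keeps existing elements, so old contents survive unless overwritten.
    buffer->resize(groups.size() * c_comBufferStride);
    for (std::size_t g = 0; g < groups.size(); g++)
    {
        double* slot = buffer->data() + g * c_comBufferStride;
        for (std::size_t i = 0; i < c_comBufferStride; i++)
        {
            slot[i] = groups[g].needsCom ? groups[g].sums[i] : 0.0;
        }
    }
}
```

Clearing the unused slots costs a few stores per step and keeps the reduction uniform over the whole buffer; the alternative of reducing only the used slots would need a more complicated buffer layout.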

History

#1 Updated by Szilárd Páll almost 2 years ago

  • Private changed from Yes to No

#2 Updated by Mark Abraham almost 2 years ago

Great work! I can repro a segfault with such a two-rank run with OpenCL on my laptop, but not reliably. When it segfaults in the debugger, it's on the line of clCreateContext in nbnxn_gpu_create_context, which looks like another issue, and probably not in our code.

#3 Updated by Szilárd Páll almost 2 years ago

Update: pullAllReduce() (source:src/gromacs/pulling/pullutil.cpp#L96) on rank #0 at step 0 ends up with *data = 1.3812744052142923e-309 in the case where I obtained a core dump. I can't see how the summing (sometimes) ends up with this result, but instead of staring at it more I'll set up some runs to try to repro faster and probe a few changes that touched the code -- in particular as Magnus has a case that may be easier to test with.
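
For reference, the value quoted above lies below the smallest normal double (about 2.2e-308), so it is a subnormal number. A small sketch of how such a value could be classified is given here; the helper name `isSubnormal` is made up for illustration. A subnormal entry is consistent with the buffer holding memory that was never written, but this check alone does not show which operand actually raised the exception.

```cpp
#include <cmath>

// Returns true when v is a subnormal (denormal) double, i.e. nonzero
// but smaller in magnitude than the smallest normal double.
bool isSubnormal(double v)
{
    return std::fpclassify(v) == FP_SUBNORMAL;
}
```

Calling `isSubnormal(1.3812744052142923e-309)` on a platform with IEEE 754 doubles and default floating-point settings should return true.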

#4 Updated by Szilárd Páll almost 2 years ago

Couldn't reproduce outside of the Jenkins slave; will try elsewhere again later.

#5 Updated by Gerrit Code Review Bot almost 2 years ago

Gerrit received a related patchset '1' for Issue #2776.
Uploader: Berk Hess
Change-Id: gromacs~release-2019~I4ef532f9a1ce6439eac94fa9b789d5259df38d88
Gerrit URL: https://gerrit.gromacs.org/8748

#6 Updated by Berk Hess almost 2 years ago

  • Status changed from New to Resolved

#7 Updated by Mark Abraham almost 2 years ago

  • Status changed from Resolved to Closed

#8 Updated by Berk Hess almost 2 years ago

  • Related to Bug #2642: mdrun with SIMD triggers floating point exceptions added
