Bug #3305

Case gives FPE with Debug build when GPU update is enabled

Added by Alan Gray 6 months ago. Updated 6 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized
Description

Running with the GPU comm features plus GPU update, this case reliably crashes on step 7200 (see below). It runs fine with the GPU comm features and without GPU update. The tpr file is attached. It also runs fine with a Release build.

gmx mdrun -s adhd.tpr -ntomp 1 -pme gpu -nb gpu -ntmpi 4 -npme 1 -nsteps 10000 -v -notunepme -pin on -bonded gpu -noconfout
…
This run will default to '-update gpu' as requested by the GMX_FORCE_UPDATE_DEFAULT_GPU environment variable. GPU update with domain decomposition lacks substantial testing and should be used with caution.

Enabling GPU buffer operations required by GMX_GPU_DD_COMMS (equivalent with GMX_USE_GPU_BUFFER_OPS=1).

This run uses the 'GPU halo exchange' feature, enabled by the GMX_GPU_DD_COMMS environment variable.

This run uses the 'GPU PME-PP communications' feature, enabled by the GMX_GPU_PME_PP_COMMS environment variable.
...
step 7200, remaining wall clock time:     9 s          imb F  2% pme/F 0.18
Thread 11 "gmx" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fffe59c1700 (LWP 44360)]
0x0000555556822023 in extract_binr (b=0x7fffcc2940d0, index=131, nr=2, r=0x7fffcc0b0048)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdlib/rbin.cpp:156
156             r[i] = rbuf[i];
(gdb) backtrace
#0  0x0000555556822023 in extract_binr (b=0x7fffcc2940d0, index=131, nr=2,
    r=0x7fffcc0b0048)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdlib/rbin.cpp:156
#1  0x000055555682207b in extract_binr (b=0x7fffcc2940d0, index=131, r=...)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdlib/rbin.cpp:162
#2  0x0000555555e1115f in global_stat (gs=0x7fffcc0b4d40, cr=0x7fffcc00a120,
    enerd=0x7fffe59c0240, fvir=0x7fffe59be2e0, svir=0x7fffe59be310, mu_tot=0x7fffe59be2d4,
    inputrec=0x7fffe59bfb10, ekind=0x7fffe59bf740, constr=0x7fffcc0afcc0,
    vcm=0x7fffe59be1c0, nsig=3, sig=0x7fffe59be72c,
    totalNumberOfBondedInteractions=0x7fffe59bdfdc, bSumEkinhOld=true, flags=5080)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdlib/stat.cpp:347
#3  0x0000555555e02dd6 in compute_globals (gstat=0x7fffcc0b4d40, cr=0x7fffcc00a120,
    ir=0x7fffe59bfb10, fr=0x7fffcc00b330, ekind=0x7fffe59bf740, x=0x7fffcd98c000,
    v=0x7fffce7e3000, box=0x7fffcc294164, vdwLambda=0, mdatoms=0x7fffcc0afaf0,
    nrnb=0x7fffe59bfea0, vcm=0x7fffe59be1c0, wcycle=0x7fffcc1a6020, enerd=0x7fffe59c0240,
    force_vir=0x7fffe59be2e0, shake_vir=0x7fffe59be310, total_vir=0x7fffe59be340,
    pres=0x7fffe59be3a0, mu_tot=0x7fffe59be2d4, constr=0x7fffcc0afcc0,
    signalCoordinator=0x7fffe59be710, lastbox=0x7fffe59be430,
    totalNumberOfBondedInteractions=0x7fffe59bdfdc, bSumEkinhOld=0x7fffe59bdfa7,
    flags=5080)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdlib/md_support.cpp:244
#4  0x0000555555e83e4b in gmx::LegacySimulator::do_md (this=0x7fffcc0b21b0)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdrun/md.cpp:1403
---Type <return> to continue, or q <return> to quit---
#5  0x0000555555e7cafb in gmx::LegacySimulator::run (this=0x7fffcc0b21b0)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdrun/legacysimulator.cpp:73
#6  0x0000555555976875 in gmx::Mdrunner::mdrunner (this=0x7fffe59c0730)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdrun/runner.cpp:1599
#7  0x000055555597165c in gmx::mdrunner_start_fn (arg=0x7fffffff93f0)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/gromacs/mdrun/runner.cpp:374
#8  0x0000555555b4bf11 in tMPI_Thread_starter (arg=0x555558a71498)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/external/thread_mpi/src/tmpi_init.cpp:399
#9  0x0000555555b47080 in tMPI_Thread_starter (arg=0x555558268e70)
    at /gpfs/fs1/alang/Gromacs/gerrit-git/gromacs/src/external/thread_mpi/src/pthreads.cpp:235
#10 0x00007ffff7bbd6db in start_thread (arg=0x7fffe59c1700) at pthread_create.c:463
#11 0x00007fffed4cc88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) print rbuf[i]
$1 = 7.272166895370186e+38
(gdb) print r[i]
$2 = 2.42405563e+38
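
The two printed values hint at why the plain assignment at rbin.cpp:156 can trap. The following is a minimal sketch, not GROMACS code. It assumes that rbuf holds double and r holds single-precision values (as the number of printed digits suggests), and that the Debug build traps floating-point exceptions while the Release build does not. Under those assumptions, 7.27e+38 exceeds FLT_MAX (about 3.4e+38), so narrowing it to float raises an overflow. The helper name narrowingOverflows is hypothetical, and the check relies on IEEE 754 hardware behavior for the out-of-range conversion.

```cpp
#include <cassert>
#include <cfenv>
#include <cfloat>

// Hypothetical helper for illustration only (not part of GROMACS).
// Returns true if narrowing `value` from double to float raises FE_OVERFLOW.
// Assumes IEEE 754 arithmetic; the volatile qualifiers keep the compiler
// from folding the conversion away at compile time.
bool narrowingOverflows(double value)
{
    volatile double source = value;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float narrowed = static_cast<float>(source);
    (void)narrowed;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}
```

With the value from the gdb session, narrowingOverflows(7.272166895370186e+38) should report an overflow, while a value below FLT_MAX such as 2.42405563e+38 should not. If floating-point traps are enabled, the same conversion would deliver SIGFPE instead of silently setting a flag.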

adhd.tpr (7.06 MB) Alan Gray, 01/16/2020 05:54 PM

History

#1 Updated by Szilárd Páll 6 months ago

That sounds like the original issue I described in #3240: it always took at least a few thousand steps and only occurred with GPU DD and update.

#2 Updated by Alan Gray 6 months ago

I've tracked this down: it looks like the GPU constraints code is missing the update of the lincsd->rmsdData structure, which is required when (do_log || do_ene), which in this case is true on step 500.

In the CPU version of LINCS, we have (L2391 of lincs.cpp):

        if (computeRmsd || printDebugOutput || bWarn)
        {
            LincsDeviations deviations = makeLincsDeviations(*lincsd, xprime, pbc);

            if (computeRmsd)
            {
                // This is reduced across domains in compute_globals and
                // reported to the log file.
                lincsd->rmsdData[0] = deviations.numConstraints;
                lincsd->rmsdData[1] = deviations.sumSquaredDeviation;
            }
…

This code is triggered when computeRmsd is true, which is passed in to the fn as (bLog || bEner). The rmsdData values are later used in the global_stat() fn, L247 in stat.cpp:

irmsd = add_binr(rb, 2, rmsdData.data());

@Artem, I think this just needs to be added to the GPU version as well.
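
As a rough illustration of what the GPU path would need to supply on log/energy steps, the sketch below computes the two rmsdData entries on the host from constraint lengths. This is only a sketch under stated assumptions: the names ConstraintDeviations, computeDeviations and constraintRmsd are hypothetical, and the relative-deviation formula is an assumption rather than a copy of makeLincsDeviations.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the deviations struct used in the CPU snippet.
struct ConstraintDeviations
{
    int    numConstraints      = 0;
    double sumSquaredDeviation = 0.0;
};

// Accumulates squared relative deviations of the actual constraint lengths
// from their reference lengths (assumed formula, for illustration only).
ConstraintDeviations computeDeviations(const std::vector<double>& actualLengths,
                                       const std::vector<double>& referenceLengths)
{
    assert(actualLengths.size() == referenceLengths.size());
    ConstraintDeviations result;
    for (std::size_t i = 0; i < actualLengths.size(); ++i)
    {
        const double relative = actualLengths[i] / referenceLengths[i] - 1.0;
        result.sumSquaredDeviation += relative * relative;
        ++result.numConstraints;
    }
    return result;
}

// Turns the two reduced entries (count, sum of squares) into an RMSD.
// Returns 0 when there are no constraints, avoiding a division by zero.
double constraintRmsd(const std::array<double, 2>& rmsdData)
{
    return rmsdData[0] > 0.0 ? std::sqrt(rmsdData[1] / rmsdData[0]) : 0.0;
}
```

The point of the sketch is that both entries are written with well-defined values before they are reduced across domains; if they are left unset, whatever happens to be in memory is summed and later narrowed, which would be consistent with the huge value seen in the debugger.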

#3 Updated by Artem Zhmurov 6 months ago

@Alan, can you check if https://gerrit.gromacs.org/#/c/gromacs/+/15457/ fixes it? It won't be hard to add RMSD to the GPU version of LINCS, but I would rather remove the constraints RMSD altogether, just to simplify everything a little bit. I will talk to @Berk about it.

#4 Updated by Alan Gray 6 months ago

Artem Zhmurov wrote:

@Alan, can you check if https://gerrit.gromacs.org/#/c/gromacs/+/15457/ fixes it?

Afraid not, we still get the FPE with this patch.

#5 Updated by Alan Gray 6 months ago

Artem, would it be possible just to fall back to CPU update & constraints on energy steps? Then it should work correctly, I think.

#6 Updated by Artem Zhmurov 6 months ago

Alan Gray wrote:

Artem, would it be possible just to fall back to CPU update & constraints on energy steps? Then it should work correctly, I think.

This is a work-around, rather than a solution, I am afraid.