Bug #1871

segfaults in three regressiontests with NVIDIA OpenCL multi-GPU runs

Added by Szilárd Páll about 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version: -
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

The following tests segfault when executed on two GPUs (gmx mdrun -ntmpi 2 -gpu_id 01):

complex.nbnxn-ljpme-LB-geometric
complex.nbnxn_rzero
complex.position-restraints

backtrace:

#0  0x00007fcd3d189f0b in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#1  0x00007fcd3d189aa2 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#2  0x00007fcd3d184632 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#3  0x00007fcd426dbdca in sync_ocl_event (stream=0x7fcd34872d30, ocl_event=0x7fcd340831f0)
    at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp:331
#4  0x00007fcd426dcf7d in nbnxn_gpu_launch_cpyback (nb=0x7fcd34082db0, nbatom=0x7fcd34081350, flags=1013, aloc=0)
    at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/nbnxn_ocl/nbnxn_ocl.cpp:952
#5  0x00007fcd426d1fcc in do_force_cutsVERLET (fplog=0x0, cr=0x7fcd340071f0, inputrec=0x7fcd34007270, step=0, nrnb=0x7fcd3406bb60, wcycle=0x7fcd3406b640, top=0x7fcd349ce990, groups=0x7fcd340077a0, 
    box=0x7fcd3497db10, x=0x7fcd349d1580, hist=0x7fcd3497dc40, f=0x7fcd349d7ca0, vir_force=0x7fcd3c07f450, mdatoms=0x7fcd34333d90, enerd=0x7fcd34987f30, fcd=0x7fcd34069920, lambda=0x7fcd349884d0, graph=0x0, 
    fr=0x7fcd3406bf30, ic=0x7fcd34082060, vsite=0x0, mu_tot=0x7fcd3c07f5f0, t=0, field=0x0, ed=0x0, bBornRadii=1, flags=1013)
    at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/sim_util.cpp:1061
#6  0x00007fcd426d4e02 in do_force (fplog=0x0, cr=0x7fcd340071f0, inputrec=0x7fcd34007270, step=0, nrnb=0x7fcd3406bb60, wcycle=0x7fcd3406b640, top=0x7fcd349ce990, groups=0x7fcd340077a0, box=0x7fcd3497db10, 
    x=0x7fcd349d1580, hist=0x7fcd3497dc40, f=0x7fcd349d7ca0, vir_force=0x7fcd3c07f450, mdatoms=0x7fcd34333d90, enerd=0x7fcd34987f30, fcd=0x7fcd34069920, lambda=0x7fcd349884d0, graph=0x0, fr=0x7fcd3406bf30, 
    vsite=0x0, mu_tot=0x7fcd3c07f5f0, t=0, field=0x0, ed=0x0, bBornRadii=1, flags=1013) at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/gromacs/mdlib/sim_util.cpp:2009
#7  0x000000000041ac0e in do_md (fplog=0x0, cr=0x7fcd340071f0, nfile=35, fnm=0x7fcd34005a90, oenv=0x155c7d0, bVerbose=0, bCompact=1, nstglobalcomm=5, vsite=0x0, constr=0x7fcd3497d110, stepout=100, 
    ir=0x7fcd34007270, top_global=0x7fcd340076d0, fcd=0x7fcd34069920, state_global=0x7fcd340078e0, mdatoms=0x7fcd34333d90, nrnb=0x7fcd3406bb60, wcycle=0x7fcd3406b640, ed=0x0, fr=0x7fcd3406bf30, 
    repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, membed=0x0, cpt_period=15, max_hours=-1, imdport=0, Flags=7168, walltime_accounting=0x7fcd343338d0)
    at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/md.cpp:1078
#8  0x000000000042835b in mdrunner (hw_opt=0x7fcd3c07fc50, fplog=0x0, cr=0x7fcd340071f0, nfile=35, fnm=0x7fcd34005a90, oenv=0x155c7d0, bVerbose=0, bCompact=1, nstglobalcomm=-1, ddxyz=0x7fcd3c07fccc, 
    dd_node_order=1, rdd=0, rconstr=0, dddlb_opt=0x42da2a "auto", dlb_scale=0.800000012, ddcsx=0x0, ddcsy=0x0, ddcsz=0x0, nbpu_opt=0x42da2a "auto", nstlist_cmdline=0, nsteps_cmdline=-2, nstepout=100, 
    resetstep=-1, nmultisim=0, repl_ex_nst=0, repl_ex_nex=0, repl_ex_seed=-1, pforce=-1, cpt_period=15, max_hours=-1, imdport=0, Flags=7168)
    at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/runner.cpp:1282
#9  0x000000000042528e in mdrunner_start_fn (arg=0x155ddd0) at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/programs/mdrun/runner.cpp:186
#10 0x00007fcd426ea0fb in tMPI_Thread_starter (arg=0x15cb518) at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/external/thread_mpi/src/tmpi_init.c:397
#11 0x00007fcd426e1201 in tMPI_Thread_starter (arg=0x15cc550) at /mnt/workspace/Gromacs_Gerrit_5_1-test-opencl-slave/d27c5006/gromacs/src/external/thread_mpi/src/pthreads.c:234
#12 0x00007fcd40f07e9a in start_thread (arg=0x7fcd3c080700) at pthread_create.c:308
#13 0x00007fcd4041438d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()


Related issues

Related to GROMACS - Bug #1990: LJ-PME unstable with OpenCL (Closed)

Associated revisions

Revision ad8209b5 (diff)
Added by Mark Abraham almost 4 years ago

Fix some OpenCL issues

Added routine to convert error codes into more helpful
diagnostics. Called it in one place that needed some troubleshooting,
but an overhaul of OpenCL error handling is needed (in master branch).
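
A minimal sketch of what such a conversion routine could look like; it is not the code from the revision. The function name `sketchOclErrorString` is hypothetical, and the status constants are redefined locally (with the values from the Khronos `CL/cl.h` header) only so the example builds without an OpenCL SDK.

```cpp
#include <string>

// Status values as defined in the Khronos CL/cl.h header, copied here so the
// sketch compiles without an OpenCL SDK. Real code would include the header.
constexpr int SKETCH_CL_SUCCESS                 = 0;
constexpr int SKETCH_CL_DEVICE_NOT_FOUND        = -1;
constexpr int SKETCH_CL_OUT_OF_RESOURCES        = -5;
constexpr int SKETCH_CL_OUT_OF_HOST_MEMORY      = -6;
constexpr int SKETCH_CL_INVALID_VALUE           = -30;
constexpr int SKETCH_CL_INVALID_COMMAND_QUEUE   = -36;
constexpr int SKETCH_CL_INVALID_EVENT_WAIT_LIST = -57;
constexpr int SKETCH_CL_INVALID_EVENT           = -58;

// Hypothetical helper: map a numeric OpenCL status to a readable name.
// Only a handful of codes are covered; anything else falls through to a
// generic message that still reports the number.
std::string sketchOclErrorString(int status)
{
    switch (status)
    {
        case SKETCH_CL_SUCCESS: return "CL_SUCCESS";
        case SKETCH_CL_DEVICE_NOT_FOUND: return "CL_DEVICE_NOT_FOUND";
        case SKETCH_CL_OUT_OF_RESOURCES: return "CL_OUT_OF_RESOURCES";
        case SKETCH_CL_OUT_OF_HOST_MEMORY: return "CL_OUT_OF_HOST_MEMORY";
        case SKETCH_CL_INVALID_VALUE: return "CL_INVALID_VALUE";
        case SKETCH_CL_INVALID_COMMAND_QUEUE: return "CL_INVALID_COMMAND_QUEUE";
        case SKETCH_CL_INVALID_EVENT_WAIT_LIST: return "CL_INVALID_EVENT_WAIT_LIST";
        case SKETCH_CL_INVALID_EVENT: return "CL_INVALID_EVENT";
        default: return "unknown OpenCL status " + std::to_string(status);
    }
}
```

At a failing call site, the returned string would typically be appended to the error message so that the log names the status instead of showing only a number.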

Introduced new OpenCL control variable to indicate when there is a
non-local event upon which it is valid to wait, since it is an error
to wait upon an ocl_event that was never returned by an API call.
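
The guard described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the revision: every name here (`SketchEvent`, `SketchStreamState`, `haveNonlocalEvent`, and the three functions) is hypothetical, and the OpenCL event and wait call are replaced by plain C++ stand-ins so the example runs without an OpenCL SDK. Frame #3 of the backtrace, `sync_ocl_event`, is the kind of wait this flag is meant to protect.

```cpp
#include <stdexcept>

// Stand-in for an OpenCL event handle. returnedByApi models whether an
// enqueue call has actually produced this event.
struct SketchEvent
{
    bool returnedByApi = false;
};

// Stand-in for an event wait. The real call crashes or returns an error for
// an invalid event; here that case throws so the misuse is visible.
void sketchWaitForEvent(const SketchEvent& ev)
{
    if (!ev.returnedByApi)
    {
        throw std::logic_error("waited on an event that no API call returned");
    }
}

struct SketchStreamState
{
    SketchEvent nonlocalDone;
    // Control variable: true only while there is a non-local event upon
    // which it is valid to wait.
    bool haveNonlocalEvent = false;
};

// Models launching non-local work, which yields a waitable event.
void sketchEnqueueNonlocalWork(SketchStreamState* s)
{
    s->nonlocalDone.returnedByApi = true;
    s->haveNonlocalEvent          = true;
}

// Waits only when the flag says a valid event exists.
// Returns true if a wait was performed, false if it was skipped.
bool sketchSyncNonlocal(SketchStreamState* s)
{
    if (!s->haveNonlocalEvent)
    {
        return false; // nothing was enqueued, so there is nothing to wait for
    }
    sketchWaitForEvent(s->nonlocalDone);
    s->haveNonlocalEvent = false; // assumption: the flag is reset once per step
    return true;
}
```

Calling `sketchSyncNonlocal` before any work has been enqueued simply skips the wait, which is the behaviour the commit message asks for.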

Converted mdrun integration tests to rely on the improved automated
resource assignment in 5.1, because that copes better with the
limitations of the OpenCL implementation.

Worked around limitation where real MPI + OpenCL can't use more than
one GPU on a node, by disabling GPU support for that test case.

Fixed inappropriate use of mdrun -nt, where the number of thread-MPI
ranks was intended.

Updated install guide.

Fixes #1871

Change-Id: I11e6b2bdb6f7f91489f3ec0d671081d99661fa62

History

#1 Updated by Gerrit Code Review Bot about 4 years ago

Gerrit received a related patchset '5' for Issue #1871.
Uploader: Mark Abraham
Change-Id: I11e6b2bdb6f7f91489f3ec0d671081d99661fa62
Gerrit URL: https://gerrit.gromacs.org/5430

#2 Updated by Szilárd Páll almost 4 years ago

  • Status changed from New to In Progress

#3 Updated by Mark Abraham almost 4 years ago

  • Status changed from In Progress to Resolved

#4 Updated by Szilárd Páll almost 4 years ago

  • Status changed from Resolved to Closed

#5 Updated by Mark Abraham about 2 years ago

  • Related to Bug #1990: LJ-PME unstable with OpenCL added
